How to work with a Dataset?

There are two issues under discussion: 1) the distinct dataset visualizations for the same data and 2) ways to update dataset subelements in place. We will discuss these separately.

Distinct Dataset Visualizations

The way a dataset is displayed is sensitive to the data type of the dataset. That type, in turn, is sensitive to the history of the dataset. This is discussed at length in (143551). For the case at hand, we can see how the data type evolves with each AppendTo operation:

Needs["Dataset`"]
Needs["TypeSystem`"]

{ db = Dataset[{}]
, AppendTo[db,<|"a" -> 1, "b" -> 2|>]
, AppendTo[db,<|"a" -> 2, "b" -> 5|>]
, AppendTo[db,<|"x" -> 2, "y" -> 5|>]
} // Unevaluated // Map[{#, GetType[#]}&] // Grid[#, Frame->All, Alignment->Left]&

[Figure: data type evolution of a dataset]

The principal data type is a Vector of Assoc. The last row shows how adding the incompatible keys "x" and "y" switched the key type from Enumeration to the generic AnyType.

Now contrast this with db2:

db2 = Dataset[{<|"a" -> 1, "b" -> 2|>, <|"a" -> 2, "b" -> 5|>}]

[Figure: simple dataset]

db2 // GetType
(* Vector[Struct[{"a", "b"}, {Atom[Integer], Atom[Integer]}], 2] *)

The principal data type is now a Vector of Struct. A "struct" represents the case when the dataset is known to contain associations of consistent type. The type system deduced this at the time that db2 was constructed.

In the case of db, which was built incrementally, the type system infers the final data type from a combination of the initial data type and the type transformations of any applied operators (e.g. AppendTo). Such type inference is generally less specific than the type deduction that occurs at construction time. We can use Dataset as a query operator to force reconstruction of a dataset and thereby deduce its data type anew:

db = Dataset[{}];
AppendTo[db, <|"a" -> 1, "b" -> 2|>];
AppendTo[db, <|"a" -> 2, "b" -> 5|>]

[Figure: incrementally constructed dataset]

db // GetType
(* Vector[Assoc[Atom[Enumeration["a", "b"]], Atom[Integer], 2], 2] *)

db = db[Dataset]

[Figure: reconstructed dataset]

db // GetType
(* Vector[Struct[{"a", "b"}, {Atom[Integer], Atom[Integer]}], 2] *)

Updating Subelements of Datasets

There are presently very few ways to update a Dataset in place. See, for example, the discussion in (54491) or the work-around sketched in (141916). In particular, the kinds of update contemplated in the question are not presently supported.

The way to achieve such alterations presently is through query operators. For example, we can append a new key "c" to element 1:

db3 = db2[{1 -> Append["c" -> 7]}]

[Figure: added "c" key]

Note that by adding the key "c" to only one of the associations, the data type switched from Struct to Assoc and gave us the vertical key/value pair visualization we saw earlier. If we had added "c" to all associations, we would have retained the Struct tabular visualization:

db3 = db2[{1 -> Append["c" -> 7], 2 -> Append["c" -> 8]}]

[Figure: tabular representation after adding keys]

The closest thing to updating a dataset in place is expressed as db = db[...ops...].
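
As a minimal sketch of this rebinding idiom, assume a fresh symbol ds (a hypothetical name, not taken from the question) that holds the same two rows as db2. The query below appends a computed key to every row, and the assignment then points ds at the resulting new dataset; the original dataset is not modified.

```wolfram
(* ds is a hypothetical name used only for this illustration *)
ds = Dataset[{<|"a" -> 1, "b" -> 2|>, <|"a" -> 2, "b" -> 5|>}];

(* apply a query that appends a computed key "c" to every row,
   then rebind ds to the resulting (new) dataset *)
ds = ds[All, Append[#, "c" -> #a + #b] &];

ds // Normal
(* {<|"a" -> 1, "b" -> 2, "c" -> 3|>, <|"a" -> 2, "b" -> 5, "c" -> 7|>} *)
```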

It is possible to update a simple list of associations in place:

$list = db2 // Normal
(* {<|"a" -> 1, "b" -> 2|>, <|"a" -> 2, "b" -> 5|>} *)

$list[[1, "c"]] = 7;
$list
(* {<|"a" -> 1, "b" -> 2, "c" -> 7|>, <|"a" -> 2, "b" -> 5|>} *)
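
If the tabular view is still wanted after such updates, one option is to wrap the modified list in Dataset again. The following is only a sketch under that assumption; db4 is a hypothetical name introduced for the illustration.

```wolfram
$list = {<|"a" -> 1, "b" -> 2|>, <|"a" -> 2, "b" -> 5|>};

$list[[1, "c"]] = 7;   (* adds key "c" to the first association in place *)
$list[[2, "b"]] = 6;   (* overwrites an existing value in place *)

db4 = Dataset[$list];  (* db4 is a hypothetical name *)
db4 // Normal
(* {<|"a" -> 1, "b" -> 2, "c" -> 7|>, <|"a" -> 2, "b" -> 6|>} *)
```

Since db4 is constructed from scratch, its data type is deduced at construction time rather than inferred from a history of operations.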

Closing Comments

Beware that performing large numbers of incremental changes to datasets will likely get progressively slower. This is the dataset analog to repeatedly applying AppendTo to a list, a strategy which exhibits a slow-down proportional to the square of the length of the list. The dataset infrastructure is best suited for operators that are applied to significant subsets of the dataset all at once (e.g. one or more complete columns).
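
To illustrate the difference, here is a rough sketch that builds the same rows two ways: incrementally with AppendTo, and all at once with Table followed by a single Dataset construction. The names n, slow, rows and ds, and the row contents, are assumptions made up for the example.

```wolfram
n = 1000;  (* hypothetical row count *)

(* incremental: each AppendTo copies the list built so far *)
slow = {};
Do[AppendTo[slow, <|"a" -> i, "b" -> i^2|>], {i, n}];

(* all at once: generate the rows, then construct the dataset once *)
rows = Table[<|"a" -> i, "b" -> i^2|>, {i, n}];
ds = Dataset[rows];

slow === rows    (* True: both approaches yield the same rows *)

(* an operator applied to a complete column *)
ds[Total, "b"]
```

Wrapping each variant in AbsoluteTiming for increasing n is one way to see how the two approaches scale on a given machine.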

The operation of the dataset type system is discussed in (89080). Choosing between datasets or associations is discussed in (87360).


Thank you both for the comment and the answer. They were very helpful. From them, and the postings referenced in the answer by @WReach, I came to the conclusion that a list of associations would best serve my purpose.

Here is what I came up with for the functionality I need. (I deleted the output because the association forms aren't very readable outside the front end.)

(* an empty database *)
db={};

(* add a record with a single key *)
AppendTo[db,<|"a"->1|>]

(* add a second record *)
AppendTo[db,<|"a"->2|>]

(* and a third *)
AppendTo[db,AssociationThread[{"a","b","c"},{5,7,9}]]

(* add a key-value pair to the first record *)
AssociateTo[db[[1]],"b"->5]

(* modify a value *)
db[[1,"b"]]=7;db

(* total the "b" values; the Nothing default drops records that lack the key *)
Total@Lookup[db,"b",Nothing]

(* select records based on key value *)
Select[db,#["b"]==7&]
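
Because records in such a list may lack a key, it can help to make the handling of missing keys explicit. The sketch below assumes db is in the state the steps above leave it in, and uses KeyExistsQ and an explicit Lookup default as one possible approach rather than the only one.

```wolfram
(* db as left by the steps above *)
db = {<|"a" -> 1, "b" -> 7|>, <|"a" -> 2|>, <|"a" -> 5, "b" -> 7, "c" -> 9|>};

(* keep only the records that actually contain key "b" *)
Select[db, KeyExistsQ[#, "b"] &]
(* {<|"a" -> 1, "b" -> 7|>, <|"a" -> 5, "b" -> 7, "c" -> 9|>} *)

(* total the "b" values, counting a missing key as 0 *)
Total@Lookup[db, "b", 0]
(* 14 *)

(* wrap in Dataset only when a formatted view is wanted *)
Dataset[db]
```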
