How to implement a discrete data representation?

Making discrete Data an abstract data type

I would try to do something like the following:

(* write a constructor function for the data type 'discrete data' *)
discreteData[ 
      scale_?NumericQ , 
      bias_?NumericQ , 
      tally : { { _?NumericQ , _?NumericQ } .. }
   ] := discreteData @ Association[
   "scale" -> scale,
   "bias" -> bias,
   "tally" -> tally
]

(* define upvalues *)
discreteData /: Length[ d_discreteData ] := With[
    { assoc = d[[1]] },
    Plus @@ ( Last /@ assoc[["tally"]] )
]

discreteData /: Normal[ d_discreteData ] := With[
    { assoc = d[[1]] },
    assoc[["bias"]] + assoc[[ "scale" ]] Flatten [  
        ConstantArray @@@ assoc[["tally"]] 
    ]
]

(* If you prefer, you can make Dataset the printed form of this data type *)
discreteData/: Format[ d_discreteData ] := With[
    { assoc = d[[1]] },
    Dataset @ assoc
]

Working with discreteData

Then this will work out as follows: We can construct disctreteData using the constructor function.

data = discreteData[ 
    1.234 , (* scale *)
    5.678 , (* bias *)
    { {-5, 2}, {-4, 251}, {-3, 5941}, {-2, 60383}, 
      {-1, 241185}, {0, 383613}, {1, 241644}, {2, 61035}, {3, 5686}, 
      {4,259}, {5, 1} } (* tally *) 
]

Dataset

While discreteData are displayed as a Dataset, they internally are simply an association given an expressive Head:

InputForm @ data

discreteData[<|"scale" -> 1.234, "bias" -> 5.678, "tally" -> {{-5, 2}, {-4, 251}, {-3, 5941}, {-2, 60383}, {-1, 241185}, {0, 383613}, {1, 241644}, {2, 61035}, {3, 5686}, {4, 259}, {5, 1}}|>]

We can extract information comfortably using Mathematica's built-in functions (e.g. Length, Normal).

Length @ data

1 000 000

Normal @ data

{-0.492, -0.492, 0.742, 0.742, 0.742,...,10.614, 10.614, 10.614, 10.614, 11.848}


General

A late-to-the party post. To complement other answers, which already show most important ingredients for constructing data types with overloaded system functions, I'd like to show an approach that follows the same ideas a couple steps further.

While UpValues are the tool to use, one problem with them is the lack of control or introspection: they leave no middle ground between the functions they overload and actual implementations for a specific data type. In some cases, more control over the overloaded behavior might be beneficial.

The examples below are based on the code of a tiny framework, which has more didactic goal and is not meant to be directly production-usable. That code is at the end of this post.

Defining an interface

Preliminaries

For the purposes of this example, we will define an interface ListObject, which should support versions of Length, Normal, Part, Take, Drop, Map and Select, as well as a type-testing predicate. The actual objects implementing this interface will look like

ListObject[objectContainer[inner-data]]

One observation here is that the operations one can perform on a list-like object can be broadly divided into non-terminal (resulting in an object of the same type) and terminal (resulting in anything else). I have wired this notion into the example framework.

So, assuming that code below (at the end of the post) has been executed, we define, as our first step:

Scan[SetTerminal, {Length, Normal}];
Scan[SetNonTerminal, {Part, Take, Drop, Map, Select}] 

ListObject interface

We now define the interface symbol and testing predicate:

ClearAll[ListObject];
ListObject::notimpl = "The function `1` has not been implemented for backend `2`";
ListObject::badargs = "Function `1` received bad number and / or types of arguments: `2`";

ClearAll[ListObjectQ];
ListObjectQ[ListObject[backend_Symbol[___]]]:=True;
ListObjectQ[_]:=False;

The following code sets generic definitions for the interface functions:

Scan[
   SetGeneralDef[ListObject], 
   {ListLength, ListNormal, ListPart, ListTake, ListDrop, ListSelect}
];
Scan[SetGeneralDef[ListObject][#, 2]&, {ListMap}];

where new symbols starting with List are part of the ListObject interface.

Next, we define the dispatch:

Def[ListObject] @ Length @ ListObject -> ListLength
Def[ListObject] @ Normal @ ListObject -> ListNormal
(* Force terminal behavior on Part[lst, _Integer] *)
Def[ListObject, False] @ Part[ListObject, _Integer] -> ListPart
Def[ListObject] @ Part[ListObject, __] -> ListPart
Def[ListObject] @ Take[ListObject, __] -> ListTake
Def[ListObject] @ Drop[ListObject, __] -> ListDrop
Def[ListObject] @ Map[_, ListObject, ___] -> ListMap
Def[ListObject] @ Select[ListObject, __] -> ListSelect

At this point, we are done defining an interface.

Notes

The above uses syntactic sugar, defined in the code below. What really happens is that definitions has been generated, which you can test by executing

?ListObject

For example, here is how a definition for Map may look:

Map[ff$_,ListObject[backend$_],rr$:PatternSequence[___]]^:=
   ListObject[With[{type$=ContainerType[backend$]},
     Postprocessor[type$,Map,ListObject][ListMap[
        Preprocessor[type$,Map,ListObject][ff$,backend$,rr$]
     ]]]]

These definitions connect Map called on ListObject[type[contents]] with the internal symbol ListMap, and also adds pre- and post-processor.

Adding a sample implementation

As an example, let us define a list-like object that would implement certain operations in a lazy fashion (very simplistically, for the sake of this example). Basically, the list object will look like

MyList[data, functions]

where data is the original data (a list), and functions is an (initially empty) list representing a queue of functions to be executed on original data. Most interface functions will add some data transformation to that queue (rather than actually doing the computation), and when one calls Normal, only then all the accumulated functions in the queue get applied to data, in the order they stand in functions list.

Implementation

Here is the code:

ClearAll[MyList];
ApplyDelayed[f_, MyList[data_List, funs_List:{}]]:=
  MyList[data, Prepend[funs, f]];
MyList /: ListNormal[MyList[data_List, funs_List:{}]]:=
  Apply[Composition, funs] @ data;
MyList /: ListLength[lst_MyList]:=Length @ ListNormal @ lst;
MyList /: ListPart[MyList[data_List, funs_List:{}], p_Integer]:= 
  Apply[Composition, funs] @ data[[p]];
MyList /: ListPart[lst_MyList, spec__]:= ApplyDelayed[Part[#, spec]&, lst];
MyList /: ListTake[lst_MyList, spec_]:= ApplyDelayed[Take[#, spec]&, lst];
MyList /: ListDrop[lst_MyList, spec_]:= ApplyDelayed[Drop[#, spec]&, lst];
MyList /: ListMap[f_, lst_MyList]:=ApplyDelayed[Map[f], lst];
MyList /: ListSelect[lst_MyList, pred_]:= ApplyDelayed[Select[pred], lst];  

Examples

Here, for example, we square the list, and then select odd elements. You can see that function application simply adds operations to the queue, and everything gets evaluated only when we call Normal:

Select[Map[#^2 &, ListObject[MyList[Range[10]]]], OddQ]
Normal @ Select[Map[#^2 &, ListObject[MyList[Range[10]]]], OddQ]

(* 
   ListObject[MyList[{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, {Select[OddQ], Map[#1^2 &]}]]

   {1, 9, 25, 49, 81}
*)

A somewhat larger example:

Take[Drop[Select[Map[#^2 &, ListObject[MyList[Range[20]]]], OddQ], 2], 5]

(* 
   ListObject[
     MyList[
       {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, 
       {Take[#1, 5] &, Drop[#1, 2] &, Select[OddQ], Map[#1^2 &]}
     ]
   ]
*)

Here is one way to illustrate the work of the post-processor: the code below redefines the default (trivial) post-processor for all implemented functions for type MyList:

MyList /: Postprocessor[MyList, f_, _] := Function[Print[HoldForm[f -> #]]; #]

We can now execute the same code:

Normal @ Select[Map[#^2 &, ListObject[MyList[Range[10]]]], OddQ]
During evaluation of In[605]:= Map->MyList[{1,2,3,4,5,6,7,8,9,10},{Map[#1^2&]}]

During evaluation of In[605]:= Select->MyList[{1,2,3,4,5,6,7,8,9,10},{Select[OddQ],Map[#1^2&]}]

During evaluation of In[605]:= Normal->{1,9,25,49,81}
(* {1, 9, 25, 49, 81} *)

We now remove the extra definition we added:

MyList /: Postprocessor[MyList, f_, _] =.

One can (re) define post-processors more granularly, on the level of individual functions, for example:

ListMap /: Postprocessor[MyList, ListMap, _]:=...

One can obviously do more interesting things than just printing. For example, catching some errors / exceptions, or inserting non-trivial data post-processing.


Let's now see what happens if we try using some types, unknown to the system:

Normal[ListObject[UnknownList[stuff]]]
During evaluation of In[606]:= ListObject::notimpl: The function ListNormal has not been implemented for backend UnknownList

During evaluation of In[606]:= Throw::nocatch: Uncaught Throw[$Failed,Error[ListNormal]] returned to top level.
(* Hold[Throw[$Failed, Error[ListNormal]]] *)

This is in contrast to the usual behavior:

Normal[UnknownList[stuff]]

(* UnknownList[stuff] *)

which is much softer, but such softness isn't always good - often we actually may prefer to get an error instead of just silently returning an expression back.

What's the difference?

One may ask how the above construct fundamentally differs from what has been already suggested. The main difference is that here, we really defined an interface rather than an abstract data type. An interface does not really provide specific implementation, rather it provides a contract that all compatible implementations should meet.

The core mechanism here is the same as before - using UpValues. However, the above scheme gives one more control, essentially because we go though an intermediate step: instead of doing, for example, Map[MyList] -> MyList implementation, we do Map[ListObject[MyList]] -> ListMap[MyList] -> MyList implementation. So, the main features of this approach are then:

  • All code execution for all implemented types now goes through a single intermediate step (one of the List... functions).
  • There are ways to intercept the computations, facilitated by pre- and post-processors.
  • We can have a harder (or, generally, more customizable) default behavior in terms of error-handling and reporting.

I would not necessarily bother with such more complex construction for a single throw-away data type, but if I were planning to build an extensible framework where the users may implement new types, then having something like this may provide advantages for both myself as a framework author, and the users.

For example, should I want to change the behavior of some of the functions for all implementations of ListObject, all I have to do is to change the definition of the mapping of system symbol to interface symbol (e.g. Map to ListMap) - in the simplest case, by (re)defining pre- and / or post-processor. I would not need to touch any of the user implementations of ListObject interface for that.

The code for the interface generation

Interface-generation micro-framework

ClearAll[MsgFail];
SetAttributes[MsgFail, HoldRest];
MsgFail[f_, msg_:None, args___]:= (msg; Throw[$Failed, Error[f, args]]);

ClearAll[TerminalQ, NonTerminalQ];
TerminalQ[_] = NonTerminalQ[_]=False;

(* Wraps non-terminal operations with the container head *)
ClearAll[ObjectWrapper];
ObjectWrapper[obj_Symbol, op_, flag_]:= If[TrueQ @ flag, obj, Identity];
ObjectWrapper[obj_Symbol, op_?TerminalQ]:=Identity;
ObjectWrapper[obj_Symbol, op_?NonTerminalQ]:= obj;
ObjectWrapper[___]:=MsgFail[ObjectWrapper];

ClearAll[SetTerminal, SetNonTerminal];
SetTerminal[op_, flag_:True]:=TerminalQ[op]:=flag;
SetNonTerminal[op_, flag_:True]:=NonTerminalQ[op]:=flag;

(* Helper function for code generation for functions' definitions *)
ClearAll[defWrapped];
SetAttributes[defWrapped, HoldAll];
defWrapped[obj_, {f_, flag_:None},  TagSetDelayed[s_, lhs_, rhs_]]:=
  With[{wrapper = ObjectWrapper[obj, f, If[flag === None, Sequence @@ {}, flag]]},
      TagSetDelayed[s, lhs, wrapper @ rhs]; 
  ];

ClearAll[ContainerType];
ContainerType[c_Symbol[___]]:=c;
ContainerType[_]:=Null;

ClearAll[Preprocessor, Postprocessor];
Preprocessor[type_, function_, interface_]:=##&;
Postprocessor[type_, function_, interface_]:=##&;

(* Syntactic sugar for interface function dispatch definitions *)
ClearAll[Def];
SetAttributes[Def, HoldAll];
Def[obj_, flag:True|False|None:None]:=Function[code, Def[obj, code, flag], HoldAll];
Def /: Def[obj_, f_[obj_], flag_:None] ->  impl_:= 
  defWrapped[obj,{f, flag}, obj /: f[obj[backend_]]:= With[{type = ContainerType[backend]},
      Postprocessor[type, f, obj] @ impl[Preprocessor[type, f, obj] @ backend]
  ]];
Def /: Def[obj_, f_[obj_, args__], flag_:None] ->  impl_:= 
  defWrapped[obj, {f, flag}, obj /: f[obj[backend_], p:PatternSequence[args]]:=
      With[{type = ContainerType[backend]},
          Postprocessor[type, f, obj] @ impl[Preprocessor[type, f, obj][backend, p]]
      ]
  ];
Def /: Def[obj_, f_[fst_, obj_], flag_:None] ->  impl_:=
  defWrapped[obj, {f, flag}, obj /: f[ff:fst, obj[backend_]]:= 
    With[{type = ContainerType[backend]},
        Postprocessor[type, f, obj] @ impl[Preprocessor[type, f, obj][ff, backend]]
    ]
  ];
Def /: Def[obj_, f_[fst_, obj_, rest__], flag_:None] ->  impl_:=
  defWrapped[obj, {f, flag}, obj /: f[ff:fst, obj[backend_], rr:PatternSequence[rest]]:= 
    With[{type = ContainerType[backend]},
      Postprocessor[type, f, obj] @ impl @ Preprocessor[type, f, obj][ff, backend,rr]
    ]
  ]

(* General catch-all definitions for interface functions *)
ClearAll[SetGeneralDef];
SetGeneralDef[objsym_, clear_:True][f_Symbol, pos_Integer:1]:=
  With[{n = pos - 1},
    If[TrueQ[clear], ClearAll[f]];
    f[Repeated[_, {n}], backend_Symbol[___],args___]:= 
        MsgFail[f, Message[MessageName[objsym, "notimpl"], f, backend]];
    f[args___]:=MsgFail[f, Message[MessageName[objsym,"badargs"], f, {args}]];
  ];

Code for ListObject interface generation in one peace, for convenience

Execute the above code for interface generation first.

Scan[SetTerminal, {Length, Normal}];
Scan[SetNonTerminal, {Part, Take, Drop, Map, Select}] 

ClearAll[ListObject];
ListObject::notimpl = "The function `1` has not been implemented for backend `2`";
ListObject::badargs = "Function `1` received bad number and / or types of arguments: `2`";

ClearAll[ListObjectQ];
ListObjectQ[ListObject[backend_Symbol[___]]]:=True;
ListObjectQ[_]:=False;

Scan[
   SetGeneralDef[ListObject], 
   {ListLength, ListNormal, ListPart, ListTake, ListDrop, ListSelect}
];
Scan[SetGeneralDef[ListObject][#, 2]&, {ListMap}];

Def[ListObject] @ Length @ ListObject -> ListLength
Def[ListObject] @ Normal @ ListObject -> ListNormal
(* Force terminal behavior on Part[lst, _Integer] *)
Def[ListObject, False] @ Part[ListObject, _Integer] -> ListPart
Def[ListObject] @ Part[ListObject, __] -> ListPart
Def[ListObject] @ Take[ListObject, __] -> ListTake
Def[ListObject] @ Drop[ListObject, __] -> ListDrop
Def[ListObject] @ Map[_, ListObject, ___] -> ListMap
Def[ListObject] @ Select[ListObject, __] -> ListSelect

I'm just throwing a different idea in the room. Let's say you have a dataset of such structures:

ds = Dataset[{<|"scale" -> 1.234, "bias" -> 5.678, 
     "tally" -> {{-5, 2}, {-4, 251}, {-3, 5941}, {-2, 60383}, {-1, 
        241185}, {0, 383613}, {1, 241644}, {2, 61035}, {3, 5686}, {4, 
        259}, {5, 1}}|>,
    <|"scale" -> 1.234, "bias" -> 5.678, 
     "tally" -> {{-5, 2}, {-4, 251}, {-3, 5941}, {-2, 60383}, {-1, 
        241185}, {0, 383613}, {1, 241644}, {2, 61035}, {3, 5686}, {4, 
        259}, {5, 1}}|>,
    <|"scale" -> 1.234, "bias" -> 5.678, 
     "tally" -> {{-5, 2}, {-4, 251}, {-3, 5941}, {-2, 60383}, {-1, 
        241185}, {0, 383613}, {1, 241644}, {2, 61035}, {3, 5686}, {4, 
        259}, {5, 1}}|>
    }];

Would it be sufficient if you define your functions like this:

length = (Plus @@ (Last /@ #"tally")) &;
recover = (#"bias" + #"scale" Flatten[ConstantArray @@@ #"tally"]) &

and now, you can call

ds[All, length]

Mathematica graphics

or even

ds[1, recover]

Mathematica graphics