Select only first n elements in operator form?

Alas, no -- at least not as long as we wish to use Select as a descending Dataset/Query operator. The reason is that the query compiler will only recognize Select as a descending query operator when it has exactly one argument:

Needs["Dataset`"]

DescendingQ[Select[#==1&]]

(* True *)

Any attempt to curry other arguments will cause the expression to be interpreted as an ascending operator instead, as is the case for all functions that are not on the blessed list of descending operators:

op1 = Select[#, #==1&, 1000]&;
op2[c_, n_][l_] := Select[l, c, n]
op3[c_, n_] := Select[#, c, n] &
op4 = Curry[Select, {2, 3}];
op5 = Curry[{2, 3}][Select];

AscendingQ /@
  { op1
  , op2[#==1&, 1000]
  , op3[#==1&, 1000]
  , op4[#==1&, 1000]
  , op5[#==1&, 1000]
  }

(* {True, True, True, True, True} *)

We can see the difference using a small dataset:

ds = Range[10] // Dataset;

Contrast the action of the descending version of Select...

ds[Select[# < 4&], # + 100&]

(* dataset containing 101, 102, 103 *)

... with that of the ascending version:

ds[Select[#, # < 4 &, 3] &, # + 100 &]

(* empty dataset *)

In the descending version, the elements are first filtered and then have 100 added to them. In the ascending version, 100 is added to every element first and the filter runs afterwards, so nothing passes the # < 4 test.
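Outside of Dataset, the two evaluation orders correspond roughly to the following plain-list expressions. This is only a sketch of the ordering, not of how the query compiler works internally, and the names descendingLike and ascendingLike are made up for illustration:

```wolfram
(* Descending Select: filter on the way down, then apply # + 100 & to what is left *)
descendingLike = Map[# + 100 &, Select[Range[10], # < 4 &]];

(* Ascending Select: apply # + 100 & to every element first, then filter the result *)
ascendingLike = Select[Map[# + 100 &, Range[10]], # < 4 &, 3];

{descendingLike, ascendingLike}
```

The second expression comes back empty because every element is already above 100 by the time the test # < 4 is applied.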

We can often work around this situation by issuing consecutive queries:

ds[Select[#, # < 4 &, 3]&][All, # + 100&]

(* dataset containing 101, 102, 103 *)

or by using subqueries:

ds[Select[#, # < 4 &, 3] & /* Query[All, # + 100 &]]

(* dataset containing 101, 102, 103 *)

(A pedantic note: the All operators in these last queries are not strictly necessary given the listability of Plus, but they illustrate the general principle.)

It is a shame that the query compiler does not have some special treatment for the Curry operator that was introduced in version 11.3. It could be used to supplement and/or re-order the arguments to the various specially-recognized descending operators (especially Select, SelectFirst, and GroupBy). Perhaps in some future version...


If you want to keep the operator form, you could define a helper function to do this:

SelectSubset[crit_, n_][expr_] := Select[expr, crit, n]

Then:

r1 = dataset[SelectSubset[#a == 1 &, 1000]]; // RepeatedTiming
r2 = Select[dataset, #a == 1 &, 1000]; // RepeatedTiming

Normal@r1 === Normal@r2

{0.0029, Null}

{0.0030, Null}

True
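For readers without the original dataset at hand, here is a minimal self-contained sketch of the same helper on toy data. The sample dataset and its keys "a" and "i" are assumptions made up for illustration; only SelectSubset comes from the answer above:

```wolfram
(* Hypothetical sample data: 10 rows, with key "a" cycling through the values 1, 2, 3 *)
sample = Dataset[Table[<|"a" -> Mod[i, 3] + 1, "i" -> i|>, {i, 10}]];

(* Same helper as above: an operator form that forwards the limit n to Select *)
SelectSubset[crit_, n_][expr_] := Select[expr, crit, n]

(* Keep only the first 2 rows whose "a" equals 1 *)
firstTwo = Normal[sample[SelectSubset[#a == 1 &, 2]]]
```

Because SelectSubset is not one of the specially recognized descending operators, it is treated as an ordinary (ascending) function applied to the whole list of rows, which is all that is needed when it is the only operator in the query.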


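Of course, you can also just write the function out in full. Note that filtering everything and truncating afterwards (first line) is far slower than passing the limit to Select (second and third lines):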

dataset[Select[#a == 1 &]][;; 1000]; // RepeatedTiming // First
Select[dataset, #a == 1 &, 1000]; // RepeatedTiming // First
dataset[data \[Function] Select[data, #a == 1 &, 1000]]; // RepeatedTiming // First

0.53

0.0030

0.0027

(I don't know of a built-in method. But since it can be done like this, why should there be an extra built-in one? In the end, the operator forms are mere syntactic sugar, and that sugar has to be paid for with a lot of documentation. By the way, I usually don't hesitate to write longer code for better performance.)
