Dataset and Select with a SmoothKernelDistribution
Short, Short Version
A General::stop message will cause a dataset query to fail if some other explicitly quieted message occurs three or more times. For example:
Dataset[1][Quiet[Do[Message[f::ivar, #], 2]; #, f::ivar] &]
(* 1 -- success: only two quieted f::ivar messages generated *)
Dataset[1][Quiet[Do[Message[f::ivar, #], 3]; #, f::ivar] &]
(* Failure[...] -- third f::ivar message triggers a General::stop and failure *)
Short Version
The behaviour we see is due to an unfortunate combination of circumstances:
- Generating the PDF of a SmoothKernelDistribution generates some messages which are internally suppressed using Quiet.
- Distribution results are routinely cached so that the messages in question are only generated during the first evaluation.
- FailureAction -> "Abort" is the default setting for Dataset queries and it causes queries to fail if any unexpected messages are generated. A message is unexpected if it has not been explicitly quieted by code executed within the body of the query.
- PDF[...] is generating so many Part::partd messages that the system issues a General::stop message. Normally, a General::stop receives special treatment and is quieted along with the messages that generated it, but the hypersensitive FailureAction machinery classifies it as an unexpected message and fails the query.
Work-around
A work-around is to turn off the special message handling from the Dataset query:
ds[Select[func] /* Length, FailureAction -> None]
(* 397 *)
Is It A Bug?
Yes, I think so. Given that in normal circumstances General::stop is only generated after some other message has appeared for the third time, the bug would be corrected if General::stop was always presumed to be quieted by GeneralUtilities`MessageQuietedQ (see discussion below).
Spelunking Report (current as of version 11.0.1)
... be prepared for too much complicated trivia ...
Let's begin by finding a minimal set of steps to reproduce the behaviour:
d = SmoothKernelDistribution[{{1.,2.}}];
Dataset[{1., 2.}][PDF[d, #]&]
(* Failure[...] *)
PDF[d, {1., 2.}]
(* 0.390901 *)
Dataset[{1., 2.}][PDF[d, #]&]
(* 0.390901 *)
d = SmoothKernelDistribution[{{1.,2.}}];
Dataset[{1., 2.}][PDF[d, #]&]
(* Failure[...] *)
Notice how the bad behaviour disappears after performing a non-query evaluation and how it reappears after reassigning a newly created distribution to d. (Bonus trivia: using the operator form PDF[d] always fails, as it appears to defeat the cache.)
PDF Evaluation Generates Messages
By blocking the action of Quiet, we can observe that PDF[d, {1., 2.}] generates messages that would normally be quieted:
d = SmoothKernelDistribution[{{1.,2.}}];
Block[{Quiet}, PDF[d, {1., 2.}]]
(*
>> Part::partd: Part specification $x[[1]] is longer than depth of object.
>> Part::partd: Part specification $x[[2]] is longer than depth of object.
>> Part::partd: Part specification $x[[1]] is longer than depth of object.
>> General::stop: Further output of Part::partd will be suppressed during this calculation.
0.390901
*)
... but only the first time:
Block[{Quiet}, PDF[d, {1., 2.}]]
(* 0.390901 *)
We can use an alternative technique to view considerably more details about the generated messages and their evaluation contexts:
d = SmoothKernelDistribution[{{1.,2.}}];
Internal`HandlerBlock[{"Message",Print[Internal`QuietStatus[]]&}, PDF[d, {1., 2.}]]
(*
{Global->Unquiet,Off->{Part::partd},On->{},Stack->{{1458,{Part::partd},{}}},MessageList->{{Part::partd,1458}},Check->None}
{Global->Unquiet,Off->{Part::partd},On->{},Stack->{{1458,{Part::partd},{}}},MessageList->{{Part::partd,1458},{Part::partd,1458}},Check->None}
...
{Global->Unquiet,Off->{Part::partd},On->{},Stack->{{1458,{Part::partd},{}}},MessageList->{{Part::partd,1458},{Part::partd,1458},{Part::partd,1458},{General::stop,1458}},Check->None}
{Global->Unquiet,Off->{Part::partd},On->{},Stack->{{1458,{Part::partd},{}}},MessageList->{{Part::partd,1458},{Part::partd,1458},{Part::partd,1458},{General::stop,1458}},Check->None}
0.390901
*)
We will shortly see that the Stack values from Internal`QuietStatus are relevant.
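To capture these messages programmatically rather than printing them, here is a minimal sketch along the same lines. It reuses the undocumented Internal`HandlerBlock from above; the assumption (suggested by the printed output, not by any official documentation) is that the handler is called once per message, quieted or not. The name captured is purely illustrative.

```wolfram
(* Sketch: collect every message raised inside a Quiet, General::stop included. *)
(* Assumption: the "Message" handler fires even for quieted messages, as the    *)
(* QuietStatus printout above suggests. `captured` is an illustrative name.     *)
captured = Flatten[
   Last @ Reap[
     Internal`HandlerBlock[{"Message", Sow},
      Quiet[Do[Message[f::ivar, 1], 3], f::ivar]]],
   1];

Length[captured]   (* number of handler calls *)
captured           (* inspect the raw handler arguments *)
```

If the reasoning in this answer holds, the list should include an entry for General::stop after the third f::ivar, even though only f::ivar was named in the Quiet.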
Distribution Calculations Use A Cache
The presence of a cache is strongly hinted at by the fact that the messages disappear after the first evaluation of PDF but reappear if we regenerate the distribution itself. We can confirm this guess by tracing calls to StoreDataDistributionExpression:
On[Statistics`DataDistributionUtilities`StoreDataDistributionExpression]
System`Dump`DeactivateReadProtected[{DataDistribution, PDF}
, Print["#### define"]; d = SmoothKernelDistribution[{{1.,2.}}]
; Print["#### pdf1"]; PDF[d,{1.,2.}]
; Print["#### pdf2"]; PDF[d,{1.,2.}]
]
Off[]
(* Output:
#### define
... messages showing calls to StoreDataDistributionExpression ...
#### pdf1
... messages showing more calls to StoreDataDistributionExpression ...
#### pdf2
... no messages! ...
*)
Observe how the initial distribution generation tucked some values into a cache, as did the first PDF evaluation. But the second PDF evaluation did not store into the cache at all.
FailureAction Treats Messages As Query Failure
Now we can move on to FailureAction in dataset queries. As noted earlier, the default action is to cause a query to fail should any messages be generated:
Dataset[1][(Message[f::ivar, #]; #) &]
(* Failure[...] *)
But quieted messages do not (normally) cause failure:
Dataset[1][Quiet[Message[f::ivar, #]; #] &]
(* 1 *)
We can also turn off the failure processing:
Dataset[1][(Message[f::ivar, #]; #) &, FailureAction -> None]
(*
>> f: 1 is not a valid variable
1
*)
Unexpected Messages Trigger FailureAction
But if the previous example shows that quieted messages do not trigger FailureAction, what is so special about the quieted messages we observed for this:
d = SmoothKernelDistribution[{{1.,2.}}];
Dataset[{1., 2.}][PDF[d, #]&]
(* Failure[...] *)
To find the answer, we need to closely inspect a voluminous trace of the evaluation (not reproduced here). Deep in the bowels of that trace, we find that the relevant functions are:
Needs["GeneralUtilities`"]
{ EvaluateChecked, GeneralUtilities`Failure`PackagePrivate`CheckedHandler
, MessageStackID, MessageQuietedQ
} // Scan[PrintDefinitionsLocal]
In short, EvaluateChecked begins by setting up an environment which will catch and handle any messages. MessageStackID is used to identify the outermost stack boundary of that environment. Should a message appear, CheckedHandler will inspect the message to determine if it should be ignored. Any message that is not MessageQuietedQ will cause a failure. MessageQuietedQ uses the Internal`QuietStatus[] output we saw earlier to see if the message has been explicitly quieted by code within the bounded stack environment.
Putting It All Together
In the case at hand, PDF[...] explicitly quiets the message Part::partd. But it does not explicitly quiet General::stop.
And so, to finally reach the end of our long-winded shaggy dog story... it is the General::stop message that causes the query to fail. That message appears only the first time the PDF is evaluated for a given distribution, because later evaluations reuse the cached result.
Workaround with Efficiency Improvement
Another approach, which also makes your code more efficient, is to create the PDF of the SmoothKernelDistribution only once.
When written as
func = PDF[kd, {#"a", #"b"}] >= kernelProbability &;
ds[Select[func] /* Length]
kd is resolved into its PDF for each row in the Dataset because, as Attributes tells us, Function has attribute HoldAll. Here it is not too complicated for PDF to resolve kd into its PDF so you don't really notice this.
However, we can resolve the PDF of kd only once and then evaluate it repeatedly in the Query.
ClearAll[x, y];
pdf[x_, y_] = PDF[kd, {x, y}];
func1 = pdf[#"a", #"b"] >= kernelProbability &;
ds[Select[func1] /* Length]
pdf is Set to the resolved PDF of kd outside of the Select. Now for each row the resolved PDF (pdf) is evaluated. The overhead of resolving it for each row is removed. This not only makes your code more efficient but bypasses the bug as well.
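To see the difference outside of a Dataset, here is a rough sketch that compares the two approaches on synthetic data. The original kd and ds are not shown in this answer, so data, pts, direct and resolved are made-up stand-ins, and the timings will depend on your machine and version.

```wolfram
(* Sketch with made-up data; `data`, `pts`, `direct`, `resolved` are illustrative. *)
SeedRandom[1];
data = RandomVariate[NormalDistribution[], {200, 2}];
kd = SmoothKernelDistribution[data];
pts = RandomReal[{-2, 2}, {50, 2}];

(* Resolve the PDF of kd afresh for every point. *)
{tDirect, direct} = AbsoluteTiming[PDF[kd, #] & /@ pts];

(* Resolve the PDF once with Set (=), then only evaluate it per point. *)
ClearAll[x, y, pdf];
pdf[x_, y_] = PDF[kd, {x, y}];
{tResolved, resolved} = AbsoluteTiming[pdf @@@ pts];

{tDirect, tResolved}          (* compare the two timings *)
Max[Abs[direct - resolved]]   (* both approaches should give the same densities *)
```

Note that using SetDelayed (:=) instead of Set (=) in the definition of pdf would put the resolution back inside every call, so the immediate assignment is the important part.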
In general be aware of the cost of your expression when doing row-by-row evaluations.
Hope this helps.