Bad Float Counts? (Length[xs] != Total@Counts[xs])
Update:
Counts[data]
seems to be equivalent to AssociationThread @@ Transpose@Tally[data]
. The OP's problem arises because in constructing an Association
, keys are checked for uniqueness and later entries with a duplicate key replace earlier entries. (Simple example: Association[{1. -> 1, 1. -> 2}]
.) Uniqueness is determined by MatchQ
, I believe, which has problems discussed below in the original answer. The problem with SameQ
not strictly being an equivalence relation due nontransitivity is still an issue. This update principally clarifies the role of forming an association: It discards entries of the Tally
with duplicate keys, which results in an undercount.
Original answer:
Working with floating-point numbers is tricky. I'd say the most important, common issue is that rounding errors lead to different but close numbers that users wish would be treated the same. Introductory programming courses teach that comparing floats should be done with something like Abs[x - y] < $MyTolerance
. In Mathematica similar (but relative) tolerances are built into SameQ
and Equal
, which are controlled by the the internal system parameters Internal`$SameQTolerance
and Internal`$EqualTolerance
respectively (see also this; this question has similar issues as the OP). Perhaps less well known is that MatchQ
has a small tolerance like SameQ
but is slightly more restrictive. The most important difference is that MatchQ
is transitive but SameQ
is not.
These functions play various roles in pattern-matching and comparing numbers, and their issues affect functions like Counts[]
when applied to floating-point data. When constructing classes from data, as in Counts[]
, some reflection should lead one to think that using a comparison that is an equivalence relation, and therefore transitive, would be desirable. And if not transitive, it should be at least "locally transitive" on the actual data being used. By "locally transitive," I mean that the relation is transitive when restricted to the data set, even if it is not transitive on all floating-point numbers.
This seems to be the problem with the OP's example: SameQ
is not transitive on xs
. I say "seems" because I cannot check the internal workings of Counts[]
. It's possible that SameQ
is used to construct the keys and MatchQ
to tally the counts. It seems to me that Counts[]
uses SameQ
and CountsBy[]
uses MatchQ
to construct the keys, however the counting is done. Since SameQ
is not transitive, the order that the data is processed can make a difference. Nevertheless, the problem can be fixed by making SameQ
locally transitive on xs
.
The reason this comes up in this example is that in constructing the data xs
, the rounding drift amounts to 2 bits (2 ulp), which is one too many for SameQ
to be locally transitive.
Here is the best fix (that fixes SameQ
-- Chip Hurst points out that the tolerance can be as small as 0.55
, which is close to the value of Log10[4.] = 0.60..
that would be predicted by the observed rounding error):
Block[{Internal`$SameQTolerance = Internal`$EqualTolerance},
Total@Counts[xs]]
(* 100 *)
Another fix is to use MatchQ
via CountsBy[]
:
Total@CountsBy[xs, # &]
(* 100 *)
The CountsBy[]
association has several keys for equal numbers, but it does have the correct total. The first solution seems better because it has one key for each cluster of equal numbers. (It likely that in some applications one would like distinct floating-point numbers to have distinct entries; then something like CountsBy[xs, ToString@*FullForm]
would do the trick.)
Appendix
Some pictures showing the properties of SameQ
and MatchQ
on consecutive machine-precision floats:
Block[{x1},
x1 = Table[1 + n*$MachineEpsilon, {n, 0, 5}];
{Outer[Boole@*SameQ, x1, x1] // MatrixPlot[#, PlotLabel -> SameQ] &,
Outer[Boole@*MatchQ, x1, x1] // MatrixPlot[#, PlotLabel -> MatchQ] &} //
GraphicsRow
]
Total@Counts@Chop@xs
100
CountsBy[xs, N]
CountsBy[xs, N] // Total
100
Maybe the precision caused the problem during counting.
keys = Keys@Counts[xs];
res = xs /. Thread[keys -> ConstantArray[0, Length@keys]]//FullForm