How to make stream reduce thread safe?
Ordinarily, "accumulator" is an English word that means: "You are completely hosed if you want parallelism". It's right there in the word: to accumulate, to gather over time. There is no way to do it right except to start from the beginning and apply accumulation until you are done.
But Java gets around this by adding two requirements:
- Associativity: `a X (b X c)` must produce the same result as `(a X b) X c`, where `X` is the accumulator function.
- Identity function: `ident X a` must be equal to `a`, where `ident` is the identity you pass to `reduce` and `X` is the accumulator function.
Let's use as an example the function `(a, b) -> a + b` and as identity `0`, which fulfills both of these requirements if your intent is to sum a list.
Java can parallelize this by just summing arbitrary terms and then summing the results of these. `[1, 5, 9, 12]` can be summed by first lopping the list in two, then handing these two sublists to threads to sum individually, and then summing the answers each thread provides. This implies that Java will start accumulation multiple times at arbitrary points in the stream, and will apply the identity as part of its accumulation any number of times, at arbitrary points, and that brings swift problems if your identity object is itself mutable.
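A minimal sketch of that safe case (the class and method names here are just illustrative):

```java
import java.util.List;

public class ReduceSum {
    static int sum(List<Integer> numbers) {
        // (a, b) -> a + b is associative and 0 is a true identity for it,
        // so the runtime is free to split the list, sum the chunks on
        // separate threads, and combine the partial sums in any order.
        return numbers.parallelStream().reduce(0, (a, b) -> a + b);
    }

    public static void main(String[] args) {
        System.out.println(ReduceSum.sum(List.of(1, 5, 9, 12))); // prints 27
    }
}
```

Whether the stream is split into two chunks or ten, every grouping of `+` yields the same total, which is exactly what the two requirements guarantee.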
There's basically no way to combine the notion of a mutable identity object and Java's `reduce` function. It is fundamentally not designed to work that way.
Contrast with the sum example: the `(a, b) -> a + b` accumulator modifies neither `a` nor `b`; instead, they are combined into a newly created third value, and that's how you should use this method.
Contrast with `foldLeft` from certain other languages, which requires neither that `accumulatorFunction(ident, A)` be equal to `A` nor associativity, but which therefore, by definition, cannot be parallelized at all. That `foldLeft` can be used with mutable state. For example, here is an implementation of summing using a `foldLeft`, in pseudocode (note that `new int[1]` is used here as a mutable integer):
int sum = stream.foldLeft(new int[1], (int[] a, int b) -> a[0] += b)[0];
This notion (where the LHS of your accumulator function is always the same thing, namely your identity object, being modified to integrate each value in the stream as you move along it) is not compatible with Java's `reduce`, and as far as I can recall, Java has no (easy) way to do this kind of thing to a stream.
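If you genuinely want that strictly left-to-right, mutate-as-you-go style in Java, the honest sketch is a plain loop rather than a stream (names here are illustrative):

```java
import java.util.List;

public class LoopFold {
    static int sum(List<Integer> numbers) {
        // A loop gives foldLeft semantics: a single accumulator,
        // updated strictly in order, with no identity rule to uphold
        // and no possibility of parallel splitting.
        int acc = 0;
        for (int b : numbers) {
            acc += b;
        }
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(LoopFold.sum(List.of(1, 5, 9, 12))); // prints 27
    }
}
```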
Thus: it's worse! 'Thread safe' isn't good enough; it needs to be immutable. Once it is immutable, it is trivially thread safe.
is it enough just to make identity object immutable and return a new instance upon each reduce?
That's not just 'good enough'; that's more or less the only sane way to use `reduce`.
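As a sketch of that sane way (the `Stats` record is hypothetical, purely to illustrate the pattern): keep the identity immutable and have both the accumulator and the combiner return fresh instances:

```java
import java.util.List;

public class ImmutableReduce {
    // Immutable value type: every operation returns a new instance.
    record Stats(int count, int sum) {
        Stats add(int x) { return new Stats(count + 1, sum + x); }
        Stats combine(Stats other) {
            return new Stats(count + other.count, sum + other.sum);
        }
    }

    static Stats stats(List<Integer> numbers) {
        // new Stats(0, 0) is a true identity for the combiner, and no
        // shared object is ever mutated, so parallel execution is safe.
        return numbers.parallelStream()
                .reduce(new Stats(0, 0), (acc, x) -> acc.add(x), Stats::combine);
    }

    public static void main(String[] args) {
        Stats s = ImmutableReduce.stats(List.of(1, 5, 9, 12));
        System.out.println(s.count() + " " + s.sum()); // prints 4 27
    }
}
```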
This is covered by the documentation, though not directly; it is implied:
The identity value must be an identity for the accumulator function. This means that for all t, accumulator.apply(identity, t) is equal to t.
As soon as `identity` is modified, like you say, even if in a thread-safe way, the rule above is violated; thus there are no guarantees of the expected result.
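A sketch of that violation, reusing the `new int[1]` mutable-integer trick from above; even sequentially, the first `reduce` corrupts the "identity", after which it is no longer an identity at all:

```java
import java.util.List;

public class MutatedIdentity {
    static int reduceInto(int[] ident, List<Integer> numbers) {
        // The accumulator writes into the identity object itself and
        // returns it, so every pass mutates the shared ident array.
        return numbers.stream()
                .reduce(ident, (a, b) -> { a[0] += b; return a; }, (a, b) -> a)[0];
    }

    public static void main(String[] args) {
        int[] ident = new int[1];
        List<Integer> numbers = List.of(1, 5, 9, 12);
        int first = MutatedIdentity.reduceInto(ident, numbers);
        // ident now holds 27, so accumulator.apply(ident, t) != t:
        // the documented rule is broken, and a second run (or a
        // parallel one) silently produces a wrong answer.
        int second = MutatedIdentity.reduceInto(ident, numbers);
        System.out.println(first + " " + second); // prints 27 54
    }
}
```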
For the second question the answer is slightly more involved. You do not have to make the `identity` immutable, as long as no one abuses it by modifying its internal state. Of course, making it immutable helps a lot in that regard.