Why is zipped faster than zip in Scala?
None of the other answers mention the primary reason for the difference in speed, which is that the zipped
version avoids 10,000 tuple allocations. As a couple of the other answers do note, the zip
version involves an intermediate array, while the zipped
version doesn't, but allocating an array for 10,000 elements isn't what makes the zip
version so much worse—it's the 10,000 short-lived tuples that are being put into that array. These are represented by objects on the JVM, so you're doing a bunch of object allocations for things that you're immediately going to throw away.
The rest of this answer just goes into a little more detail about how you can confirm this.
Better benchmarking
You really want to be using a framework like jmh to do any kind of benchmarking responsibly on the JVM, and even then the responsibly part is hard, although setting up jmh itself isn't too bad. If you have a project/plugins.sbt
like this:
addSbtPlugin("pl.project13.scala" % "sbt-jmh" % "0.3.7")
And a build.sbt
like this (I'm using 2.11.8 since you mention that's what you're using):
scalaVersion := "2.11.8"
enablePlugins(JmhPlugin)
Then you can write your benchmark like this:
package zipped_bench
import org.openjdk.jmh.annotations._
@State(Scope.Benchmark)
@BenchmarkMode(Array(Mode.Throughput))
class ZippedBench {
val arr1 = Array.fill(10000)(math.random)
val arr2 = Array.fill(10000)(math.random)
def ES(arr: Array[Double], arr1: Array[Double]): Array[Double] =
arr.zip(arr1).map(x => x._1 + x._2)
def ES1(arr: Array[Double], arr1: Array[Double]): Array[Double] =
(arr, arr1).zipped.map((x, y) => x + y)
@Benchmark def withZip: Array[Double] = ES(arr1, arr2)
@Benchmark def withZipped: Array[Double] = ES1(arr1, arr2)
}
And run it with sbt "jmh:run -i 10 -wi 10 -f 2 -t 1 zipped_bench.ZippedBench"
:
Benchmark Mode Cnt Score Error Units
ZippedBench.withZip thrpt 20 4902.519 ± 41.733 ops/s
ZippedBench.withZipped thrpt 20 8736.251 ± 36.730 ops/s
Which shows that the zipped
version gets about 80% more throughput, which is probably more or less the same as your measurements.
Measuring allocations
You can also ask jmh to measure allocations with -prof gc
:
Benchmark Mode Cnt Score Error Units
ZippedBench.withZip thrpt 5 4894.197 ± 119.519 ops/s
ZippedBench.withZip:·gc.alloc.rate thrpt 5 4801.158 ± 117.157 MB/sec
ZippedBench.withZip:·gc.alloc.rate.norm thrpt 5 1080120.009 ± 0.001 B/op
ZippedBench.withZip:·gc.churn.PS_Eden_Space thrpt 5 4808.028 ± 87.804 MB/sec
ZippedBench.withZip:·gc.churn.PS_Eden_Space.norm thrpt 5 1081677.156 ± 12639.416 B/op
ZippedBench.withZip:·gc.churn.PS_Survivor_Space thrpt 5 2.129 ± 0.794 MB/sec
ZippedBench.withZip:·gc.churn.PS_Survivor_Space.norm thrpt 5 479.009 ± 179.575 B/op
ZippedBench.withZip:·gc.count thrpt 5 714.000 counts
ZippedBench.withZip:·gc.time thrpt 5 476.000 ms
ZippedBench.withZipped thrpt 5 11248.964 ± 43.728 ops/s
ZippedBench.withZipped:·gc.alloc.rate thrpt 5 3270.856 ± 12.729 MB/sec
ZippedBench.withZipped:·gc.alloc.rate.norm thrpt 5 320152.004 ± 0.001 B/op
ZippedBench.withZipped:·gc.churn.PS_Eden_Space thrpt 5 3277.158 ± 32.327 MB/sec
ZippedBench.withZipped:·gc.churn.PS_Eden_Space.norm thrpt 5 320769.044 ± 3216.092 B/op
ZippedBench.withZipped:·gc.churn.PS_Survivor_Space thrpt 5 0.360 ± 0.166 MB/sec
ZippedBench.withZipped:·gc.churn.PS_Survivor_Space.norm thrpt 5 35.245 ± 16.365 B/op
ZippedBench.withZipped:·gc.count thrpt 5 863.000 counts
ZippedBench.withZipped:·gc.time thrpt 5 447.000 ms
…where gc.alloc.rate.norm
is probably the most interesting part, showing that the zip
version is allocating over three times as much as zipped
.
Imperative implementations
If I knew that this method was going to be called in extremely performance-sensitive contexts, I'd probably implement it like this:
def ES3(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
val minSize = math.min(arr.length, arr1.length)
val newArr = new Array[Double](minSize)
var i = 0
while (i < minSize) {
newArr(i) = arr(i) + arr1(i)
i += 1
}
newArr
}
Note that unlike the optimized version in one of the other answers, this uses while
instead of a for
since the for
will still desugar into Scala collections operations. We can compare this implementation (withWhile
), the other answer's optimized (but not in-place) implementation (withFor
), and the two original implementations:
Benchmark Mode Cnt Score Error Units
ZippedBench.withFor thrpt 20 118426.044 ± 2173.310 ops/s
ZippedBench.withWhile thrpt 20 119834.409 ± 527.589 ops/s
ZippedBench.withZip thrpt 20 4886.624 ± 75.567 ops/s
ZippedBench.withZipped thrpt 20 9961.668 ± 1104.937 ops/s
That's a really huge difference between the imperative and functional versions, and all of these method signatures are exactly identical and the implementations have the same semantics. It's not like the imperative implementations are using global state, etc. While the zip
and zipped
versions are more readable, I personally don't think there's any sense in which the imperative versions are against the "spirit of Scala", and I wouldn't hesitate to use them myself.
With tabulate
Update: I added a tabulate
implementation to the benchmark based on a comment in another answer:
def ES4(arr: Array[Double], arr1: Array[Double]): Array[Double] = {
val minSize = math.min(arr.length, arr1.length)
Array.tabulate(minSize)(i => arr(i) + arr1(i))
}
It's much faster than the zip
versions, although still much slower than the imperative ones:
Benchmark Mode Cnt Score Error Units
ZippedBench.withTabulate thrpt 20 32326.051 ± 535.677 ops/s
ZippedBench.withZip thrpt 20 4902.027 ± 47.931 ops/s
This is what I'd expect, since there's nothing inherently expensive about calling a function, and because accessing array elements by index is very cheap.
To answer your second question:
Is there any more faster way to do element wise operation on a collection in Scala?
The sad truth is that despite it's conciseness, improved productivity, and resilience to bugs, is that functional languages aren't necessarily the most performant. In the OP's example, using higher order functions to define a projection to be executed against collections has overhead, and the tight loop amplifies this. As others have pointed out, additional storage allocation for intermediate and final results will also have overhead.
If performance is critical, although by no means universal, in cases like yours you can unwind concise functional code back into imperative equivalents in order to regain more direct control over memory usage and eliminating function call overheads.
In your specific example, the zipped
sums can be performed imperatively by pre-allocating a fixed, mutable array of correct size (since zip stops when one of the collections runs out of elements), and then adding elements at the appropriate index together (since accessing array elements by ordinal index is a very fast operation).
By example, adding a third function, ES3
to your test suite:
def ES3(arr :Array[Double], arr1 :Array[Double]) :Array[Double] = {
val minSize = math.min(arr.length, arr1.length)
val array = Array.ofDim[Double](minSize)
for (i <- 0 to minSize - 1) {
array(i) = arr(i) + arr1(i)
}
array
}
On my i7 I get the following response times:
OP ES Total Time Consumed:23.3747857Seconds
OP ES1 Total Time Consumed:11.7506995Seconds
--
ES3 Total Time Consumed:1.0255231Seconds
Even more heineous would be to do direct in-place mutation of the shorter of the two arrays, which would obviously corrupt the contents of this shorter array, so should only be implemented if the original arrays wouldn't be needed for further work by the caller:
def ES4(arr :Array[Double], arr1 :Array[Double]) :Array[Double] = {
val minSize = math.min(arr.length, arr1.length)
val array = if (arr.length < arr1.length) arr else arr1
for (i <- 0 to minSize - 1) {
array(i) = arr(i) + arr1(i)
}
array
}
Total Time Consumed:0.3542098Seconds
But obviously, direct mutation of arrays elements passed as parameters isn't in the spirit of Scala - this code smells of side effects.
To be honest, if you require this level of performance optimization in tight loops, you're likely better off writing these kinds of algorithms in Rust, Go or C.