Generating a frequency map for a string in Scala
Extending Axel's answer.
Your `groupBy` solution is already functional. There's just a tiny correction which could make it cleaner:
str.groupBy(_.toChar).mapValues(_.size)
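Assuming `str` is `"hello"` (the example used below), this yields `Map(e -> 1, h -> 1, l -> 2, o -> 1)`.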
Scala's alternative to `inject` is `foldLeft`, `foldRight`, `reduce`, or `reduceOption`, depending on how you use it. The way you've used `inject` in Ruby is not functional, since your solution is based on mutating `h`, and in the functional world mutability is a "no-no". Here's how you'd write a solution close to your `inject`, but in functional style, in Scala:
str.foldLeft( Map[Char, Int]() ){ (m, c) => m + (c -> (m.getOrElse(c, 0) + 1)) }
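For example, with `str = "hello"` the fold threads the accumulator through `Map()`, `Map(h -> 1)`, `Map(h -> 1, e -> 1)`, `Map(h -> 1, e -> 1, l -> 1)`, `Map(h -> 1, e -> 1, l -> 2)` and finally `Map(h -> 1, e -> 1, l -> 2, o -> 1)`.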
Obviously, `groupBy` looks much better.
1) What does `p` mean?
`groupBy` takes a function which maps an element to a key of type `K`. When invoked on some collection `Coll`, it returns a `Map[K, Coll]` which contains mappings from each key of type `K` to all the elements which mapped to the same key.
So, in your case, `str.groupBy(_.toChar)` yields a map from each key `k` (which is a character) to a string containing all the elements (characters) `c` such that `k == c.toChar`.
You get this:
Map(e -> "e", h -> "h", l -> "ll", o -> "o")
A `Map` is an iterable of pairs of keys and values. In this case, each pair is a character and a string of elements. Calling the `map` operation on a `Map` means mapping over these pairs: `p` is a pair where `p._1` is a character and `p._2` is the associated string (on which you can call `length`, as you did above).
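To make that concrete, here is the full pipeline spelled out, with the pair also destructured via a `case` pattern (an equivalent, arguably more readable spelling; the values shown are for the `"hello"` example above):

val str = "hello"

// group characters by themselves
val groups = str.groupBy(_.toChar)                 // Map(e -> "e", h -> "h", l -> "ll", o -> "o")

// map each (key, group) pair to (key, group length)
val freqs = groups.map(p => (p._1, p._2.length))   // Map(e -> 1, h -> 1, l -> 2, o -> 1)

// the same thing with the pair destructured
val freqs2 = groups.map { case (c, cs) => (c, cs.length) }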
2) How to do this idiomatically
The above is how to do it idiomatically: using `groupBy` and `map`. Alternatively, you can use an immutable map and recursion on the string length to compute the frequencies (sketched below), or an immutable map and a `foldLeft`.
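For completeness, here is a rough sketch of the recursive variant (my own illustration, not part of the original answer): walk the string by index and thread an immutable map through the recursion.

import scala.annotation.tailrec

def frequencies(str: String): Map[Char, Int] = {
  @tailrec
  def loop(i: Int, acc: Map[Char, Int]): Map[Char, Int] =
    if (i == str.length) acc
    else loop(i + 1, acc.updated(str(i), acc.getOrElse(str(i), 0) + 1))
  loop(0, Map.empty)
}

// frequencies("hello") == Map('h' -> 1, 'e' -> 1, 'l' -> 2, 'o' -> 1)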
3) Performance characteristics
Best to benchmark to see the differences. Here are a couple of microbenchmarks for a highly repetitive string (~3GHz iMac, JDK7, Scala 2.10.0 nightly):
object Imperative extends testing.Benchmark {
  val str = "abc" * 750000
  def run() {
    val counts = new scala.collection.mutable.HashMap[Char, Int]
    var i = 0
    val until = str.length
    while (i < until) {
      val c = str(i)
      if (counts.contains(c))
        counts.put(c, counts(c) + 1)
      else
        counts.put(c, 1)
      i += 1
    }
    //println(counts)
  }
}
object Combinators extends testing.Benchmark {
  val str = "abc" * 750000
  def run() {
    val f = str.groupBy(_.toChar).map(p => (p._1, p._2.length))
  }
}
object Fold extends testing.Benchmark {
  val str = "abc" * 750000
  def run() {
    val f = str.foldLeft(Map[Char, Int]() withDefaultValue 0){ (h, c) => h.updated(c, h(c) + 1) }
  }
}
Results:
Imperative:
$ 103 57 53 58 53 53 53 53 53 53
Combinators:
$ 72 51 63 56 53 52 52 54 53 53
Fold:
$ 163 62 71 62 57 57 57 58 57 57
Note that changing the imperative version to use `withDefaultValue`:
val counts = new scala.collection.mutable.HashMap[Char, Int].withDefaultValue(0)
var i = 0
val until = str.length
while (i < until) {
  val c = str(i)
  counts.put(c, counts(c) + 1)
  i += 1
}
is apparently terribly slow, due to forwarding each `put` call:
withDefaultValue: $ 133 87 109 106 101 100 101 100 101 101
Conclusion: the cost of boxing and unboxing characters is high enough in this case that the differences in performance between these approaches are hard to observe.
EDIT: You may want to use ScalaMeter's inline benchmarking in place of the `Benchmark` trait.
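For reference, a minimal sketch of what that might look like (assuming ScalaMeter's `org.scalameter` inline API is on the classpath; the exact keys and defaults may differ between ScalaMeter versions):

import org.scalameter._

object InlineBench extends App {
  val str = "abc" * 750000

  // run the combinator version a number of times after a default JVM warm-up
  val time = config(
    Key.exec.benchRuns -> 20
  ) withWarmer {
    new Warmer.Default
  } measure {
    str.groupBy(_.toChar).map(p => (p._1, p._2.length))
  }

  println("groupBy/map took " + time)
}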