Generating a frequency map for a string in Scala

Extending Axel's answer.

Your groupBy solution is already functional. There's just a tiny-tiny correction to it which could make it cleaner:

str.groupBy(_.toChar).mapValues(_.size)

The Scala's alternative to inject is foldLeft, foldRight, reduce, reduceOption depending on how you use it. The way you've used inject in Ruby is not functional, since your solution is based on mutating h and in functional world mutability is a "no-no". Here's how you'd do the solution close to your inject but in functional style in Scala:

str.foldLeft( Map[Char, Int]() ){ (m, c) => m + (c -> (m.getOrElse(c, 0) + 1)) }

Obviously groupBy looks much better.


1) What does p mean?

groupBy takes a function which maps an elements to a key of type K. When invoked on some collection Coll, it returns a Map[K, Coll] which contains mappings from keys K to all the elements which mapped to the same key.

So, in your case, str.groupBy(_.toChar) yields a map mapping from a key k (which is a character) to a string with all the elements (characters) c such that k == c.toChar. You get this:

Map(e -> "e", h -> "h", l -> "ll", o -> "o")

A Map is an iterable of pairs of keys and values. In this case, each pair is a character and a string of elements. Calling the map operation on a Map involves mapping on these pairs - p is a pair where p._1 is a character, and p._2 is the associated string (on which you can call length, as you did above).

2) How to do this idiomatically

The above is how to do it idiomatically - using groupBy and map. Alternatively, you can use an immutable map and recursion on the string length to compute the frequencies, or an immutable map and a foldLeft.

3) Performance characteristic

Best to benchmark to see the differences. Here are a couple of microbenchmark for a highly-repetitive string (~3GHz iMac, JDK7, Scala 2.10.0 nightly):

object Imperative extends testing.Benchmark {
  val str = "abc" * 750000

  def run() {
    var counts = new scala.collection.mutable.HashMap[Char,Int]
    var i = 0
    val until = str.length
    while (i < until) {
      var c = str(i)
      if (counts.contains(c))
        counts.put(c, counts(c) + 1)
      else
        counts.put(c, 1)
      i += 1
    }

    //println(f)
  }
}


object Combinators extends testing.Benchmark {
  val str = "abc" * 750000

  def run() {
    val f = str.groupBy(_.toChar).map(p => (p._1, p._2.length))
  }
}


object Fold extends testing.Benchmark {
  val str = "abc" * 750000

  def run() {
    val f = str.foldLeft(Map[Char, Int]() withDefaultValue 0){(h, c) => h.updated(c, h(c)+1)}
  }
}

Results:

  • Imperative: $ 103 57 53 58 53 53 53 53 53 53

  • Combinators: $ 72 51 63 56 53 52 52 54 53 53

  • Fold: $ 163 62 71 62 57 57 57 58 57 57

Note that changing the imperative version to use withDefaultValue:

var counts = new scala.collection.mutable.HashMap[Char,Int].withDefaultValue(0)
var i = 0
val until = str.length
while (i < until) {
  var c = str(i)
  counts.put(c, counts(c) + 1)
  i += 1
}

is apparently terribly slow due to forwarding each put call:

  • withDefaultValue: $ 133 87 109 106 101 100 101 100 101 101

Conclusion: the boxing and unboxing of characters in this case is high-enough so that the differences in performance between these approaches are hard to observe.

EDIT:

Update: You may want to use ScalaMeter inline benchmarking in place of the Benchmark trait.

Tags:

String

Map

Scala