Substitute values in a string with placeholders in Scala

Speaking strictly to the replacement problem, my preferred solution is one enabled by a feature that should probably be available in the upcoming Scala 2.8, which is the ability to replace regex patterns using a function. Using it, the problem can be reduced to this:

def replaceRegex(input: String, values: IndexedSeq[String]) =  
  """\$(\d+)""".r.replaceAllMatchesIn(input, {
    case Regex.Groups(index) => values(index.toInt)
  })

Which reduces the problem to what you actually intend to do: replace all $N patterns by the corresponding Nth value of a list.

Or, if you can actually set the standards for your input string, you could do it like this:

"select col1 from tab1 where id > %1$s and name like %2$s" format ("one", "two")

If that's all you want, you can stop here. If, however, you are interested in how to go about solving such problems in a functional way, absent clever library functions, please do continue reading.

Thinking functionally about it means thinking of the function. You have a string, some values, and you want a string back. In a statically typed functional language, that means you want something like this:

(String, List[String]) => String

If one considers that those values may be used in any order, we may ask for a type better suited for that:

(String, IndexedSeq[String]) => String

That should be good enough for our function. Now, how do we break down the work? There are a few standard ways of doing it: recursion, comprehension, folding.

RECURSION

Let's start with recursion. Recursion means to divide the problem into a first step, and then repeating it over the remaining data. To me, the most obvious division here would be the following:

Replace the first placeholder
Repeat with the remaining placeholders

That is actually pretty straight-forward to do, so let's get into further details. How do I replace the first placeholder? One thing that can't be avoided is that I need to know what that placeholder is, because I need to get the index into my values from it. So I need to find it:

(String, Pattern) => String

Once found, I can replace it on the string and repeat:

val stringPattern = "\\$(\\d+)"
val regexPattern = stringPattern.r
def replaceRecursive(input: String, values: IndexedSeq[String]): String = regexPattern findFirstIn input match {
  case regexPattern(index) => replaceRecursive(input replaceFirst (stringPattern, values(index.toInt)))
  case _ => input // no placeholder found, finished
}

That is inefficient, because it repeatedly produces new strings, instead of just concatenating each part. Let's try to be more clever about it.

To efficiently build a string through concatenation, we need to use StringBuilder. We also want to avoid creating new strings. StringBuilder can accepts CharSequence, which we can get from String. I'm not sure if a new string is actually created or not -- if it is, we could roll our own CharSequence in a way that acts as a view into String, instead of creating a new String. Assured that we can easily change this if required, I'll proceed on the assumption it is not.

So, let's consider what functions we need. Naturally, we'll want a function that returns the index into the first placeholder:

String => Int

But we also want to skip any part of the string we have already looked at. That means we also want a starting index:

(String, Int) => Int

There's one small detail, though. What if there's on further placeholder? Then there wouldn't be any index to return. Java reuses the index to return that exception. When doing functional programming however, it is always best to return what you mean. And what we mean is that we may return an index, or we may not. The signature for that is this:

(String, Int) => Option[Int]

Let's build this function:

def indexOfPlaceholder(input: String, start: Int): Option[Int] = if (start < input.lengt) {
  input indexOf ("$", start) match {
    case -1 => None
    case index => 
      if (index + 1 < input.length && input(index + 1).isDigit)
        Some(index)
      else
        indexOfPlaceholder(input, index + 1)
  }
} else {
  None
}

That's rather complex, mostly to deal with boundary conditions, such as index being out of range, or false positives when looking for placeholders.

To skip the placeholder, we'll also need to know it's length, signature (String, Int) => Int:

def placeholderLength(input: String, start: Int): Int = {
  def recurse(pos: Int): Int = if (pos < input.length && input(pos).isDigit)
    recurse(pos + 1)
  else
    pos
  recurse(start + 1) - start  // start + 1 skips the "$" sign
}

Next, we also want to know what, exactly, the index of the value the placeholder is standing for. The signature for this is a bit ambiguous:

(String, Int) => Int

The first Int is an index into the input, while the second is an index into the values. We could do something about that, but not that easily or efficiently, so let's ignore it. Here's an implementation for it:

def indexOfValue(input: String, start: Int): Int = {
  def recurse(pos: Int, acc: Int): Int = if (pos < input.length && input(pos).isDigit)
    recurse(pos + 1, acc * 10 + input(pos).asDigit)
  else
    acc
  recurse(start + 1, 0) // start + 1 skips "$"
}

We could have used the length too, and achieve a simpler implementation:

def indexOfValue2(input: String, start: Int, length: Int): Int = if (length > 0) {
  input(start + length - 1).asDigit + 10 * indexOfValue2(input, start, length - 1)
} else {
  0
}

As a note, using curly brackets around simple expressions, such as above, is frowned upon by conventional Scala style, but I use it here so it can be easily pasted on REPL.

So, we can get the index to the next placeholder, its length, and the index of the value. That's pretty much everything needed for a more efficient version of replaceRecursive:

def replaceRecursive2(input: String, values: IndexedSeq[String]): String = {
  val sb = new StringBuilder(input.length)
  def recurse(start: Int): String = if (start < input.length) {
    indexOfPlaceholder(input, start) match {
      case Some(placeholderIndex) =>
        val placeholderLength = placeholderLength(input, placeholderIndex)
        sb.append(input subSequence (start, placeholderIndex))
        sb.append(values(indexOfValue(input, placeholderIndex)))
        recurse(start + placeholderIndex + placeholderLength)
      case None => sb.toString
    }
  } else {
    sb.toString
  }
  recurse(0)
}

Much more efficient, and as functional as one can be using StringBuilder.

COMPREHENSION

Scala Comprehensions, at their most basic level, means transforming T[A] into T[B] given a function A => B, something known as a functor. It can be easily understood when it comes to collections. For instance, I may transform a List[String] of names into a List[Int] of name lengths through a function String => Int which returns the length of a string. That's a list comprehension.

There are other operations that can be done through comprehensions, given functions with signatures A => T[B], which is related to monads, or A => Boolean.

That means we need to see the input string as a T[A]. We can't use Array[Char] as input because we want to replace the whole placeholder, which is larger than a single char. Let's propose, therefore, this type signature:

(List[String], String => String) => String

Since we the input we receive is String, we need a function String => List[String] first, which will divide our input into placeholders and non-placeholders. I propose this:

val regexPattern2 = """((?:[^$]+|\$(?!\d))+)|(\$\d+)""".r
def tokenize(input: String): List[String] = regexPattern2.findAllIn(input).toList

Another problem we have is that we got an IndexedSeq[String], but we need a String => String. There are many ways around that, but let's settle with this:

def valuesMatcher(values: IndexedSeq[String]): String => String = (input: String) => values(input.substring(1).toInt - 1)

We also need a function List[String] => String, but List's mkString does that already. So there's little left to do aside composing all this stuff:

def comprehension(input: List[String], matcher: String => String) = 
  for (token <- input) yield (token: @unchecked) match {
    case regexPattern2(_, placeholder: String) => matcher(placeholder)
    case regexPattern2(other: String, _) => other
  }

I use @unchecked because there shouldn't be any pattern other than these two above, if my regex pattern was built correctly. The compiler doesn't know that, however, so I use that annotation to silent the warning it would produce. If an exception is thrown, there's a bug in the regex pattern.

The final function, then, unifies all that:

def replaceComprehension(input: String, values: IndexedSeq[String]) =
  comprehension(tokenize(input), valuesMatcher(values)).mkString

One problem with this solution is that I apply the regex pattern twice: once to break up the string, and the other to identify the placeholders. Another problem is that the List of tokens is an unnecessary intermediate result. We can solve that with these changes:

def tokenize2(input: String): Iterator[List[String]] = regexPattern2.findAllIn(input).matchData.map(_.subgroups)

def comprehension2(input: Iterator[List[String]], matcher: String => String) = 
  for (token <- input) yield (token: @unchecked) match {
    case List(_, placeholder: String) => matcher(placeholder)
    case List(other: String, _) => other
  }

def replaceComprehension2(input: String, values: IndexedSeq[String]) =
  comprehension2(tokenize2(input), valuesMatcher(values)).mkString

FOLDING

Folding is a bit similar to both recursion and comprehension. With folding, we take a T[A] input that can be comprehended, a B "seed", and a function (B, A) => B. We comprehend the list using the function, always taking the B that resulted from the last element processed (the first element takes the seed). Finally, we return the result of the last comprehended element.

I'll admit I could hardly explained it in a less-obscure way. That's what happens when you try to keep abstract. I explained it that way so that the type signatures involved become clear. But let's just see a trivial example of folding to understand its usage:

def factorial(n: Int) = {
  val input = 2 to n
  val seed = 1
  val function = (b: Int, a: Int) => b * a
  input.foldLeft(seed)(function)
}

Or, as a one-liner:

def factorial2(n: Int) = (2 to n).foldLeft(1)(_ * _)

Ok, so how would we go about solving the problem with folding? The result, of course, should be the string we want to produce. Therefore, the seed should be an empty string. Let's use the result from tokenize2 as the comprehensible input, and do this:

def replaceFolding(input: String, values: IndexedSeq[String]) = {
  val seed = new StringBuilder(input.length)
  val matcher = valuesMatcher(values)
  val foldingFunction = (sb: StringBuilder, token: List[String]) => {
    token match {          
      case List(_, placeholder: String) => sb.append(matcher(placeholder))
      case List(other: String, _) => sb.append(other)
    }
    sb
  }
  tokenize2(input).foldLeft(seed)(foldingFunction).toString
}

And, with that, I finish showing the most usual ways one would go about this in a functional manner. I have resorted to StringBuilder because concatenation of String is slow. If that wasn't the case, I could easily replace StringBuilder in functions above by String. I could also convert Iterator into a Stream, and completely do away with mutability.

This is Scala, though and Scala is about balancing needs and means, not of purist solutions. Though, of course, you are free to go purist. :-)

You can use the standard Java String.format style with a twist:

"My name is %s and I am %d years of age".format("Oxbow", 34)

In Java of course this would have looked like:

String.format("My name is %s and I am %d years of age", "Oxbow", 34)

The primary difference between these two styles (I much prefer Scala's) is that conceptually this means that every String can be considered a format string in Scala (i.e. the format method appears to be an instance method on the String class). Whilst this might be argued to be conceptually wrong, it leads to more intuitive and readable code.

This formatting style allows you to format floating-point numbers as you wish, dates etc. The main issue with it is that the "binding" between the placeholders in the format string and the arguments is purely order based, not related to names in any way (like "My name is ${name}") although I fail to see how...

interpolate("My name is ${name} and I am ${age} years of age", 
               Map("name" -> "Oxbow", "age" -> 34))

...is any more readable embedded in my code. This sort of thing is much more useful for text replacement where the source text is embedded in separate files (in i18n for example) where you would want something like:

"name.age.intro".text.replacing("name" as "Oxbow").replacing("age" as "34").text

Or:

"My name is ${name} and I am ${age} years of age"
     .replacing("name" as "Oxbow").replacing("age" as "34").text

I would think that this would be pretty easy to use and take just a few minutes to write (I can't seem to get Daniel's interpolate to compile with the Scala 2.8 version I have):

object TextBinder {
  val p = new java.util.Properties
  p.load(new FileInputStream("C:/mytext.properties"))

  class Replacer(val text: String) {
    def replacing(repl: Replacement) = new Replacer(interpolate(text, repl.map))
  }

  class Replacement(from: String, to: String) {
    def map = Map(from -> to)
  }
  implicit def stringToreplacementstr(from: String) = new {
    def as(to: String) = new Replacement(from, to)
    def text = p.getProperty(from)
    def replacing(repl: Replacement) = new Replacer(from)
  }

  def interpolate(text: String, vars: Map[String, String]) = 
    (text /: vars) { (t, kv) => t.replace("${"+kv._1+"}", kv._2)  }
}

I am a a sucker for fluent APIs by the way! No matter how unperformant they are!

Substitute values in a string with placeholders in Scala

Tags:

Functional Programming

Scala

Related

Recent Posts