Fast escaping/deparsing of character vectors in R

Here's a C++ version of Winston's code. It's quite a lot simpler because you can efficiently grow std::strings. It's also less likely to crash because Rcpp takes care of memory management for you.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
std::string escape_one(std::string x) {
  std::string out = "\"";

  int n = x.size();
  for (int i = 0; i < n; ++i) {
    char cur = x[i];

    switch(cur) {
      case '\\': out += "\\\\"; break;
      case '"':  out += "\\\""; break;
      case '\n': out += "\\n";  break;
      case '\r': out += "\\r";  break;
      case '\t': out += "\\t";  break;
      case '\b': out += "\\b";  break;
      case '\f': out += "\\f";  break;
      default:     out += cur;
    }
  }

  out += '"';

  return out;
}

// [[Rcpp::export]]
CharacterVector escape_chars(CharacterVector x) {
  int n = x.size();
  CharacterVector out(n);

  for (int i = 0; i < n; ++i) {
    String cur = x[i];
    out[i] = escape_one(cur);
  }

  return out;
}

On your benchmark, deparse_vector2(strings) takes 0.8s, and escape_chars(strings) takes 0.165s.


I don't know of a faster way to do this with just R code, but I did decide to try my hand at implementing it in C, wrapped in an R function called deparse_vector3. It's rough (and I'm far from an expert C programmer) but it seems to work for your examples: https://gist.github.com/wch/e3ec5b20eb712f1b22b2

On my system (Mac, R 3.1.1), deparse_vector2 is over 20x faster than deparse_vector, which is a much bigger difference than the 5x you got in your test.

My deparse_vector3 function is just 3x faster than deparse_vector2. There's probably room for improvement.

> system.time(out1 <- deparse_vector1(strings))
   user  system elapsed 
  8.459   0.009   8.470 
> system.time(out2 <- deparse_vector2(strings))
   user  system elapsed 
  0.368   0.007   0.374 
> system.time(out3 <- deparse_vector3(strings))
   user  system elapsed 
  0.120   0.001   0.120 

I don't think this will correctly handle non-ASCII character encodings, though. Here's an example of how encodings are handled in the R source: https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/grep.c#L588-L630

Edit: This seems to handle UTF-8 OK, though it's possible I'm missing something in my testing.


One more stab at this problem that takes advantage of a couple facts.

Given a string x with length n, we know the output string will be at least length x, and at most 2 * x. We can take advantage of this to ensure we only allocate memory once, rather than rely on containers that grow (albeit efficiently).

Note that I make use of C++11's shared_ptr here, since I am doing ugly things with raw memory (and want to ensure it gets cleaned up automatically). This also allows me to avoid the initial pass wherein I attempt to count matches, but also forces me to over-allocate a bit excessively (the case wherein every single character must be escaped will be rare).

It would be relatively easy to adapt this to a pure C solution, I think, but would be trickier to ensure memory is properly cleaned up.

#include <memory>
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void escape_one_fill(CharacterVector const& x, int i, CharacterVector& output) {

  auto xi = CHAR(STRING_ELT(x, i));
  int n = strlen(xi);

  // Over-allocate memory -- we know that in the worst case the output
  // string is 2x the length of x (plus 1 for \0)
  auto out = std::make_shared<char*>(new char[n * 2 + 1]);

  int counter = 0;
  (*out)[counter++] = '"';

  #define HANDLE_CASE(X, Y) \
    case X: \
      (*out)[counter++] = '\\'; \
      (*out)[counter++] = Y; \
      break;

  for (int j = 0; j < n; ++j) {
    switch (xi[j]) {
      HANDLE_CASE('\\', '\\');
      HANDLE_CASE('"', '"');
      HANDLE_CASE('\n', 'n');
      HANDLE_CASE('\r', 'r');
      HANDLE_CASE('\t', 't');
      HANDLE_CASE('\b', 'b');
      HANDLE_CASE('\f', 'f');
      default: (*out)[counter++] = xi[j];
    }
  }

  (*out)[counter++] = '"';

  // Set a NUL so that Rf_mkChar does what it should
  (*out)[counter++] = '\0';
  SET_STRING_ELT(output, i, Rf_mkChar(*out));

}

// [[Rcpp::export]]
CharacterVector escape_chars_with_fill(CharacterVector x) {
  int n = x.size();
  CharacterVector out(n);

  for (int i = 0; i < n; ++i) {
    escape_one_fill(x, i, out);
  }

  return out;
}

Benchmarking this, I get (just comparing to Hadley's impl):

> mychars <- c(letters, " ", '"', "\\", "\t", "\n", "\r", "'", "/", "#", "$");

> createstring <- function(length){
+   paste(mychars[ceiling(runif(length, 0, length(mychars)))], collapse="")
+ }

> strings <- vapply(rep(1000, 10000), createstring, character(1), USE.NAMES=FALSE)

> system.time(escape_chars(strings))
   user  system elapsed 
   0.14    0.00    0.14 

> system.time(escape_chars_with_fill(strings))
   user  system elapsed 
  0.080   0.001   0.081 

> identical(escape_chars(strings), escape_chars_with_fill(strings))
[1] TRUE

You can also try stri_escape_unicode from the stringi package (although you preferred a solution without additional dependencies but I think it could be useful for future readers too) which about 3 times faster than deparse_vector2 and about 7 times faster than deparse_vector

require(stringi)

Defining the function

deparse_vector3 <- function(x){
  paste0("\"",stri_escape_unicode(x), "\"")
}

Checking that all functions give smae result

all.equal(deparse_vector2(test), deparse_vector3(test))
## [1] TRUE
all.equal(deparse_vector(test), deparse_vector3(test))
## [1] TRUE

Some benchmarks

library(microbenchmark)
microbenchmark(deparse_vector(test), 
               deparse_vector2(test),
               deparse_vector3(test), times = 1000L)

# Unit: microseconds
#                  expr    min      lq  median      uq      max neval
#  deparse_vector(test) 98.548 102.654 104.707 111.380 2500.653  1000
# deparse_vector2(test) 43.114  46.707  48.761  51.327  401.377  1000
# deparse_vector3(test) 14.885  16.938  18.991  20.018  240.211  1000 <-- Clear winner