Fast and efficient character DataFrame creation in Rcpp
Thanks for making a snapshot of the data available (BTW: no point tar'ing a single file, you could just have xz
'ed the csvfile. Anyway.)
I get different results on my Ubuntu 20.04 box which are closer to what I anticipated:
data.table::fread()
is competitive as we expected (I am runningdata.table
fromgit
as there was a regression in the most recent release)vroom
andstringfish
, once we force materialization to compare apples to apples rather than images of apples, are about the sameRcpp
is in the ballpark too but a little more variable
I capped it at 10 runs, the variability probably comes down if you run more but the caching influences it too.
In short: no clear winners, and surely no mandate to replace one of the (alreadty known to be tuned) reference implementations.
edd@rob:~/git/stackoverflow/65043010(master)$ Rscript bm.R
Unit: seconds
expr min lq mean median uq max neval cld
fread 1.37294 1.51211 1.54004 1.55138 1.57639 1.62939 10 a
vroom 1.44670 1.53659 1.62104 1.61172 1.61764 1.88921 10 a
sfish 1.21609 1.57000 1.57635 1.60180 1.63933 1.72975 10 a
rcpp1 1.44111 1.45354 1.61275 1.55190 1.60535 2.15847 10 a
rcpp2 1.47902 1.57970 1.75067 1.60114 1.64857 2.75851 10 a
edd@rob:~/git/stackoverflow/65043010(master)$
Code for top-level script
suppressMessages({
library(data.table)
library(Rcpp)
library(vroom)
library(stringfish)
library(microbenchmark)
})
vroomread <- function(csvfile) {
a <- vroom(csvfile, col_types = "cc", progress = FALSE)
vroom:::vroom_materialize(a, TRUE)
}
sfread <- function(csvfile) {
a <- sf_readLines(csvfile)
dt <- data.table::data.table(uns = sf_substr(a, 1, 81),
sol = sf_substr(a, 83, 163))
}
sourceCpp("rcppfuncs.cpp")
csvfile <- "sudoku_100k.csv"
microbenchmark(fread=fread(csvfile),
vroom=vroomread(csvfile),
sfish=sfread(csvfile),
rcpp1=setalloccol(read_to_df_ifstream(csvfile)),
rcpp2=setalloccol(read_to_df_ifstream_charvector(csvfile)),
times=10)
Code for Rcpp script
#include <Rcpp.h>
#include <fstream>
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
std::vector<std::string> a, b;
a.reserve(n_lines);
b.reserve(n_lines);
while (std::getline(file, line)) {
a.push_back(line.substr(0, 80));
b.push_back(line.substr(82, 162));
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename, std::ifstream::in);
std::string line;
// burn the header
std::getline(file, line);
Rcpp::CharacterVector a(n_lines), b(n_lines);
int l = 0;
while (std::getline(file, line)) {
a(l) = line.substr(0, 80);
b(l) = line.substr(82, 162);
l++;
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
This is not really a proper answer to my question, more some thoughts I didn't want to go wasted as well as some benchmarks. Maybe useful to someone who faces a similar issue.
To recall, the basic idea is to read 1 million rows of two 81 character long strings into an R object (preferably a data.frame, data.table, or tibble). For the benchmarks I have used the 1 million sudoku dataset of Kyubyong Park.
I structured the answer into two parts: 1) using other R packages and 2) using Rcpp/C++ and C to work on a lower level.
Surprisingly, for character data specialised packages such as stringi
, stringfish
, or vroom
are really efficient and beat (my) lower level C++/C code.
Important to note is that some packages use ALTREP (see for example Francoise take on them here), which means that the data does not materialize in R until needed. I.e., loading the data using vroom takes less than 1 second, but the first operations (which need to materialize the data) take way longer... To circumnavigate this, I either force the materialization of the data by putting it into a data.table or use an internal function of vroom to force it.
1) R packages
data.table and fread
- 75 secs
Mainly as a base benchmark.
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
dt <- data.table::fread(file, colClasses = "character")
tictoc::toc()
#> 75.296 sec elapsed
Vroom with materialization - 19 secs
Note that vroom uses ALTREP, forcing materialization to level the playing field!
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- vroom::vroom(file, col_types = "cc", progress = FALSE)
# internal function that materializes the ALTREP data
df <- vroom:::vroom_materialize(a, TRUE)
tictoc::toc()
#> 19.926 sec elapsed
Stringfish - 19 secs
Stringfish uses ALTREP, so reading the data and taking the substrings takes less than one second. Materialization takes the rest, similar to vroom.
library(stringfish)
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- sf_readLines(file)
dt <- data.table::data.table(
uns = sf_substr(a, 1, 81),
sol = sf_substr(a, 83, 163)
)
tictoc::toc()
#> 19.698 sec elapsed
Stringi - 22 secs
Note that the conversion to data.table takes virtually no time.
tictoc::tic()
a <- stringi::stri_read_lines(file)
# discard header
a <- a[-1]
dt <- data.table::data.table(
uns = stringi::stri_sub(a, 1, 81),
sol = stringi::stri_sub(a, 83, 163)
)
tictoc::toc()
#> 22.409 sec elapsed
2) C and Cpp functions
Rcpp with ifstream
read to STL first - 22 secs
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename);
std::string line;
// burn the header
std::getline(file, line);
std::vector<std::string> a, b;
a.reserve(n_lines);
b.reserve(n_lines);
while (std::getline(file, line)) {
a.push_back(line.substr(0, 80));
b.push_back(line.substr(82, 162));
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 22.098 sec elapsed
*/
Rcpp with ifstream
read directly to Rcpp::CharacterVector
- 21 secs
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_ifstream_charvector(std::string filename) {
const int n_lines = 1000000;
std::ifstream file(filename);
std::string line;
// burn the header
std::getline(file, line);
Rcpp::CharacterVector a(n_lines), b(n_lines);
int l = 0;
while (std::getline(file, line)) {
a(l) = line.substr(0, 80);
b(l) = line.substr(82, 162);
l++;
}
Rcpp::List df(2);
df.names() = Rcpp::CharacterVector::create("unsolved", "solved");
df["unsolved"] = a;
df["solved"] = b;
df.attr("class") = Rcpp::CharacterVector::create("data.table", "data.frame");
return df;
}
/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_ifstream_charvector(file)
dt <- data.table::setalloccol(raw)
tictoc::toc()
#> 21.436 sec elapsed
*/
Rcpp with buffer - 75 secs
This is basically the initial approach I chose, as outlined in the question above. Not really sure why its slower than the others...
//[[Rcpp::export]]
Rcpp::DataFrame read_to_df_buffer(std::string filename) {
const int max_buffer_size = 1e8;
const int header_size = 18; // only fixed in this example...
const int n_lines = 1000000;
FILE* infile;
infile = fopen(filename.c_str(), "r");
if (infile == NULL) Rcpp::stop("File Error!\n");
fseek(infile, 0L, SEEK_END);
int64_t file_size = ftell(infile);
fseek(infile, 0L, SEEK_SET);
// initiate the buffers
char* buffer;
int64_t buffer_size = sizeof(char) * max_buffer_size > file_size
? file_size : max_buffer_size;
buffer = (char*) malloc(buffer_size);
// skip the header...
int64_t this_buffer_size = fread(buffer, 1, header_size, infile);
// a holds the first part (quizzes or unsolved) b holds solution/solved
std::vector<std::string> a, b;
a.resize(n_lines);
b.resize(n_lines);
const int line_length = 2 * 82; // 2 times 81 digits plus one , or newline
int l = 0;
// fill the buffer
int current_pos = ftell(infile);
int next_buffer_size = file_size - current_pos > buffer_size
? buffer_size : file_size - current_pos;
while ((this_buffer_size = fread(buffer, 1, next_buffer_size, infile)) > 0) {
// read a buffer from current_pos to ftell(infile)
Rcpp::checkUserInterrupt();
int i = 0;
while (i + line_length <= this_buffer_size) {
a[l] = std::string(buffer + i, buffer + i + 81);
i += 82;
b[l] = std::string(buffer + i, buffer + i + 81);;
i += 82;
l++;
}
if (i == 0) break;
if (i != this_buffer_size) {
// file pointer reset by i - this_buffer_size (offset to end of buffer)
fseek(infile, i - this_buffer_size, SEEK_CUR);
}
// determine the next buffer size. If the buffer is too large, take only whats
// needed
current_pos = ftell(infile);
next_buffer_size = file_size - current_pos > buffer_size
? buffer_size : file_size - current_pos;
}
free(buffer);
fclose(infile);
Rcpp::DataFrame df = Rcpp::DataFrame::create(
Rcpp::Named("unsolved") = a,
Rcpp::Named("solved") = b,
Rcpp::Named("stringsAsFactors") = false
);
return df;
}
/*** R
tictoc::tic()
file <- "sudokus/sudoku_1m.csv"
raw <- read_to_df_buffer(file)
tictoc::toc()
75.915 sec elapsed
*/
Using Rs C API - 125 secs
Not sure why this is not faster, probably because my C code is not efficient... If you have any improvements, I'll gladly update the timings.
The mkChar()
function creates a CHARSXP
which can be inserted into a character vector STRSXP
. Note that most R characters are stored in a cache (see also 1.10 of R Internals), maybe if we can circumvent the cache we can gain some speedups - not sure how to do this or if this is wise in any way...
Preferably, I would like to pre allocate 1 mln STRSXP
of size 81, memcpy()
the values from the C array, and SET_STRING_ELT()
them to the vector. No idea how to do it, though.
See also:
- https://cran.r-project.org/doc/manuals/r-release/R-ints.html
- http://adv-r.had.co.nz/C-interface.html
- https://github.com/hadley/r-internals/
read_to_list_sexp <- inline::cfunction(c(fname = "character"), '
const char * filename = CHAR(asChar(fname));
FILE* infile;
infile = fopen(filename, "r");
if (infile == NULL) error("File cannot be opened");
fseek(infile, 0L, SEEK_END);
int64_t file_size = ftell(infile);
fseek(infile, 0L, SEEK_SET);
const int n_lines = 1000000;
SEXP uns = PROTECT(allocVector(STRSXP, n_lines));
SEXP sol = PROTECT(allocVector(STRSXP, n_lines));
char * line = NULL;
size_t len = 0;
ssize_t read;
int l = 0;
char char_array[82];
char_array[81] = 0;
// skip header
read = getline(&line, &len, infile);
while ((read = getline(&line, &len, infile)) != -1) {
memcpy(char_array, line, 81);
SET_STRING_ELT(uns, l, mkChar(char_array));
memcpy(char_array, line + 82, 81);
SET_STRING_ELT(sol, l, mkChar(char_array));
l++;
if (l == n_lines) break;
}
fclose(infile);
SEXP res = PROTECT(allocVector(VECSXP, 2));
SET_VECTOR_ELT(res, 0, uns);
SET_VECTOR_ELT(res, 1, sol);
UNPROTECT(3);
return res;
')
file <- "sudokus/sudoku_1m.csv"
tictoc::tic()
a <- foo(file)
df <- data.table::as.data.table(a)
tictoc::toc()
#> 125.514 sec elapsed