Fastest way to read large Excel xlsx files? To parallelize or not?
You could try to run it in parallel using the parallel package, but it is hard to say how much faster it will be without sample data:
library(parallel)
library(readxl)
excel_path <- ""
sheets <- excel_sheets(excel_path)
Make a cluster with a specified number of cores:
cl <- makeCluster(detectCores() - 1)
Use parLapplyLB to go through the Excel sheets and read them in parallel with load balancing:
parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
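If you want to refer to the sheets by name afterwards (res is just an illustrative name), you can assign the result and label it with the sheet names, since parLapplyLB returns its results in the same order as the input:
res <- parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
names(res) <- sheets  # one data frame per sheet, named after the sheet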
You can use the microbenchmark package to test how fast certain options are:
library(microbenchmark)
microbenchmark(
lapply = {lapply(sheets, function(sheet) {
read_excel(excel_path, sheet = sheet)
})},
parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)},
times = 10
)
In my case, the parallel version is faster:
Unit: milliseconds
expr min lq mean median uq max neval
lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890 10
parallel 58.94018 64.96452 118.5969 71.42688 80.48588 316.9914 10
The test file contains 6 sheets, each containing this table:
test test1 test3 test4 test5
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
10 10 10 10 10 10
11 11 11 11 11 11
12 12 12 12 12 12
13 13 13 13 13 13
14 14 14 14 14 14
15 15 15 15 15 15
Note: you can use stopCluster(cl) to shut down the workers when the process is finished.
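For example, a small wrapper (read_all_sheets is just an illustrative name, not part of any package) that sets up the cluster, reads every sheet, and shuts the workers down again even if something fails:
read_all_sheets <- function(excel_path, workers = parallel::detectCores() - 1) {
  cl <- parallel::makeCluster(workers)
  on.exit(parallel::stopCluster(cl), add = TRUE)  # always stop the workers
  sheets <- readxl::excel_sheets(excel_path)
  res <- parallel::parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)
  names(res) <- sheets
  res
}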
I saw @clemens' answer, but since I had already prepared some material, I am posting it anyway. In addition to @clemens' answer, I use larger test data and run a simpler multicore option using furrr::future_map(), which does not give any performance gain in the end...
Data generation
This will create 10 sheets of 10000 x 15 data with a mixture of float, integer, and character columns. On my disk, the file size is 13.2 MB.
library(writexl)
library(tidyverse)
n <- 1e4
# 10 sheets, each built from 5 blocks of (numeric, integer, character) columns
sample_data <- map(seq(10), function(sheet_id) {
  map(1:5, function(i) {
    tibble(
      num_var  = rnorm(n),
      int_var  = as.integer(sample(1e5:9e5, n, replace = TRUE)),
      char_var = sample(letters, n, replace = TRUE)
    ) %>% rename_all(~ paste0(., i))  # suffix column names with the block index
  }) %>% bind_cols()
})
fn <- tempfile(tmpdir = "~/Desktop/temp", fileext = ".xlsx")
write_xlsx(sample_data, path = fn)
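To check the size of the generated file on your own machine (the 13.2 MB above depends on the random data):
file.size(fn) / 1024^2  # size on disk in MB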
Benchmark
The parallel part is borrowed from @clemens.
library(parallel)
library(readxl)
library(purrr)
sheets <- excel_sheets(fn)
cl <- makeCluster(detectCores() - 1)
excel_path <- fn
microbenchmark::microbenchmark(
map = map(sheets, function(x) read_xlsx(fn, sheet = x)),
future_map = furrr::future_map(sheets, function(x) read_xlsx(fn, sheet = x)),
parLapplyLB = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_xlsx(excel_path, sheet = sheet)
}, excel_path)},
times = 10
)
The benchmark result looks like this:
Unit: milliseconds
expr min lq mean median uq max neval
map 1258.2643 1272.2354 1351.8371 1291.2474 1430.211 1556.992 10
future_map 1276.4125 1302.2022 1469.8349 1436.5356 1616.146 1702.494 10
parLapplyLB 809.2697 863.3299 951.1041 914.9503 1014.907 1189.897 10
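One possible reason future_map shows no gain here: assuming no future::plan() was set beforehand, furrr defaults to the sequential plan, so future_map() runs on a single core. A sketch of how to enable parallel workers before benchmarking:
library(future)
plan(multisession, workers = availableCores() - 1)  # start background R sessions
furrr::future_map(sheets, ~ readxl::read_xlsx(fn, sheet = .x))
plan(sequential)  # shut the workers down again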
I have a relatively weak CPU, so the gain may well be larger in other environments; in the end, though, the SQL-writing step will probably be the bottleneck, since reading with read_xlsx is already fast.
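If the downstream step is indeed loading the sheets into a database (an assumption about the workflow; the SQLite file and table names below are only for illustration), a minimal sketch with DBI would be:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "sheets.sqlite")
sheet_data <- lapply(sheets, function(sheet) readxl::read_xlsx(fn, sheet = sheet))
names(sheet_data) <- sheets
for (sheet in sheets) {
  dbWriteTable(con, sheet, sheet_data[[sheet]], overwrite = TRUE)  # one table per sheet
}
dbDisconnect(con)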
Note
I also tried other packages such as gdata and xlsx. They are painfully slow and not worth considering.