Fastest way to read large Excel xlsx files? To parallelize or not?
You could try to run it in parallel using the parallel package, but it is hard to say how much faster it will be without sample data:
library(parallel)
library(readxl)
excel_path <- ""
sheets <- excel_sheets(excel_path)
Make a cluster with a specified number of cores:
cl <- makeCluster(detectCores() - 1)
Use parLapplyLB to go through the Excel sheets and read them in parallel with load balancing:
parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
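If you want to refer to the sheets by name afterwards (res is just an illustrative name), you can assign the result and label it with the sheet names, since parLapplyLB returns its results in the same order as the input:
res <- parLapplyLB(cl, sheets, function(sheet, excel_path) {
  readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)
names(res) <- sheets  # one data frame per sheet, named after the sheet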
You can use the microbenchmark package to test how fast certain options are:
library(microbenchmark)
microbenchmark(
lapply = {lapply(sheets, function(sheet) {
read_excel(excel_path, sheet = sheet)
})},
parallel = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_excel(excel_path, sheet = sheet)
}, excel_path)},
times = 10
)
In my case, the parallel version is faster:
Unit: milliseconds
expr min lq mean median uq max neval
lapply 133.44857 167.61801 179.0888 179.84616 194.35048 226.6890 10
parallel 58.94018 64.96452 118.5969 71.42688 80.48588 316.9914 10
The test file contains 6 sheets, each containing this table:
test test1 test3 test4 test5
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4
5 5 5 5 5 5
6 6 6 6 6 6
7 7 7 7 7 7
8 8 8 8 8 8
9 9 9 9 9 9
10 10 10 10 10 10
11 11 11 11 11 11
12 12 12 12 12 12
13 13 13 13 13 13
14 14 14 14 14 14
15 15 15 15 15 15
Note: you can use stopCluster(cl) to shut down the workers when the process is finished.
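For example, a small wrapper (read_all_sheets is just an illustrative name, not part of any package) that sets up the cluster, reads every sheet, and shuts the workers down again even if something fails:
read_all_sheets <- function(excel_path, workers = parallel::detectCores() - 1) {
  cl <- parallel::makeCluster(workers)
  on.exit(parallel::stopCluster(cl), add = TRUE)  # always stop the workers
  sheets <- readxl::excel_sheets(excel_path)
  res <- parallel::parLapplyLB(cl, sheets, function(sheet, excel_path) {
    readxl::read_excel(excel_path, sheet = sheet)
  }, excel_path)
  names(res) <- sheets
  res
}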
I saw @clemens' answer, but since I had already prepared some material, I am posting it anyway. In addition to @clemens' answer, I use larger test data and run a simpler multicore option using furrr::future_map(), which does not give any performance gain in the end...
Data generation
This will create 10 sheets of 10000 x 15 data with a mixture of float, integer, and character columns. On my disk, the file size is 13.2 MB.
library(writexl)
library(tidyverse)
n <- 1e4
# 10 sheets, each built from 5 blocks of (numeric, integer, character) columns
sample_data <- map(seq(10), function(sheet_id) {
  map(1:5, function(i) {
    tibble(
      num_var  = rnorm(n),
      int_var  = as.integer(sample(1e5:9e5, n, replace = TRUE)),
      char_var = sample(letters, n, replace = TRUE)
    ) %>% rename_all(~ paste0(., i))  # suffix column names with the block index
  }) %>% bind_cols()
})
fn <- tempfile(tmpdir = "~/Desktop/temp", fileext = ".xlsx")
write_xlsx(sample_data, path = fn)
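To check the size of the generated file on your own machine (the 13.2 MB above depends on the random data):
file.size(fn) / 1024^2  # size on disk in MB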
Benchmark
The parallel part is borrowed from @clemens.
library(parallel)
library(readxl)
library(purrr)
sheets <- excel_sheets(fn)
cl <- makeCluster(detectCores() - 1)
excel_path <- fn
microbenchmark::microbenchmark(
map = map(sheets, function(x) read_xlsx(fn, sheet = x)),
future_map = furrr::future_map(sheets, function(x) read_xlsx(fn, sheet = x)),
parLapplyLB = {parLapplyLB(cl, sheets, function(sheet, excel_path) {
readxl::read_xlsx(excel_path, sheet = sheet)
}, excel_path)},
times = 10
)
The benchmark result looks like this:
Unit: milliseconds
expr min lq mean median uq max neval
map 1258.2643 1272.2354 1351.8371 1291.2474 1430.211 1556.992 10
future_map 1276.4125 1302.2022 1469.8349 1436.5356 1616.146 1702.494 10
parLapplyLB 809.2697 863.3299 951.1041 914.9503 1014.907 1189.897 10
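One possible reason future_map shows no gain here: assuming no future::plan() was set beforehand, furrr defaults to the sequential plan, so future_map() runs on a single core. A sketch of how to enable parallel workers before benchmarking:
library(future)
plan(multisession, workers = availableCores() - 1)  # start background R sessions
furrr::future_map(sheets, ~ readxl::read_xlsx(fn, sheet = .x))
plan(sequential)  # shut the workers down again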
I have a relatively weak CPU, so the gain may well be larger in other environments; in the end, though, the SQL-writing step will probably be the bottleneck, since reading with read_xlsx is already fast.
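If the downstream step is indeed loading the sheets into a database (an assumption about the workflow; the SQLite file and table names below are only for illustration), a minimal sketch with DBI would be:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "sheets.sqlite")
sheet_data <- lapply(sheets, function(sheet) readxl::read_xlsx(fn, sheet = sheet))
names(sheet_data) <- sheets
for (sheet in sheets) {
  dbWriteTable(con, sheet, sheet_data[[sheet]], overwrite = TRUE)  # one table per sheet
}
dbDisconnect(con)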
Note
I also tried other packages such as gdata and xlsx. They are painfully slow and not worth considering.