Reduce computing time for reshape
Benchmarks Summary:
Using stack (as suggested by @AnandaMahto) is definitely the way to go for smaller data sets (N < 3,000). As the data sets get larger, data.table begins to outperform stack.
Here is an option using data.table:
library(data.table)
dtt <- data.table(data)
# non-value columns, i.e., the columns to keep post reshape
nvc <- c("Name", "Code", "CURRENCY")
# names of the columns being transformed
dateCols <- setdiff(names(data), nvc)
# use rbindlist to combine the per-column subsets
dtt2 <- rbindlist(lapply(dateCols, function(d) {
  dtt[, Date := d]
  cols <- c(nvc, "Date", d)
  setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))
}))
## Results:
dtt2
# Name Code CURRENCY Date value
# 1: Abengoa 4256 USD X_01_01_1980 1.53
# 2: Adidas 6783 USD X_01_01_1980 0.23
# 3: Abengoa 4256 USD X_02_01_1980 1.54
# 4: Adidas 6783 USD X_02_01_1980 0.54
# 5: ... <cropped>
Updated Benchmarks with larger sample data
As per the suggestion from @AnandaMahto, below are benchmarks using a larger sample data set. Please feel free to improve any of the methods used below and/or add new methods.
Benchmarks
Resh <- quote(reshape::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
Resh2 <- quote(reshape2::melt(data,id=c("Name","Code", "CURRENCY"),variable_name="Date"))
DT <- quote({ nvc <- c("Name","Code", "CURRENCY"); dateCols <- setdiff(names(data), nvc); rbindlist(lapply(dateCols, function(d) { dtt[, Date := d]; cols <- c(nvc, "Date", d); setnames(dtt[, cols, with=FALSE], cols, c(nvc, "Date", "value"))}))})
Stack <- quote(data.frame(data[1:3], stack(data[-c(1, 2, 3)])))
# SAMPLE SIZE: ROWS = 900; COLS = 380 + 3;
library(rbenchmark)
dtt <- data.table(data);
benchmark(Resh=eval(Resh),Resh2=eval(Resh2),DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 Stack 0.813 0.623 0.192 5
# 2.530 DT 2.057 2.035 0.026 5
# 40.470 Resh 32.902 18.410 14.602 5
# 40.578 Resh2 32.990 18.419 14.728 5
# SAMPLE SIZE: ROWS = 3,500; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.00 DT 2.407 2.336 0.076 5
# 1.08 Stack 2.600 1.626 0.983 5
# SAMPLE SIZE: ROWS = 27,000; COLS = 380 + 3;
dtt <- data.table(data);
benchmark(DT=eval(DT), Stack=eval(Stack), replications=5, columns=c("relative", "test", "elapsed", "user.self", "sys.self", "replications"), order="relative")
# relative test elapsed user.self sys.self replications
# 1.000 DT 10.450 7.418 3.058 5
# 2.232 Stack 23.329 14.180 9.266 5
Sample Data Creation
# rm(list=ls(all=TRUE))
set.seed(1)
LLLL <- apply(expand.grid(LETTERS, LETTERS[10:15], LETTERS[1:20], LETTERS[1:5], stringsAsFactors=FALSE), 1, paste0, collapse="")
size <- 900
dateSamples <- 380
startDate <- as.Date("1980-01-01")
Name <- apply(matrix(LLLL[1:(2*size)], ncol=2), 1, paste0, collapse="")
Code <- sample(1e3:max(1e4-1, size+1e3), length(Name))
CURRENCY <- sample(c("USD", "EUR", "YEN"), length(Name), TRUE)
Dates <- seq(startDate, length.out=dateSamples, by="mon")
Values <- sample(c(1:1e2, 1:5e2), size=size*dateSamples, TRUE) / 1e2
# Calling the sample data frame `data` for consistency, but I don't like this practice
data <- data.frame(Name, Code, CURRENCY,
matrix(Values, ncol=length(Dates), dimnames=list(c(), as.character(Dates)))
)
data[1:6, 1:8]
# Name Code CURRENCY X1980.01.01 X1980.02.01 X1980.03.01 X1980.04.01 X1980.05.01
# 1 AJAAQNFA 3389 YEN 0.37 0.33 3.58 4.33 1.06
# 2 BJAARNFA 4348 YEN 1.14 2.69 2.57 0.27 3.02
# 3 CJAASNFA 6154 USD 2.47 3.72 3.32 0.36 4.85
# 4 DJAATNFA 9171 USD 2.22 2.48 0.71 0.79 2.85
# 5 EJAAUNFA 2814 USD 2.63 2.17 1.66 0.55 3.12
# 6 FJAAVNFA 9081 USD 1.92 1.47 3.51 3.23 3.68
From the question:
data <- read.csv("data.csv")
and
... for .csv files that are about 50MB big, it already takes about two hours ...
So although stack/melt/reshape comes into play, I'm guessing (since this is your first ever S.O. question) that the biggest factor here is read.csv, assuming you're including that in your timing as well as melt (it isn't clear).
Default arguments to read.csv are well known to be slow. A few quick searches should reveal hints and tips (e.g. stringsAsFactors and colClasses, sketched briefly after the links below) such as:
- http://cran.r-project.org/doc/manuals/R-data.html
- Quickly reading very large tables as dataframes
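As a rough illustration of those two arguments (not from the original answer; the 3 + 380 column layout is an assumption based on the sample data created further down):
# Hedged sketch: pre-specifying colClasses lets read.csv skip type guessing,
# and stringsAsFactors = FALSE avoids building factors for the text columns.
# The three id columns plus 380 numeric columns are assumed, not taken from the real file.
classes <- c("character", "integer", "character", rep("numeric", 380))
data <- read.csv("data.csv", colClasses = classes, stringsAsFactors = FALSE)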
But I'd suggest fread (in data.table since 1.8.7). To get a feel for fread, its manual page is here:
https://www.rdocumentation.org/packages/data.table/versions/1.12.2/topics/fread
The examples section there, as it happens, has a 50MB example shown to be read in 3 seconds instead of up to 60. And benchmarks are starting to appear in other answers which is great to see.
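A minimal sketch of swapping read.csv for fread (fread detects the separator and column types on its own, so the call is about as simple as it gets):
library(data.table)
# Hedged sketch: fread auto-detects separator and column types and returns a
# data.table; pass data.table = FALSE if you want a plain data.frame instead.
data <- fread("data.csv")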
Then the stack/reshape/melt answers are next order, if I guessed correctly.
While the testing is going on, I'll post my comment as an answer for you to consider. Try using stack as in:
data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))
In many cases, stack works really efficiently with these types of operations, and adding back in the first few columns also works quickly because of how vectors are recycled in R.
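One small detail not mentioned above: stack names its two output columns values and ind, so a quick rename (a hedged sketch, base R only) lines the result up with the melt/data.table output shown earlier:
data1 <- data.frame(data[1:3], stack(data[-c(1, 2, 3)]))
# stack() names its output columns "values" and "ind";
# rename them to match the long-format output above
names(data1)[names(data1) == "values"] <- "value"
names(data1)[names(data1) == "ind"] <- "Date"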
For that matter, this might also be worth considering:
data.frame(data[1:3],
           vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
           date = rep(names(data)[-c(1, 2, 3)], each = nrow(data)))
I'm cautious about benchmarking on such a small sample of data, though, because I suspect the results won't be quite comparable to benchmarking on your actual dataset.
Update: Results of some more benchmarks
Using @RicardoSaporta's benchmarking procedure, I have benchmarked data.table against what I've called "Manual" data.frame creation, on datasets ranging from 1,000 rows to 3,000 rows in 500-row increments, all with 8,003 columns (8,000 data columns plus the three initial columns). The results can be seen here: http://rpubs.com/mrdwab/reduce-computing-time
Ricardo's correct: there seems to be something about 3,000 rows that makes a huge difference with the base R approaches (and it would be interesting if anyone has an explanation for what that might be). But this "Manual" approach is actually even faster than stack, if performance really is the primary concern.
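For reference, DT below is the quoted data.table expression from the earlier benchmarks. The Manual expression and the makeSomeData() helper aren't shown in this answer, so here is a hedged reconstruction: Manual is assumed to quote the base-R data.frame call above, and makeSomeData()'s body is a guess modeled on the "Sample Data Creation" code (only its name and arguments come from the calls below).
# Assumed: "Manual" quotes the base-R data.frame construction shown earlier
Manual <- quote(data.frame(data[1:3],
                           vals = as.vector(as.matrix(data[-c(1, 2, 3)])),
                           date = rep(names(data)[-c(1, 2, 3)], each = nrow(data))))
# Assumed: makeSomeData(rows, cols) builds data like "Sample Data Creation",
# with `rows` rows and `cols` numeric value columns after the three id columns
makeSomeData <- function(rows, cols) {
  Name     <- paste0("Name", seq_len(rows))
  Code     <- sample(1e3:1e5, rows)
  CURRENCY <- sample(c("USD", "EUR", "YEN"), rows, TRUE)
  Values   <- matrix(runif(rows * cols, 0, 5), nrow = rows,
                     dimnames = list(NULL, paste0("X", seq_len(cols))))
  data.frame(Name, Code, CURRENCY, Values)
}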
Here are the results for just the last three runs:
data <- makeSomeData(2000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 0.908 0.696 0.108 1
## 1 3.963 DT 3.598 3.564 0.012 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(2500, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 2 1.000 Manual 2.841 1.044 0.296 1
## 1 1.694 DT 4.813 4.661 0.080 1
rm(data, dateCols, nvc, dtt)
data <- makeSomeData(3000, 8000)
dtt <- data.table(data)
suppressWarnings(benchmark(DT = eval(DT), Manual = eval(Manual), replications = 1,
columns = c("relative", "test", "elapsed", "user.self", "sys.self", "replications"),
order = "relative"))
## relative test elapsed user.self sys.self replications
## 1 1.00 DT 7.223 5.769 0.112 1
## 2 29.27 Manual 211.416 1.560 0.952 1
Ouch! data.table really turns the tables on that last run!