Strategies for reading in CSV files in pieces?

After reviewing this thread I noticed a conspicuous solution to this problem was not mentioned. Use connections!

1) Open a connection to your file

con = file("file.csv", "r")

2) Read in chunks of data with read.csv

read.csv(con, nrows = CHUNK_SIZE, ...)

Side note: defining colClasses will greatly speed things up. Set the colClasses entry for any unwanted column to "NULL" so that column is skipped.

3) Do whatever you need to do

4) Repeat.

5) Close the connection

close(con)
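The five steps above can be sketched as one loop. This is a minimal sketch, not a definitive implementation: the demo file, the column names (id, value, note) and chunk_size are assumptions made up for illustration. Because only the first line of the connection is a header, the sketch reads it once with readLines and passes header = FALSE and col.names to every read.csv call.

```r
# Build a small demo file so the sketch is self-contained (hypothetical data).
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, value = (1:25) / 2, note = "x"),
          path, row.names = FALSE)

chunk_size <- 10                 # rows per chunk (assumed; tune to your RAM)
con <- file(path, "r")           # 1) open the connection once

# The first line is the header; consume it once and keep the column names.
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]
col_names <- gsub('"', "", col_names)

total_rows <- 0
value_sum  <- 0
repeat {
  # 2) read the next chunk; the connection remembers where we stopped
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names,
             colClasses = c("integer", "numeric", "NULL")),  # "NULL" drops 'note'
    error = function(e) NULL     # read.csv errors once no lines are left
  )
  if (is.null(chunk) || nrow(chunk) == 0) break

  # 3) do whatever you need with the chunk
  total_rows <- total_rows + nrow(chunk)
  value_sum  <- value_sum + sum(chunk$value)

  # 4) repeat, unless a short chunk tells us we reached the end of the file
  if (nrow(chunk) < chunk_size) break
}
close(con)                       # 5) close the connection
```

Note that the tryCatch here treats any read error as end of input, which is convenient for a sketch but would also hide a genuinely malformed chunk.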

The advantage of this approach is the connection itself. If you skip step 1 and pass the file name directly, read.csv has to open and close the file on every call, which will likely slow things down. By opening a connection manually, you open the data set and do not close it until you call close(). This means that as you loop through the data set you never lose your place. Imagine that you have a data set with 1e7 rows and you want to load it in chunks of 1e5 rows. Since the connection stays open, we get the first 1e5 rows by running read.csv(con, nrows=1e5,...), then we get the second chunk by running read.csv(con, nrows=1e5,...) again, and so on. (Only the first call sees the header line, so later calls need header=FALSE and explicit col.names.)

If we did not use a connection we would get the first chunk the same way, read.csv("file.csv", nrows=1e5,...), but for the next chunk we would need read.csv("file.csv", skip=1e5, nrows=1e5,...), because nrows is the number of rows to read, not the number of the last row. (If the file has a header line, the skip count grows by one and header=FALSE is needed.) Clearly this is inefficient. We have to find row 1e5+1 all over again, despite the fact that we just read row 1e5.
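For comparison, here is a rough sketch of the connection-free approach. The helper read_chunk_by_skip, the demo file and chunk_size are hypothetical names introduced only for this illustration. Every call re-opens the file and scans past all earlier rows before it reaches the chunk it wants.

```r
# Small demo file (hypothetical data) so the sketch runs on its own.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, value = (1:25) / 2), path, row.names = FALSE)

chunk_size <- 10
col_names  <- names(read.csv(path, nrows = 1))   # read the header once

# Hypothetical helper: chunk i starts after the header line plus
# (i - 1) full chunks, so the skip count grows with every chunk.
read_chunk_by_skip <- function(i) {
  read.csv(path, skip = 1 + (i - 1) * chunk_size, nrows = chunk_size,
           header = FALSE, col.names = col_names)
}

second <- read_chunk_by_skip(2)   # re-opens the file and scans past rows 1..10
third  <- read_chunk_by_skip(3)   # re-opens it again and scans past rows 1..20
```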

Finally, data.table::fread is great, but you cannot pass it a connection, so this approach does not work with it.

I hope this helps someone.

UPDATE

People keep upvoting this post so I thought I would add one more brief thought. The new readr::read_csv, like read.csv, can be passed connections. However, it is advertised as being roughly 10x faster.

You could read it into a database using RSQLite, say, and then use an SQL statement to get a portion.

If you need only a single portion then read.csv.sql in the sqldf package will read the data into an SQLite database. First, it creates the database for you, and the data does not go through R, so the limitations of R (primarily RAM in this scenario) won't apply. Second, after loading the data into the database, sqldf reads the output of a specified SQL statement into R. Finally, it destroys the database. Depending on how fast it works with your data, you might be able to just repeat the whole process for each portion if you have several.

Only one line of code accomplishes all three steps, so it's a no-brainer to just try it.

DF <- read.csv.sql("myfile.csv", sql=..., ...other args...)

See ?read.csv.sql and ?sqldf and also the sqldf home page.
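As a concrete illustration of the call above, here is a minimal sketch that assumes the sqldf package is installed. The demo file and its columns (id, score) are made up for the example. Inside the SQL statement the CSV is referred to as the table named file.

```r
library(sqldf)   # assumes the sqldf package is installed

# Small demo file (hypothetical data). quote = FALSE keeps the file free of
# quote characters, which read.csv.sql may not strip from fields.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, score = (1:25) / 2), path,
          quote = FALSE, row.names = FALSE)

# Load the file into a temporary SQLite database, pull back only the rows
# that match the query, and let sqldf drop the database afterwards.
DF <- read.csv.sql(path, sql = "select * from file where id > 20")
```

To fetch several portions, the same call could be repeated with a different where clause each time, at the cost of reloading the file into SQLite on every call.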
