Ruby - Read file in batches
there's no universal way.
1) you can read file by chunks:
File.open('filename','r') do |f|
chunk = f.read(2048)
...
end
disadvantage: you can miss a substring if it'd be between chunks, i.e. you look for "SOME_TEXT", but "SOME_" is a last 5 bytes of 1st 2048-byte chunk, and "TEXT" is a 4 bytes of 2nd chunk
2) you can read file line-by-line
File.open('filename','r') do |f|
line = f.gets
...
end
disadvantage: this way it'd be 2x..5x slower than first method
With Lazy Enumerators and each_slice, you can get the best of both worlds. You don't need to worry about cutting lines in the middle, and you can iterate over multiple lines in a batch. batch_size
can be chosen freely.
header_lines = 1
batch_size = 2000
File.open("big_file") do |file|
file.lazy.drop(header_lines).each_slice(batch_size) do |lines|
# do something with batch of lines
end
end
It could be used to import a huge CSV file into a database:
require 'csv'
batch_size = 2000
File.open("big_data.csv") do |file|
headers = file.first
file.lazy.each_slice(batch_size) do |lines|
csv_rows = CSV.parse(lines.join, headers: headers)
# do something with 2000 csv rows, e.g. bulk insert them into a database
end
end