Ruby: How to split a file into multiple files of a given size
[Updated] Wrote a short version without any helper variables and put everything in a method:
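def chunker f_in, out_pref, chunksize = 1_073_741_824
  File.open(f_in,"r") do |fh_in|
    until fh_in.eof?
      # The chunk number is derived from the cursor position before each read.
      File.open("#{out_pref}_#{"%05d"%(fh_in.pos/chunksize)}.txt","w") do |fh_out|
        # Copy at most chunksize bytes into the current chunk file.
        fh_out << fh_in.read(chunksize)
      end
    end
  end
end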
chunker "inputfile.txt", "output_prefix"  # optional third argument: desired chunk size in bytes
Instead of looping over lines, you can use .read(length) and loop only on the EOF marker and the file cursor.
This ensures that the chunk files are never bigger than your desired chunk size.
On the other hand, it pays no attention to line breaks (\n), so a line can be cut in the middle!
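To make the line-cutting visible, here is a small sketch that applies the same read-until-EOF loop to an in-memory string. The sample text and the 8-byte chunk size are made up for illustration; StringIO stands in for the input file.

```ruby
require "stringio"

# Hypothetical sample data: three 6-byte lines, 18 bytes in total.
io = StringIO.new("alpha\nbravo\ndelta\n")

chunks = []
# Same idea as in chunker: read fixed-size pieces until EOF is reached.
chunks << io.read(8) until io.eof?

chunks.each { |chunk| p chunk }
# Prints:
# "alpha\nbr"
# "avo\ndelt"
# "a\n"
```

No chunk is longer than 8 bytes, but "bravo" and "delta" are both split across two chunks.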
The chunk file numbers come from integer division of the current file cursor position by chunksize, formatted with "%05d", which yields 5-digit numbers with leading zeros (e.g. 00001).
This only works because .read(chunksize) is used: every read except the last advances the cursor by exactly chunksize. In the second example below, it cannot be used!
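The numbering can be checked in isolation. In this sketch, chunk_name is a hypothetical helper that only reproduces the filename expression from chunker; it is not part of the method above.

```ruby
# Hypothetical helper, only to illustrate the naming scheme used above.
def chunk_name(out_pref, pos, chunksize)
  # Integer division: a cursor position at the start of the n-th chunk maps to n.
  "#{out_pref}_#{"%05d" % (pos / chunksize)}.txt"
end

chunksize = 1_073_741_824
[0, chunksize, 2 * chunksize].each do |pos|
  puts chunk_name("output_prefix", pos, chunksize)
end
# Prints:
# output_prefix_00000.txt
# output_prefix_00001.txt
# output_prefix_00002.txt
```

Note that the first chunk file gets the number 00000, because the cursor is at position 0 when it is opened.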
Update: Splitting with line break recognition
If you really need complete lines ending in \n, then use this modified code snippet:
def chunker f_in, out_pref, chunksize = 1_073_741_824
  outfilenum = 1
  File.open(f_in,"r") do |fh_in|
    until fh_in.eof?
      File.open("#{out_pref}_#{outfilenum}.txt","w") do |fh_out|
        loop do
          line = fh_in.readline
          fh_out << line
          # Stop when another line of the same size would not fit, or at EOF.
          # bytesize (not length) matches the byte count returned by File#size.
          break if fh_out.size > (chunksize-line.bytesize) || fh_in.eof?
        end
      end
      outfilenum += 1
    end
  end
end
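I had to introduce a helper variable line because I want to keep the chunk file size below the chunksize limit. The break condition uses the size of the line just written as an estimate for the next one, so it closes a chunk early rather than overshooting; a chunk can still end up above the limit if the next line is longer than the previous one. If you don't do this extended check, you will regularly get file sizes above the limit, because a plain size check in the loop condition only succeeds in the next iteration step, when the line is already written. (Working with .ungetc or other complex calculations would make the code less readable and no shorter than this example.)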
Unfortunately, you need a second EOF check, because the last chunk will usually be smaller than chunksize, so the size check alone would not stop the inner loop and readline would raise EOFError at the end of the input.
Two helper variables are also needed: line is described above, and outfilenum is needed because the resulting file sizes usually do not match chunksize exactly, so the file number can no longer be derived from the cursor position.
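If the limit has to hold for every chunk, one option is to check before writing and carry the unwritten line over to the next chunk. The following is only a sketch of that idea, not the method from this answer: chunker_strict and its carry-over logic are hypothetical, and a single line longer than chunksize still gets a file of its own that exceeds the limit.

```ruby
# Hypothetical variant of the line-aware chunker; the name and the
# carry-over logic are illustrative assumptions.
def chunker_strict(f_in, out_pref, chunksize = 1_073_741_824)
  outfilenum = 1
  File.open(f_in, "rb") do |fh_in|
    line = fh_in.gets # nil at EOF
    while line
      File.open("#{out_pref}_#{outfilenum}.txt", "wb") do |fh_out|
        written = 0
        # Always write at least one line per chunk so that an oversized
        # line cannot stall the loop; otherwise stop before the limit.
        while line && (written.zero? || written + line.bytesize <= chunksize)
          fh_out << line
          written += line.bytesize
          line = fh_in.gets
        end
      end
      outfilenum += 1
    end
  end
end
```

Because the line that did not fit stays in the variable line, it becomes the first line of the next chunk and nothing has to be pushed back into the input.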
For files of any size, split will be faster than scratch-built Ruby code, even taking the cost of starting a separate executable into account. It's also code that you don't have to write, debug or maintain:
system("split -C 1M -d test.txt ''")
The options are:
-C 1M
Put lines totalling no more than 1M in each chunk
-d
Use decimal suffixes in the output filenames
test.txt
The name of the input file
''
Use a blank output file prefix
Unless you're on Windows, this is the way to go.
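If you call split from a larger script, it may be worth checking the exit status and collecting the names of the files it created. This is a rough sketch, assuming a split on the PATH that supports -C and -d as in the command above; split_lines is a hypothetical wrapper, not an existing API.

```ruby
# Hypothetical wrapper around the split command shown above.
# Assumes a split that understands -C and -d is available on the PATH.
def split_lines(input, prefix, size = "1M")
  # The multi-argument form of system bypasses the shell, so filenames
  # and the prefix need no quoting.
  ok = system("split", "-C", size, "-d", input, prefix)
  raise "split failed for #{input}" unless ok
  # With an empty prefix this would match every file in the current
  # directory, so a non-empty prefix is assumed here.
  Dir.glob("#{prefix}*").sort
end
```

A call such as split_lines("test.txt", "test_part_") would then return the chunk filenames in order, for example test_part_00, test_part_01 and so on.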