How can I process huge JSON files as streams in Ruby, without consuming all memory?
Problem
json = Yajl::Parser.parse(file_stream)
When you invoke Yajl::Parser like this, the entire stream is loaded into memory to create your data structure. Don't do that.
Solution
Yajl provides Parser#parse_chunk, Parser#on_parse_complete, and other related methods that enable you to trigger parsing events on a stream without requiring that the whole IO stream be parsed at once. The README contains an example of how to use chunking instead.
The example given in the README is:
Or let's say you didn't have access to the IO object that contained JSON data, but instead only had access to chunks of it at a time. No problem!
(Assume we're in an EventMachine::Connection instance)
def post_init
  @parser = Yajl::Parser.new(:symbolize_keys => true)
end

def object_parsed(obj)
  puts "Sometimes one pays most for the things one gets for nothing. - Albert Einstein"
  puts obj.inspect
end

def connection_completed
  # once a full JSON object has been parsed from the stream
  # object_parsed will be called, and passed the constructed object
  @parser.on_parse_complete = method(:object_parsed)
end

def receive_data(data)
  # continue passing chunks
  @parser << data
end
Or if you don't need to stream it, it'll just return the built object from the parse when it's done. NOTE: if there are going to be multiple JSON strings in the input, you must specify a block or callback as this is how yajl-ruby will hand you (the caller) each object as it's parsed off the input.
obj = Yajl::Parser.parse(str_or_io)
One way or another, you have to parse only a subset of your JSON data at a time. Otherwise, you are simply instantiating a giant Hash in memory, which is exactly the behavior you describe.
Without knowing what your data looks like and how your JSON objects are composed, it isn't possible to give a more detailed explanation than that; as a result, your mileage may vary. However, this should at least get you pointed in the right direction.
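Outside of EventMachine, the same idea works against a plain file handle. Here is a minimal sketch (the file name and chunk size are made up for illustration) that assumes the input is a stream of top-level JSON objects, so on_parse_complete fires once per object and only one object is held in memory at a time:

require 'yajl'

parser = Yajl::Parser.new(:symbolize_keys => true)

# called once for each fully parsed top-level JSON object
parser.on_parse_complete = lambda { |obj| puts obj.inspect }

File.open('huge.json') do |io|
  # feed the parser fixed-size chunks; parse_chunk is aliased as <<
  parser << io.read(4096) until io.eof?
end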
Both @CodeGnome's and @A. Rager's answers helped me understand the solution.
I ended up creating the gem json-streamer, which offers a generic approach and spares you from manually defining callbacks for every scenario.
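For the record, a minimal usage sketch based on the gem's README (the file name, chunk size, and nesting level are illustrative; check the gem's documentation for the current API):

require 'json/streamer'

File.open('huge.json') do |file|
  streamer = Json::Streamer.parser(file_io: file, chunk_size: 1024)

  # yields every object found at the given nesting level,
  # without building the whole document in memory
  streamer.get(nesting_level: 1) do |object|
    puts object.inspect
  end
end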
Your solutions seem to be json-stream and yajl-ffi. Both have examples that are pretty similar (and they're from the same author):
def post_init
  @parser = Yajl::FFI::Parser.new
  @parser.start_document { puts "start document" }
  @parser.end_document { puts "end document" }
  @parser.start_object { puts "start object" }
  @parser.end_object { puts "end object" }
  @parser.start_array { puts "start array" }
  @parser.end_array { puts "end array" }
  @parser.key {|k| puts "key: #{k}" }
  @parser.value {|v| puts "value: #{v}" }
end

def receive_data(data)
  begin
    @parser << data
  rescue Yajl::FFI::ParserError => e
    close_connection
  end
end
There, he sets up callbacks for all the data events that the stream parser can emit.
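For comparison, json-stream exposes essentially the same callbacks; a minimal sketch per its README (verify against the version you use):

require 'json/stream'

parser = JSON::Stream::Parser.new
parser.start_object { puts "start object" }
parser.end_object { puts "end object" }
parser.key { |k| puts "key: #{k}" }
parser.value { |v| puts "value: #{v}" }

# feed the file to the parser in chunks instead of all at once
File.open('dudes.json') do |f|
  f.each(1024) { |chunk| parser << chunk }
end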
Given a JSON document that looks like this:
{
  "1": {
    "name": "fred",
    "color": "red",
    "dead": true
  },
  "2": {
    "name": "tony",
    "color": "six",
    "dead": true
  },
  ...
  "n": {
    "name": "erik",
    "color": "black",
    "dead": false
  }
}
One could stream-parse it with yajl-ffi like this:
require 'yajl/ffi'

def parse_dudes file_io, chunk_size
  parser = Yajl::FFI::Parser.new
  object_nesting_level = 0
  current_row = {}
  current_key = nil

  parser.start_object { object_nesting_level += 1 }
  parser.end_object do
    if object_nesting_level.eql? 2
      yield current_row # here, we yield the fully collected record to the passed block
      current_row = {}
    end
    object_nesting_level -= 1
  end

  parser.key do |k|
    if object_nesting_level.eql? 2
      current_key = k
    elsif object_nesting_level.eql? 1
      current_row["id"] = k
    end
  end

  parser.value { |v| current_row[current_key] = v }

  # feed the file to the parser in chunks rather than all at once
  file_io.each(chunk_size) { |chunk| parser << chunk }
end
File.open('dudes.json') do |f|
  parse_dudes f, 1024 do |dude|
    pp dude
  end
end