Is there a memory-efficient and fast way to load big JSON files in Python?
There was a duplicate of this question with a better answer: see https://stackoverflow.com/a/10382359/1623645, which suggests ijson.
Update:
I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:
import ijson

for prefix, the_type, value in ijson.parse(open(json_file_name, 'rb')):
    print(prefix, the_type, value)
where prefix is a dot-separated index into the JSON tree (what happens if your key names have dots in them? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like starting or ending a map or array.
The project has some docstrings, but not much high-level documentation; I had to dig into ijson/common.py to find what I was looking for.
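For example, here is a minimal sketch using ijson.items(), which streams the elements under a given prefix instead of loading the whole document (the file name and the handle() function are placeholders):

import ijson

with open('data.json', 'rb') as f:
    # 'item' matches each element of a top-level JSON array; only one
    # element is materialized in memory at a time.
    for record in ijson.items(f, 'item'):
        handle(record)  # hypothetical per-record handler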
So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:
Modularize your code. Do something like:

for json_file in list_of_files:
    process_file(json_file)
If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.

Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen. This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
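A minimal sketch of that second approach, assuming a hypothetical worker script process_one.py that parses the single file named on its command line:

import subprocess
import sys

for json_file in list_of_files:
    # Each file is parsed in a fresh interpreter, so all memory used for it
    # is returned to the OS when the child process exits.
    # subprocess.run() is a convenience wrapper around subprocess.Popen.
    subprocess.run([sys.executable, 'process_one.py', json_file], check=True)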
Hope this helps.
Yes.
You can use jsonstreamer, a SAX-like push parser that I have written, which will allow you to parse arbitrarily sized chunks; you can get it here and check out the README for examples. It's fast because it uses the 'C' yajl library.
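A rough sketch of the push-parser usage, based on my reading of the project's README (verify the exact class, method, and event names there; the input chunks here are just an illustration):

from jsonstreamer import JSONStreamer

def on_event(event_name, *args):
    # Called once per parse event with the event name and its arguments.
    print(event_name, args)

streamer = JSONStreamer()
streamer.add_catch_all_listener(on_event)
# Feed the document in arbitrarily sized chunks as they arrive.
streamer.consume('{"fruits": ["apple", "ban')
streamer.consume('ana"], "count": 2}')
streamer.close()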