Pull large amounts of data from a remote server into a DataFrame
I suspect there are a couple of (related) things at play here causing the slowness:

- `read_sql` is written in python so it's a little slow (especially compared to `read_csv`, which is written in cython - and carefully implemented for speed!), and it relies on sqlalchemy rather than some (potentially much faster) C-DBAPI. The impetus to move to sqlalchemy was to make that move easier in the future (as well as cross-sql-platform support). A minimal engine setup is sketched just after this list.
- You may be running out of memory because too many python objects are held in memory at once (this is related to not using a C-DBAPI), but that could potentially be addressed...
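For reference, here is roughly how `con` could be created for the snippets below (a sketch - the connection string is a placeholder, substitute your own host, database and credentials):

```
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string - replace user/password/host/dbname with your own.
con = create_engine("postgresql://user:password@localhost:5432/mydb")
```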
I think the immediate solution is a chunk-based approach (and there is a feature request to have this work natively in pandas `read_sql` and `read_sql_table`).
EDIT: As of Pandas v0.16.2 this chunk-based approach is natively implemented in `read_sql`.
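With the native support, the whole thing boils down to something like this (a sketch - the query and chunk size are just examples):

```
# Passing chunksize makes read_sql return an iterator of DataFrames,
# so only one chunk's worth of rows is materialised at a time.
chunks = pd.read_sql("SELECT * FROM users", con, chunksize=10000)
df = pd.concat(chunks, ignore_index=True)  # or process each chunk as it arrives
```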
Since you're using postgres you have access to the LIMIT and OFFSET clauses, which makes chunking quite easy. (Am I right in thinking these aren't available in all sql dialects?)
First, get the number of rows (or an estimate) in your table:
```
nrows = con.execute('SELECT count(*) FROM users').fetchone()[0]  # also works with an sqlalchemy engine
```
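If an exact count is too slow on a very large table, postgres keeps a row-count estimate in its catalog; something like this gives a rough figure (a sketch - it's only as fresh as the last ANALYZE/autovacuum on the table, and 'users' is just the example table name):

```
# Cheap, approximate row count from PostgreSQL's catalog statistics.
nrows = con.execute(
    "SELECT reltuples::bigint FROM pg_class WHERE relname = 'users'"
).fetchone()[0]
```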
Use this to iterate through the table (for debugging you could add some print statements to confirm it's still working and hasn't crashed!) and then combine the results:
```
def read_sql_chunked(query, con, nrows, chunksize=1000):
    start = 0  # OFFSET is zero-based, so start at 0 to include the first row
    dfs = []   # Note: could probably make this neater with a generator/for loop
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)
        dfs.append(df)
        start += chunksize  # advance to the next chunk
    return pd.concat(dfs, ignore_index=True)
```
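As the comment hints, the same thing reads a little more neatly as a generator (a sketch, same assumptions as above):

```
def iter_sql_chunks(query, con, nrows, chunksize=1000):
    # Yield one DataFrame per chunk instead of collecting them in a list.
    for start in range(0, nrows, chunksize):
        yield pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), con)

df = pd.concat(iter_sql_chunks("SELECT * FROM users", con, nrows), ignore_index=True)
```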
Note: this assumes that the full result fits in memory! If it doesn't, you'll need to work on each chunk as you go (mapreduce style)... or invest in more memory!
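For example, a per-chunk reduction never holds more than one chunk at a time (a sketch - the 'age' column and the counting are placeholders for whatever aggregation you actually need):

```
from collections import Counter

# Aggregate each chunk as it arrives instead of concatenating everything,
# e.g. count rows per value of a hypothetical 'age' column.
counts = Counter()
for chunk in pd.read_sql("SELECT * FROM users", con, chunksize=10000):
    counts.update(chunk["age"].value_counts().to_dict())
```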