Executing an SQL query over a pandas dataset
Much better solution is to use duckdb. It is much faster than sqldf because it does not have to load the entire data into sqlite and load back to pandas.
pip install duckdb
import pandas as pd
import duckdb
test_df = pd.DataFrame.from_dict({"i":[1, 2, 3, 4], "j":["one", "two", "three", "four"]})
duckdb.query("SELECT * FROM test_df where i>2").df() # returns a result dataframe
Performance improvement over pandasql: test data NYC yellow cabs ~120mb of csv data
nyc = pd.read_csv('https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv',low_memory=False)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
pysqldf("SELECT * FROM nyc where trip_distance>10")
# wall time 16.1s
duckdb.query("SELECT * FROM nyc where trip_distance>10").df()
# wall time 183ms
A improvement of speed of roughly 100x
This article gives good details and claims 1000x improvement over pandasql: https://duckdb.org/2021/05/14/sql-on-pandas.html
You can use DataFrame.query(condition)
to return a subset of the data frame matching condition
like this:
df = pd.DataFrame(np.arange(9).reshape(3,3), columns=list('ABC'))
df
A B C
0 0 1 2
1 3 4 5
2 6 7 8
df.query('C < 6')
A B C
0 0 1 2
1 3 4 5
df.query('2*B <= C')
A B C
0 0 1 2
df.query('A % 2 == 0')
A B C
0 0 1 2
2 6 7 8
This is basically the same effect as an SQL statement, except the SELECT * FROM df WHERE
is implied.
After some time of using this I realised the easiest way is to just do
from pandasql import sqldf
output = sqldf("select * from df")
Works like a charm where df
is a pandas dataframe
You can install pandasql: https://pypi.org/project/pandasql/
This is not what pandas.query
is supposed to do. You can look at package pandasql
(same like sqldf
in R )
import pandas as pd
import pandasql as ps
df = pd.DataFrame([[1234, 'Customer A', '123 Street', np.nan],
[1234, 'Customer A', np.nan, '333 Street'],
[1233, 'Customer B', '444 Street', '333 Street'],
[1233, 'Customer B', '444 Street', '666 Street']], columns=
['ID', 'Customer', 'Billing Address', 'Shipping Address'])
q1 = """SELECT ID FROM df """
print(ps.sqldf(q1, locals()))
ID
0 1234
1 1234
2 1233
3 1233
Update 2020-07-10
update the
pandasql
ps.sqldf("select * from df")