Select columns using pandas dataframe.query()
After playing around with this for a while and reading through the source code for DataFrame.query
, I can't figure out a way to do it.
If it's not impossible, apparently it's at least strongly discouraged. When this question came up on github, prolific Pandas dev/maintainer jreback suggested using df.eval()
for selecting columns and df.query()
for filtering on rows.
UPDATE:
javadba points out that the return value of eval
is not a dataframe. For example, to flesh out jreback's example a bit more...
df.eval('A')
returns a Pandas Series, but
df.eval(['A', 'B'])
does not return at DataFrame, it returns a list (of Pandas Series).
So it seems ultimately the best way to maintain flexibility to filter on rows and columns is to use iloc
/loc
, e.g.
df.loc[0:4, ['A', 'C']]
output
A C
0 -0.497163 -0.046484
1 1.331614 0.741711
2 1.046903 -2.511548
3 0.314644 -0.526187
4 -0.061883 -0.615978
Dataframe.query
is more like the where clause in a SQL statement than the select part.
import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
To select a column or columns you can use the following:
df['A'] or df.loc[:,'A']
or
df[['A','B']] or df.loc[:,['A','B']]
To use the .query
method you do something like
df.query('A > B')
which would return all the rows where the value in column A is greater than the value in column b.
A B C D
2000-01-03 1.265936 -0.866740 -0.678886 -0.094709
2000-01-04 1.491390 -0.638902 -0.443982 -0.434351
2000-01-05 2.205930 2.186786 1.004054 0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589
Which is more readable in my opinion that boolean index selection with
df[df['A'] > df['B']]