How to do/workaround a conditional join in python Pandas?
Well, I can think of a few ways:
- essentially blow up the dataframe by just merging on the exact field (
company
)... then filter on the 30-day windows after the merge.
- should be fast but could use lots of memory
- Move the merging and filtering on the 30-day window into a
groupby()
.
- results in a merge for each group, so slower but should use less memory
Option #1
Suppose your data looks like the following (I expanded your sample data):
print df
company date measure
0 0 2010-01-01 10
1 0 2010-01-15 10
2 0 2010-02-01 10
3 0 2010-02-15 10
4 0 2010-03-01 10
5 0 2010-03-15 10
6 0 2010-04-01 10
7 1 2010-03-01 5
8 1 2010-03-15 5
9 1 2010-04-01 5
10 1 2010-04-15 5
11 1 2010-05-01 5
12 1 2010-05-15 5
print windows
company end_date
0 0 2010-02-01
1 0 2010-03-15
2 1 2010-04-01
3 1 2010-05-15
Create a beginning date for the 30 day windows:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print windows
company end_date beg_date
0 0 2010-02-01 2010-01-02
1 0 2010-03-15 2010-02-13
2 1 2010-04-01 2010-03-02
3 1 2010-05-15 2010-04-15
Now do a merge and then select based on if date
falls within beg_date
and end_date
:
df = df.merge(windows,on='company',how='left')
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df
company date measure end_date beg_date
2 0 2010-01-15 10 2010-02-01 2010-01-02
4 0 2010-02-01 10 2010-02-01 2010-01-02
7 0 2010-02-15 10 2010-03-15 2010-02-13
9 0 2010-03-01 10 2010-03-15 2010-02-13
11 0 2010-03-15 10 2010-03-15 2010-02-13
16 1 2010-03-15 5 2010-04-01 2010-03-02
18 1 2010-04-01 5 2010-04-01 2010-03-02
21 1 2010-04-15 5 2010-05-15 2010-04-15
23 1 2010-05-01 5 2010-05-15 2010-04-15
25 1 2010-05-15 5 2010-05-15 2010-04-15
You can compute the 30 day window sums by grouping on company
and end_date
:
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Option #2 Move all merging into a groupby. This should be better on memory but I would think much slower:
windows['beg_date'] = (windows['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
def cond_merge(g,windows):
g = g.merge(windows,on='company',how='left')
g = g[(g.date >= g.beg_date) & (g.date <= g.end_date)]
return g.groupby('end_date')['measure'].sum()
print df.groupby('company').apply(cond_merge,windows)
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another option Now if your windows never overlap (like in the example data), you could do something like the following as an alternative that doesn't blow up a dataframe but is pretty fast:
windows['date'] = windows['end_date']
df = df.merge(windows,on=['company','date'],how='outer')
print df
company date measure end_date
0 0 2010-01-01 10 NaT
1 0 2010-01-15 10 NaT
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 NaT
4 0 2010-03-01 10 NaT
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 NaT
8 1 2010-03-15 5 NaT
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 NaT
11 1 2010-05-01 5 NaT
12 1 2010-05-15 5 2010-05-15
This merge essentially inserts your window end dates into the dataframe and then backfilling the end dates (by group) will give you a structure to easily create you summation windows:
df['end_date'] = df.groupby('company')['end_date'].apply(lambda x: x.bfill())
print df
company date measure end_date
0 0 2010-01-01 10 2010-02-01
1 0 2010-01-15 10 2010-02-01
2 0 2010-02-01 10 2010-02-01
3 0 2010-02-15 10 2010-03-15
4 0 2010-03-01 10 2010-03-15
5 0 2010-03-15 10 2010-03-15
6 0 2010-04-01 10 NaT
7 1 2010-03-01 5 2010-04-01
8 1 2010-03-15 5 2010-04-01
9 1 2010-04-01 5 2010-04-01
10 1 2010-04-15 5 2010-05-15
11 1 2010-05-01 5 2010-05-15
12 1 2010-05-15 5 2010-05-15
df = df[df.end_date.notnull()]
df['beg_date'] = (df['end_date'].values.astype('datetime64[D]') -
np.timedelta64(30,'D'))
print df
company date measure end_date beg_date
0 0 2010-01-01 10 2010-02-01 2010-01-02
1 0 2010-01-15 10 2010-02-01 2010-01-02
2 0 2010-02-01 10 2010-02-01 2010-01-02
3 0 2010-02-15 10 2010-03-15 2010-02-13
4 0 2010-03-01 10 2010-03-15 2010-02-13
5 0 2010-03-15 10 2010-03-15 2010-02-13
7 1 2010-03-01 5 2010-04-01 2010-03-02
8 1 2010-03-15 5 2010-04-01 2010-03-02
9 1 2010-04-01 5 2010-04-01 2010-03-02
10 1 2010-04-15 5 2010-05-15 2010-04-15
11 1 2010-05-01 5 2010-05-15 2010-04-15
12 1 2010-05-15 5 2010-05-15 2010-04-15
df = df[(df.date >= df.beg_date) & (df.date <= df.end_date)]
print df.groupby(['company','end_date']).sum()
measure
company end_date
0 2010-02-01 20
2010-03-15 30
1 2010-04-01 10
2010-05-15 15
Another alternative is to resample your first dataframe to daily data and then compute rolling_sums with a 30 day window; and select the dates at the end that you are interested in. This could be quite memory intensive too.
There is a very easy, and practical (or maybe the only direct way) to do conditional join in pandas. Since there is no direct way to do conditional join in pandas, you will need an additional library, and that is, pandasql
Install the library pandasql
from pip using the command pip install pandasql
. This library allows you to manipulate the pandas dataframes using the SQL queries.
import pandas as pd
from pandasql import sqldf
df = pd.read_excel(r'play_data.xlsx')
df
id Name Amount
0 A001 A 100
1 A002 B 110
2 A003 C 120
3 A005 D 150
Now let's just do a conditional join to compare the Amount of the IDs
# Make your pysqldf object:
pysqldf = lambda q: sqldf(q, globals())
# Write your query in SQL syntax, here you can use df as a normal SQL table
cond_join= '''
select
df_left.*,
df_right.*
from df as df_left
join df as df_right
on
df_left.[Amount] > (df_right.[Amount]+10)
'''
# Now, get your queries results as dataframe using the sqldf object that you created
pysqldf(cond_join)
id Name Amount id Name Amount
0 A003 C 120 A001 A 100
1 A005 D 150 A001 A 100
2 A005 D 150 A002 B 110
3 A005 D 150 A003 C 120