Faster way to accomplish this Pandas job than by using Apply for large data set?
My shot at the problem:
- extract all object_a's under a Parent Name where i) there are >1 object_a's and; ii) the object_a has 0 ticks but the other object_a has >0 ticks. i.e. just the one with zero ticks
- extract all object_b's under a Parent Name where i) there is >=1 object_a and; ii) the object_b has 0 ticks but the object_a has >0 ticks
My first impression when reading this is that the actual "Type" doesn't really matter, we just want an existing object_a
with >0 Ticks for each group, and extract all the elements with 0 ticks, regardless of their type.
Considering that, my approach was first to create a new column to count the number of object_a
ticks for any parent. If this number is >0, it means that at least 1 object_a
exists with Ticks>0.
In [63]: df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())
Out[63]:
Parent Name
3217863 2
4556421 34
dtype: int64
Let's now merge that into the original DataFrame...
In [64]: sumATicks = df.groupby(['Parent Name']).apply(lambda x: x[x['Type'] == 'object_a']['Ticks'].sum())
In [65]: merged = df.merge(pd.DataFrame(sumATicks).rename(columns={0: 'nbATicks'}), left_on='Parent Name', right_index=True)
In [66]: merged
Out[66]:
Type Parent Name Ticks nbATicks
0 object_a 4556421 34 34
1 object_a 4556421 0 34
2 object_b 4556421 0 34
3 object_a 3217863 2 2
4 object_b 3217863 1 2
...and extract all the interesting rows, according to the criteria I stated above:
In [67]: merged[(merged['nbATicks'] > 0) & (merged['Ticks'] == 0)]
Out[67]:
Type Parent Name Ticks nbATicks
1 object_a 4556421 0 34
2 object_b 4556421 0 34
Hopefully I didn't forget any fringe case...
Regarding the chunk problem, why don't you just load the whole csv file in memory ? If it's that big, you can try sorting by ParentName before processing, and splitting the chunks at relevant places.
Here comes my idea for the problem:
I think the first objective is easier because we only depend on rows with object_a. We can use transform to convert the conditions into boolean list:
df_1 = df.loc[df['Type']=='object_a']
object_a = df_1.loc[(df_1.groupby('Parent_Name')['Ticks'].transform(min)==0)&
(df_1.groupby('Parent_Name')['Ticks'].transform(max)>0)&
(a['Ticks']==0)
]
Out[1]:
Type Parent_Name Ticks
1 object_a 4556421 0
For the second objective i create a list of Parent_Names meeting the requirements for object_a. In the next step isin is used to selected only the corresponding rows.
a_condition = df.loc[df['Type']=='object_a'].groupby('Parent_Name').sum()
a_condition = a_condition[a_condition>0].index
object_b = df.loc[(df['Type']=='object_b')&
(df['Ticks']==0)&
(df['Parent_Name'].isin(a_condition))
]
Out[2]:
Type Parent_Name Ticks
2 object_b 4556421 0
In [35]: df
Out[32]:
Type Parent Name Ticks
0 object_a 4556421 34
1 object_a 4556421 0
2 object_b 4556421 0
3 object_a 3217863 2
4 object_b 3217863 1
Aggregate the data into tuple
s
In [33]: df1 = df.groupby(['Parent Name',
'Type']).agg(lambda x: tuple(x)).unstack(1)
In [34]: df1
Out[34]:
Ticks
Type object_a object_b
Parent Name
3217863 (2,) (1,)
4556421 (34, 0) (0,)
Build the Boolean mask for your case #1
In [35]: mask1 = df1.apply(lambda x: (len(x[0])>1) & ((x[0]).count(0)==1),
axis=1)
In [36]: mask1
Out[36]:
Parent Name
3217863 False
4556421 True
dtype: bool
Build Boolean mask for your case #2
In [37]: mask2 = df1.apply(lambda x: ((len(x[0])>=1) &
(len(set(x[0]).difference([0]))>0) &
(len(x[1])==1) &
(x[1][0]==0)),
axis=1)
In [38]: mask2
Out[38]:
Parent Name
3217863 False
4556421 True
dtype: bool
Get the result for case #1
In [39]: df1.loc[mask1, [('Ticks', 'object_a')]]
Out[39]:
Ticks
Type object_a
Parent Name
4556421 (34, 0)
Get the result for case #2
In [30]: df1.loc[mask2, [('Ticks', 'object_b')]]
Out[30]:
Ticks
Type object_b
Parent Name
4556421 (0,)