How to get rolling pandas dataframe subsets
The trick is to define a function that has access to your entire dataframe. Then call rolling() on any column and pass that function to apply(). The function receives the window data, which is a subset of that dataframe column. From that subset you can read the index labels that the window covers. (This assumes that your index is strictly increasing, so the usual integer index will work, as will most time series.) You can then use those labels to select the matching rows of the entire dataframe, with all of its columns.
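def dataframe_roll(df):
    def my_fn(window_series):
        # Select the rows of the full dataframe covered by this window.
        window_df = df[(df.index >= window_series.index[0]) & (df.index <= window_series.index[-1])]
        # apply() must return a single number, so reduce the window (here with sum()).
        return (window_df["col1"] + window_df["col2"]).sum()
    return my_fn

df["result"] = df["any_col"].rolling(24).apply(dataframe_roll(df), raw=False)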
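For a version you can run end to end, here is a minimal sketch of the same closure idea. The sample data, the names make_window_fn and window_fn, and the choice to sum col1 * col2 per window are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data; col1 and col2 stand in for your real columns.
df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [10, 20, 30, 40]})

def make_window_fn(frame):
    def window_fn(window_series):
        # The first and last index labels of the window select the same
        # rows from the full frame (label slicing with .loc is inclusive).
        window_df = frame.loc[window_series.index[0]:window_series.index[-1]]
        # rolling.apply needs a scalar, so reduce the window to one number.
        return float((window_df["col1"] * window_df["col2"]).sum())
    return window_fn

# raw=False makes pandas pass each window as a Series that keeps its index.
result = df["col1"].rolling(2).apply(make_window_fn(df), raw=False)
print(result)
```

The first entry is NaN because a window of 2 is not yet full at the first row.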
Here's how you get dataframe subsets in a rolling manner:
for df_subset in df.rolling(2):
    print(type(df_subset), '\n', df_subset)
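As a rough sketch of how the loop can be used for a calculation that spans columns, the example below collects one value per window. It assumes pandas 1.1 or later (where Rolling objects are iterable); the columns a and b and the product-sum are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

sizes = []
totals = []
# Each iteration yields a DataFrame holding the rows of one window.
# Windows at the start are shorter than the requested size.
for window_df in df.rolling(2):
    sizes.append(len(window_df))
    totals.append(int((window_df["a"] * window_df["b"]).sum()))

print(sizes)
print(totals)
```

Because the leading windows are incomplete, you may want to skip any window whose length is below the window size.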
updated comment
@unutbu posted a great answer to a very similar question here, but it appears that his answer is based on pd.rolling_apply, which passes the index to the function. I'm not sure how to replicate this with the current DataFrame.rolling.apply method.
original answer
It appears that the variable passed to your function by apply is a numpy array of each column (one at a time) and not a DataFrame, so unfortunately you do not have access to any other columns.
But what you can do is use some boolean logic to temporarily create a new column based on whether var2 is 74 or not, and then use the rolling method.
df['new_var'] = df.var2.eq(74).mul(df.var1).rolling(2, min_periods=1).sum()
   var1  var2  new_var
0    43    74     43.0
1    44    74     87.0
2    45    66     44.0
3    46   268      0.0
4    47    66      0.0
The temporary column is based on the first half of the code above.
df.var2.eq(74).mul(df.var1)
# or equivalently with operators
# (df['var2'] == 74) * df['var1']
0 43
1 44
2 0
3 0
4 0
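To try this yourself, the sketch below rebuilds the DataFrame from the values shown in the output above and computes the same column. Using Series.where for the mask is a swapped-in alternative to the eq(...).mul(...) form, chosen here only because it reads a little more explicitly.

```python
import pandas as pd

# Values copied from the output table above.
df = pd.DataFrame({"var1": [43, 44, 45, 46, 47],
                   "var2": [74, 74, 66, 268, 66]})

# Keep var1 where var2 is 74 and replace it with 0 everywhere else.
masked = df["var1"].where(df["var2"] == 74, 0)

# Rolling sum over the masked values; min_periods=1 keeps the first row.
df["new_var"] = masked.rolling(2, min_periods=1).sum()
print(df)
```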
Finding the type of the variable passed to apply
It's very important to know what is actually being passed to the apply function. I can't always remember, so when I am unsure I print out the variable along with its type to make it clear what object I am dealing with. See this example with your original DataFrame.
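def foo(x):
    print(x)
    print(type(x))
    return x.sum()

# raw=True passes each window as a numpy array, matching the output below;
# recent pandas versions default to raw=False and pass a Series instead.
df.rolling(2, min_periods=1).apply(foo, raw=True)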
Output
[ 43.]
<class 'numpy.ndarray'>
[ 43. 44.]
<class 'numpy.ndarray'>
[ 44. 45.]
<class 'numpy.ndarray'>
[ 45. 46.]
<class 'numpy.ndarray'>
[ 46. 47.]
<class 'numpy.ndarray'>
[ 74.]
<class 'numpy.ndarray'>
[ 74. 74.]
<class 'numpy.ndarray'>
[ 74. 66.]
<class 'numpy.ndarray'>
[ 66. 268.]
<class 'numpy.ndarray'>
[ 268. 66.]
<class 'numpy.ndarray'>
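The same inspection trick can be used to compare the two settings of the raw parameter. The sketch below records the types seen instead of printing them; the helper name record and the shortened sample data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical version of the DataFrame used above.
df = pd.DataFrame({"var1": [43, 44, 45], "var2": [74, 74, 66]})

seen = {"raw": set(), "not_raw": set()}

def record(kind):
    # Build a function that remembers the type of every window it receives.
    def fn(x):
        seen[kind].add(type(x))
        return x.sum()
    return fn

df.rolling(2, min_periods=1).apply(record("raw"), raw=True)
df.rolling(2, min_periods=1).apply(record("not_raw"), raw=False)

print(seen["raw"])      # numpy arrays when raw=True
print(seen["not_raw"])  # pandas Series when raw=False
```

Either way the function still sees one column at a time, so raw=False only helps because the Series carries its index.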