Calculate mean for selected rows for selected columns in pandas data frame

To select the rows of your dataframe you can use iloc, you can then select the columns you want using square brackets.

For example:

 df = pd.DataFrame(data=[[1,2,3]]*5, index=range(3, 8), columns = ['a','b','c'])

gives the following dataframe:

to select only the 3d and fifth row you can do:

df.iloc[[2,4]]

which returns:

   a  b  c
5  1  2  3
7  1  2  3

if you then want to select only columns b and c you use the following command:

df[['b', 'c']].iloc[[2,4]]

which yields:

   b  c
5  2  3
7  2  3

To then get the mean of this subset of your dataframe you can use the df.mean function. If you want the means of the columns you can specify axis=0, if you want the means of the rows you can specify axis=1

thus:

df[['b', 'c']].iloc[[2,4]].mean(axis=0)

returns:

b    2
c    3

As we should expect from the input dataframe.

For your code you can then do:

 df[column_list].iloc[row_index_list].mean(axis=0)

EDIT after comment: New question in comment: I have to store these means in another df/matrix. I have L1, L2, L3, L4...LX lists which tells me the index whose mean I need for columns C[1, 2, 3]. For ex: L1 = [0, 2, 3] , means I need mean of rows 0,2,3 and store it in 1st row of a new df/matrix. Then L2 = [1,4] for which again I will calculate mean and store it in 2nd row of the new df/matrix. Similarly till LX, I want the new df to have X rows and len(C) columns. Columns for L1..LX will remain same. Could you help me with this?

Answer:

If i understand correctly, the following code should do the trick (Same df as above, as columns I took 'a' and 'b':

first you loop over all the lists of rows, collection all the means as pd.series, then you concatenate the resulting list of series over axis=1, followed by taking the transpose to get it in the right format.

dfs = list()
for l in L:
    dfs.append(df[['a', 'b']].iloc[l].mean(axis=0))

mean_matrix = pd.concat(dfs, axis=1).T

You can select specific columns from a DataFrame by passing a list of indices to .iloc, for example:

df.iloc[:, [2,5,6,7,8]]

Will return a DataFrame containing those numbered columns (note: This uses 0-based indexing, so 2 refers to the 3rd column.)

To take a mean down of that column, you could use:

# Mean along 0 (vertical) axis: return mean for specified columns, calculated across all rows
df.iloc[:, [2,5,6,7,8]].mean(axis=0)

To take a mean across that column, you could use:

# Mean along 1 (horizontal) axis: return mean for each row, calculated across specified columns
df.iloc[:, [2,5,6,7,8]].mean(axis=1)

You can also supply specific indices for both axes to return a subset of the table:

df.iloc[[1,2,3,4], [2,5,6,7,8]]

For your specific example, you would do:

import pandas as pd
import numpy as np

df = pd.DataFrame( 
np.array([[1,2,3,0,5],[1,2,3,4,5],[1,1,1,6,1],[1,0,0,0,0]]),
columns=["a","b","c","d","q"],
index = [0,1,2,3]
)

#I want mean of 0, 2, 3 rows for each a, b, d columns
#. a b d
#0 1 1 2

df.iloc[ [0,2,3], [0,1,3] ].mean(axis=0)

Which outputs:

a    1.0
b    1.0
d    2.0
dtype: float64

Alternatively, to access via column names, first select on those:

df[ ['a','b','d'] ].iloc[ [0,1,3] ].mean(axis=0)

To answer the second part of your question (from the comments) you can join multiple DataFrames together using pd.concat. It is faster to accumulate the frames in a list and then pass to pd.concat in one go, e.g.

dfs = []
for ix in idxs:
    dfm = df.iloc[ [0,2,3], ix ].mean(axis=0)
    dfs.append(dfm)

dfm_summary = pd.concat(dfs, axis=1) # Stack horizontally

Calculate mean for selected rows for selected columns in pandas data frame

Tags:

Python

Pandas

Related

Recent Posts