Python Pandas Counting the Occurrences of a Specific value
Couple of ways using count
or sum
In [338]: df
Out[338]:
col1 education
0 a 9th
1 b 9th
2 c 8th
In [335]: df.loc[df.education == '9th', 'education'].count()
Out[335]: 2
In [336]: (df.education == '9th').sum()
Out[336]: 2
In [337]: df.query('education == "9th"').education.count()
Out[337]: 2
You can create subset
of data with your condition and then use shape
or len
:
print df
col1 education
0 a 9th
1 b 9th
2 c 8th
print df.education == '9th'
0 True
1 True
2 False
Name: education, dtype: bool
print df[df.education == '9th']
col1 education
0 a 9th
1 b 9th
print df[df.education == '9th'].shape[0]
2
print len(df[df['education'] == '9th'])
2
Performance is interesting, the fastest solution is compare numpy array and sum
:
Code:
import perfplot, string
np.random.seed(123)
def shape(df):
return df[df.education == 'a'].shape[0]
def len_df(df):
return len(df[df['education'] == 'a'])
def query_count(df):
return df.query('education == "a"').education.count()
def sum_mask(df):
return (df.education == 'a').sum()
def sum_mask_numpy(df):
return (df.education.values == 'a').sum()
def make_df(n):
L = list(string.ascii_letters)
df = pd.DataFrame(np.random.choice(L, size=n), columns=['education'])
return df
perfplot.show(
setup=make_df,
kernels=[shape, len_df, query_count, sum_mask, sum_mask_numpy],
n_range=[2**k for k in range(2, 25)],
logx=True,
logy=True,
equality_check=False,
xlabel='len(df)')
Try this:
(df[education]=='9th').sum()
An elegant way to count the occurrence of '?'
or any symbol in any column, is to use built-in function isin
of a dataframe object.
Suppose that we have loaded the 'Automobile' dataset into df
object.
We do not know which columns contain missing value ('?'
symbol), so let do:
df.isin(['?']).sum(axis=0)
DataFrame.isin(values)
official document says:
it returns boolean DataFrame showing whether each element in the DataFrame is contained in values
Note that isin
accepts an iterable as input, thus we need to pass a list containing the target symbol to this function. df.isin(['?'])
will return a boolean dataframe as follows.
symboling normalized-losses make fuel-type aspiration-ratio ...
0 False True False False False
1 False True False False False
2 False True False False False
3 False False False False False
4 False False False False False
5 False True False False False
...
To count the number of occurrence of the target symbol in each column, let's take sum
over all the rows of the above dataframe by indicating axis=0
.
The final (truncated) result shows what we expect:
symboling 0
normalized-losses 41
...
bore 4
stroke 4
compression-ratio 0
horsepower 2
peak-rpm 2
city-mpg 0
highway-mpg 0
price 4