How to keep the only the top N values in a dataframe
You can use np.unique
to sort and find the 5th largest value, and use where
:
uniques = np.unique(df)
# what happens if len(uniques) < 5?
thresh = uniques[-5]
df.where(df >= thresh)
Output:
day1 day2 day3 day4 day5
larry NaN 4.0 4 3 5
gunnar NaN NaN 3 4 4
tin NaN 5.0 5 6 7
Update: On the second look, I think you can do:
df.apply(pd.Series.nlargest, n=3,axis=1).reindex(df.columns, axis=1)
Output:
day1 day2 day3 day4 day5
larry NaN 4.0 4.0 NaN 5.0
gunnar NaN NaN 3.0 4.0 4.0
tin NaN 5.0 NaN 6.0 7.0
To keep, for every row, the top 3 values you can use:
df = (df * df.astype(float).apply(lambda x: x.isin(x.nlargest(3)), axis=1)).replace('', np.nan)
You can migrate nlargest
's parameter into a variable if needed.
Output:
day1 day2 day3 day4 day5
larry NaN 4 4 NaN 5
gunnar NaN NaN 3 4 4
tin NaN 5 5 6 7
stack
, sort
, keep the top 3 per row with a groupby
, and then unstack
. We need a mergesort
to ensure we always keep the first value per row in the case of ties. Reindex ensures we have all the original columns if there are no largest values in any of the rows for that column.
If you require Int
in the end, the best you can do is 'Int64'
, pandas nullable-integer dtype.
# df = df.apply(pd.to_numeric) # To make numeric, not object.
N = 3
(df.stack()
.sort_values(ascending=False, , kind='mergesort')
.groupby(level=0).head(N)
.unstack()
.reindex(df.columns, axis=1)
.astype('Int64'))
day1 day2 day3 day4 day5
larry <NA> 4 4 <NA> 5
gunnar <NA> <NA> 3 4 4
tin <NA> 5 <NA> 6 7
Here are some timings vs the number of rows, and proof that these three methods are equivalent (before any type-casting). @anky's solution is very fast and simple.
import perfplot
import pandas as pd
import numpy as np
def alollz_stack(df, N=3):
return (df.stack()
.sort_values(ascending=False, kind='mergesort')
.groupby(level=0).head(N)
.unstack()
.reindex(df.columns, axis=1))
def quang_nlargest(df, N=3):
return df.apply(pd.Series.nlargest, n=N, axis=1).reindex(df.columns, axis=1)
def anky_rank(df, N=3):
return df[df.iloc[:,::-1].rank(1,'first').ge(df.shape[1]-N+1)]
perfplot.show(
setup=lambda n: pd.DataFrame(np.random.randint(0, 100, (n, 5)),
columns=[f'col{i}' for i in range(1,6)]),
kernels=[
lambda df: alollz_stack(df),
lambda df: quang_nlargest(df),
lambda df: anky_rank(df)],
labels=['Stack', 'Apply_Nlargest', 'Rank'],
n_range=[2 ** k for k in range(18)],
equality_check=lambda x,y: np.allclose(x,y, equal_nan=True),
xlabel='N_rows'
)