Pandas, for each unique value in one column, get unique values in another column

By using sacul's sample data

df['subreddit'].groupby(df['author']).unique().apply(pd.Series)
Out[370]: 
          0    1
author          
a       sr1  sr2
b       sr2  NaN

Using groupby.agg() "aggrgeate" function:

DataFrameGroupBy.agg(arg, *args, **kwargs): aggregate using one or more operations over the specified axis. Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply

df = pd.DataFrame({'numbers': [1, 2, 3, 6, 9], 'colors': ['red', 'white', 'blue', 'red', 'white']}, columns=['numbers', 'colors'])

enter image description here

df.groupby('colors', as_index=True).agg({'numbers' : {"unique" : lambda x: set(x),
                                                      "nunique" : lambda x : len(set(x))}})

enter image description here

Here are two strategies to do it. No doubt, there are other ways.

Assuming your dataframe looks something like this (obviously with more columns):

df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})

>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2
...

SOLUTION 1: groupby

More straightforward than solution 2, and similar to your first attempt:

group = df.groupby('author')

df2 = group.apply(lambda x: x['subreddit'].unique())

# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())

Result:

>>> df2
author
a    [sr1, sr2]
b         [sr2]

The author is the index, and the single column is the list of all subreddits they are active in (this is how I interpreted how you wanted your output, according to your description).

If you wanted the subreddits each in a separate column, which might be more useable, depending on what you want to do with it, you could just do this after:

df2 = df2.apply(pd.Series)

Result:

>>> df2
          0    1
author          
a       sr1  sr2
b       sr2  NaN

Solution 2: Iterate through dataframe

you can make a new dataframe with all unique authors:

df2 = pd.DataFrame({'author':df.author.unique()})

And then just get the list of all unique subreddits they are active in, assigning it to a new column:

df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']])) 
    for _, x in df2.iterrows()]

This gives you this:

>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]

Pandas, for each unique value in one column, get unique values in another column

Tags:

Python

Pandas

Related

Recent Posts