show distinct column values in pyspark dataframe: python
This should help to get distinct values of a column:
df.select('column1').distinct().collect()
Note that .collect()
doesn't have any built-in limit on how many values can return so this might be slow -- use .show()
instead or add .limit(20)
before .collect()
to manage this.
Let's assume we're working with the following representation of data (two columns, k
and v
, where k
contains three entries, two unique:
+---+---+
| k| v|
+---+---+
|foo| 1|
|bar| 2|
|foo| 3|
+---+---+
With a Pandas dataframe:
import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()
This returns an ndarray
, i.e. array(['foo', 'bar'], dtype=object)
You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:
s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))
If you want the same result from Spark, i.e. an ndarray
, use toPandas()
:
s_df.toPandas()['k'].unique()
Alternatively, if you don't need an ndarray
specifically and just want a list of the unique values of column k
:
s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()
Finally, you can also use a list comprehension as follows:
[i.k for i in s_df.select('k').distinct().collect()]
You can use df.dropDuplicates(['col1','col2'])
to get only distinct rows based on colX in the array.