How to calculate the counts of each distinct value in a pyspark dataframe?
I think you're looking to use the DataFrame idiom of groupBy and count.
For example, given the following dataframe, one state per row:
df = sqlContext.createDataFrame([('TX',), ('NJ',), ('TX',), ('CA',), ('NJ',)], ('state',))
+-----+
|state|
+-----+
|   TX|
|   NJ|
|   TX|
|   CA|
|   NJ|
+-----+
Grouping by the column and counting, i.e. df.groupBy('state').count().show(), yields (row order is not guaranteed):
+-----+-----+
|state|count|
+-----+-----+
|   TX|    2|
|   NJ|    2|
|   CA|    1|
+-----+-----+