PySpark groupByKey returning pyspark.resultiterable.ResultIterable
In addition to the above answers, if you want a sorted list of unique items, use the following:
List of Distinct and Sorted Values
example.groupByKey().mapValues(set).mapValues(sorted)
Just List of Sorted Values
example.groupByKey().mapValues(sorted)
Alternatives to the above
# List of distinct sorted items
example.groupByKey().map(lambda x: (x[0], sorted(set(x[1]))))
# just sorted list of items
example.groupByKey().map(lambda x: (x[0], sorted(x[1])))
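Outside of Spark, the same logic can be sketched in plain Python, grouping pairs by key and then applying `sorted(set(...))` or `sorted(...)`. The sample data here is made up for illustration; it mirrors what `groupByKey()` followed by the `mapValues` calls above would produce on an RDD of `(key, value)` tuples:

```python
from collections import defaultdict

# Sample (key, value) pairs, mirroring an RDD of tuples (made-up data)
pairs = [(0, 'D'), (0, 'B'), (0, 'D'), (1, 'E'), (2, 'F')]

# Emulate groupByKey(): collect all values per key
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# Emulate .mapValues(set).mapValues(sorted): distinct, sorted values
distinct_sorted = {k: sorted(set(vs)) for k, vs in grouped.items()}
print(distinct_sorted)  # {0: ['B', 'D'], 1: ['E'], 2: ['F']}

# Emulate .mapValues(sorted): sorted values, duplicates kept
just_sorted = {k: sorted(vs) for k, vs in grouped.items()}
print(just_sorted)  # {0: ['B', 'D', 'D'], 1: ['E'], 2: ['F']}
```

The difference between the two variants is only whether the values pass through `set` before sorting, which is what drops the duplicate `'D'` for key `0`.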
Instead of using groupByKey(), I would suggest using cogroup(). You can refer to the example below.
[(key, tuple(map(list, vals))) for key, vals in sorted(list(x.cogroup(y).collect()))]
Example:
>>> x = sc.parallelize([("foo", 1), ("bar", 4)])
>>> y = sc.parallelize([("foo", -1)])
>>> z = [(key, tuple(map(list, vals))) for key, vals in sorted(list(x.cogroup(y).collect()))]
>>> print(z)
[('bar', ([4], [])), ('foo', ([1], [-1]))]
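The semantics of cogroup() can be sketched in plain Python: for every key present in either dataset, pair up the list of values from each side (an empty list where a key is missing). This is an illustrative stand-in, not Spark's actual shuffle-based implementation:

```python
# Plain-Python sketch of cogroup() semantics on two small datasets
x_pairs = [("foo", 1), ("bar", 4)]
y_pairs = [("foo", -1)]

# Union of keys from both sides
keys = {k for k, _ in x_pairs} | {k for k, _ in y_pairs}

# For each key, gather the values from each side into a pair of lists
cogrouped = sorted(
    (k,
     ([v for kk, v in x_pairs if kk == k],
      [v for kk, v in y_pairs if kk == k]))
    for k in keys
)
print(cogrouped)  # [('bar', ([4], [])), ('foo', ([1], [-1]))]
```

Note that `'bar'` appears only in `x_pairs`, so its second list is empty, which matches what cogroup() returns for keys missing from one side.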
What you're getting back is an object which allows you to iterate over the results. You can turn the results of groupByKey into a list by calling list() on the values, e.g.
example = sc.parallelize([(0, u'D'), (0, u'D'), (1, u'E'), (2, u'F')])
example.groupByKey().collect()
# Gives [(0, <pyspark.resultiterable.ResultIterable object ......]
example.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()
# Gives [(0, [u'D', u'D']), (1, [u'E']), (2, [u'F'])]
More concisely, you can also use
example.groupByKey().mapValues(list)
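The two forms are equivalent: `mapValues(f)` applies `f` to each value while leaving the key untouched. A plain-Python sketch of that equivalence, using a hypothetical `map_values` helper in place of the RDD method:

```python
def map_values(f, pairs):
    # Equivalent of RDD.mapValues: apply f to each value, keep the key
    return [(k, f(v)) for k, v in pairs]

# Stand-in for the grouped result: (key, iterable-of-values) pairs
grouped = [(0, ('D', 'D')), (1, ('E',)), (2, ('F',))]

print(map_values(list, grouped))  # [(0, ['D', 'D']), (1, ['E']), (2, ['F'])]
```

In Spark, `mapValues` has the added benefit of preserving the RDD's partitioning, since the keys are guaranteed not to change.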