RDD.cartesian PySpark code example

Example 1: Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.

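from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse the active context, or start one
rdd = sc.parallelize([1, 2])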
sorted(rdd.cartesian(rdd).collect())
# [(1, 1), (1, 2), (2, 1), (2, 2)]
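
The pairing is easier to see when the two inputs differ. The following is a plain-Python sketch, not Spark code: it uses itertools.product to build the pairs one would expect a.cartesian(b) to contain for two small in-memory lists. The helper name local_cartesian is hypothetical and only serves the illustration. The order of a real collect() should not be relied on, which is why the example above wraps the result in sorted().

```python
from itertools import product


def local_cartesian(left, right):
    """Hypothetical helper: every (a, b) pair with a from left and b from right."""
    return list(product(left, right))


pairs = local_cartesian([1, 2], ["x", "y", "z"])
print(pairs)
print(len(pairs))  # one pair per combination: 2 * 3 = 6
```

The result size is the product of the two input sizes, so a Cartesian product of two large RDDs grows quickly.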

Example 2: Sort this RDD by the given keyfunc, a function that returns the sort key for each element.

# sortBy(keyfunc, ascending=True, numPartitions=None)

tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
sc.parallelize(tmp).sortBy(lambda x: x[0]).collect()
# [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
sc.parallelize(tmp).sortBy(lambda x: x[1]).collect()
# [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
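
The signature also takes an ascending flag, which neither call above sets. As a rough sketch of the ordering to expect from a descending sort, the block below uses plain Python's sorted with reverse=True as a stand-in for sortBy(..., ascending=False). This is an assumption made for illustration: no Spark is involved, and the stand-in only matches when the keys are distinct, as they are here.

```python
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]

# Stand-in for sortBy(lambda x: x[1], ascending=False): largest value first.
by_value_desc = sorted(tmp, key=lambda x: x[1], reverse=True)
print(by_value_desc)
# [('2', 5), ('d', 4), ('1', 3), ('b', 2), ('a', 1)]

# Stand-in for sortBy(lambda x: x[0], ascending=False): strings compare by
# code point, so letters sort after digits and therefore come first here.
by_key_desc = sorted(tmp, key=lambda x: x[0], reverse=True)
print(by_key_desc)
# [('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3)]
```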