.cartesian pyspark code example
Example 1: Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
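from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # reuse the active context, or create one
rdd = sc.parallelize([1, 2])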
sorted(rdd.cartesian(rdd).collect())
# [(1, 1), (1, 2), (2, 1), (2, 2)]
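To see what the pairing means without starting Spark, here is a minimal pure-Python sketch of the same semantics. It is only an illustration on in-memory lists, not how Spark computes the result; the helper name `cartesian_pairs` is hypothetical and is not part of PySpark.

```python
from itertools import product

def cartesian_pairs(left, right):
    """Hypothetical helper: every pair (a, b) with a from left and b from right."""
    return list(product(left, right))

# Same input as Example 1, paired with itself.
pairs = cartesian_pairs([1, 2], [1, 2])
print(sorted(pairs))
# [(1, 1), (1, 2), (2, 1), (2, 2)]

# The two sides may differ; the result has len(left) * len(right) pairs.
print(len(cartesian_pairs([1, 2], ['x', 'y', 'z'])))
# 6
```

Because the result size is the product of the two input sizes, a Cartesian product of two large RDDs can become very large very quickly.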
Example 2: Sort this RDD by the given keyfunc.
# sortBy(keyfunc, ascending=True, numPartitions=None)
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
sc.parallelize(tmp).sortBy(lambda x: x[0]).collect()
# [('1', 3), ('2', 5), ('a', 1), ('b', 2), ('d', 4)]
sc.parallelize(tmp).sortBy(lambda x: x[1]).collect()
# [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
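The `ascending` parameter in the signature above is not exercised by the examples. As a rough local sketch of what it controls, the following pure-Python stand-in sorts the same list in descending order. The function `sort_by` is hypothetical and only mimics the ordering on an in-memory list; it says nothing about how Spark partitions or shuffles the data.

```python
tmp = [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]

def sort_by(items, keyfunc, ascending=True):
    """Hypothetical local stand-in for RDD.sortBy on an in-memory list."""
    return sorted(items, key=keyfunc, reverse=not ascending)

# Descending by the first element (digit characters sort before letters).
print(sort_by(tmp, lambda x: x[0], ascending=False))
# [('d', 4), ('b', 2), ('a', 1), ('2', 5), ('1', 3)]

# Descending by the second element.
print(sort_by(tmp, lambda x: x[1], ascending=False))
# [('2', 5), ('d', 4), ('1', 3), ('b', 2), ('a', 1)]
```

With a live SparkContext, the corresponding call would presumably be `sc.parallelize(tmp).sortBy(lambda x: x[0], ascending=False).collect()`.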