View RDD contents in Python Spark?
In Spark 2.0 (I haven't tested earlier versions), simply:
print myRDD.take(n)
where n is the number of elements to show and myRDD is your RDD (wc in your case).
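As a rough sketch of the same idea, the call can be wrapped in a small helper that prints one element per line instead of one long list. The name `preview` and its default of 10 are made up for illustration; the only Spark call it relies on is `take(n)`, which returns the first n elements to the driver as a list.

```python
def preview(rdd, n=10):
    """Print the first n elements of an RDD, one per line.

    `preview` is a hypothetical helper name, not part of PySpark.
    It only assumes that `rdd` has a take(n) method, as PySpark RDDs do.
    """
    rows = rdd.take(n)  # plain Python list, already on the driver
    for row in rows:
        print(row)      # single-argument form works in Python 2 and 3
    return rows
```

In the shell, `preview(wc, 5)` would then print up to five elements of wc.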
This error occurs because print is a statement, not a function, in Python 2.6.
You can either define a helper function that performs the print, or import print_function from the __future__ module to treat print as a function:
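>>> from operator import add
>>> f = sc.textFile("README.md")
>>> # wc is assumed here to be the usual word-count RDD from the question
>>> wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
>>> def g(x):
...     print x
...
>>> wc.foreach(g)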
or
>>> from __future__ import print_function
>>> wc.foreach(print)
However, I think it would be better to use collect() to bring the RDD contents back to the driver, because foreach executes on the worker nodes and the output may not necessarily appear in your driver / shell (it probably will in local mode, but not when running on a cluster).
>>> for x in wc.collect():
...     print x
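If the RDD is too large to collect() comfortably, one possible variation is to stream it to the driver with toLocalIterator() and stop after a fixed number of elements. This is only a sketch: `print_some` and `limit` are illustrative names, not anything from PySpark, and it should be treated as untested on a real cluster.

```python
from itertools import islice


def print_some(rdd, limit=20):
    """Print at most `limit` elements on the driver; return how many were printed.

    `print_some` is a hypothetical helper, not a PySpark function.
    It assumes `rdd` provides toLocalIterator(), as PySpark RDDs do.
    """
    count = 0
    # toLocalIterator() hands elements to the driver as an iterator,
    # so the whole RDD does not have to be materialised as one list.
    for x in islice(rdd.toLocalIterator(), limit):
        print(x)
        count += 1
    return count
```

With the word-count RDD above this would be called as `print_some(wc, 20)`.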