'PipelinedRDD' object has no attribute 'toDF' in PySpark
The toDF method is a monkey patch executed inside the SparkSession constructor (the SQLContext constructor in 1.x), so to be able to use it you have to create a SparkSession (or SQLContext) first:
# SQLContext or HiveContext in Spark 1.x
from pyspark.sql import SparkSession
from pyspark import SparkContext
sc = SparkContext()
rdd = sc.parallelize([("a", 1)])
hasattr(rdd, "toDF")
## False
spark = SparkSession(sc)
hasattr(rdd, "toDF")
## True
rdd.toDF().show()
## +---+---+
## | _1| _2|
## +---+---+
## |  a|  1|
## +---+---+
Not to mention you need a SQLContext or SparkSession to work with DataFrames in the first place.
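To see why the attribute only shows up after the session exists, here is a toy model of the monkey-patch pattern in plain Python. FakeRDD, FakeSession, to_df and create_frame are hypothetical names made up for illustration; this is not PySpark's actual code, just a sketch of the mechanism as described above.

```python
# Toy model of the monkey-patch pattern; FakeRDD and FakeSession are
# hypothetical stand-ins, not PySpark classes.

class FakeRDD:
    def __init__(self, rows):
        self.rows = list(rows)


class FakeSession:
    def __init__(self):
        # Constructing the session attaches to_df to the FakeRDD class,
        # so every FakeRDD instance (existing or new) gains the method.
        session = self

        def to_df(rdd):
            return session.create_frame(rdd.rows)

        FakeRDD.to_df = to_df

    def create_frame(self, rows):
        # Stand-in for building a DataFrame: name the columns _1, _2, ...
        return [{f"_{i + 1}": v for i, v in enumerate(row)} for row in rows]


rdd = FakeRDD([("a", 1)])
before = hasattr(rdd, "to_df")  # no session has been constructed yet
FakeSession()
after = hasattr(rdd, "to_df")   # the class was patched by the constructor
frame = rdd.to_df()
print(before, after, frame)
```

Because the patch is applied to the class rather than to an instance, an RDD created before the session picks up the method as soon as the session is constructed, which matches the hasattr behaviour shown earlier.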
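Make sure you have created a SparkSession too, not just a SparkContext:
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext("local", "first app")
spark = SparkSession(sc)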