'PipelinedRDD' object has no attribute 'toDF' in PySpark

The toDF method is a monkey patch applied inside the SparkSession constructor (the SQLContext constructor in 1.x), so to use it you have to create a SparkSession (or SQLContext) first:

# In Spark 1.x, use SQLContext or HiveContext instead of SparkSession
from pyspark.sql import SparkSession
from pyspark import SparkContext

sc = SparkContext()

rdd = sc.parallelize([("a", 1)])
hasattr(rdd, "toDF")
## False

spark = SparkSession(sc)
hasattr(rdd, "toDF")
## True

rdd.toDF().show()
## +---+---+
## | _1| _2|
## +---+---+
## |  a|  1|
## +---+---+

Not to mention you need a SQLContext or SparkSession to work with DataFrames in the first place.
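
The monkey-patching mechanism described above can be illustrated without Spark. The following is a toy sketch, not PySpark's actual source: FakeRDD and FakeSession are hypothetical names invented for this example, and the toDF here just returns a list of dicts. It only shows why the attribute is missing until the session object has been constructed.

```python
# Toy illustration only: FakeRDD and FakeSession are hypothetical stand-ins,
# not PySpark classes.

class FakeRDD:
    """Stands in for an RDD; it has no toDF method when defined."""

    def __init__(self, rows):
        self.rows = rows


class FakeSession:
    """Stands in for SparkSession; creating one patches FakeRDD."""

    def __init__(self):
        def toDF(rdd):
            # Name the columns _1, _2, ... in the style of the output above.
            return [
                {f"_{i + 1}": value for i, value in enumerate(row)}
                for row in rdd.rows
            ]

        # The monkey patch: attach the function to the class, so every
        # instance, including ones created earlier, gains the method.
        FakeRDD.toDF = toDF


rdd = FakeRDD([("a", 1)])
before = hasattr(rdd, "toDF")   # no session has been created yet

FakeSession()
after = hasattr(rdd, "toDF")    # the constructor has patched the class

print(before, after)
print(rdd.toDF())
```

Because the patch is applied to the class rather than to a single object, the rdd created before the session also gains the method, which matches the hasattr check in the answer above.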

Make sure you have a SparkSession too:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext("local", "first app")
spark = SparkSession(sc)
