Convert RDD to DataFrame in Spark/Scala
Just paste this into a spark-shell:
// Sample data: each record is an Array[String] with six fields
val a = Array(
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"))
val rdd = sc.makeRDD(a)

// One field per column; toDF() will infer the schema from this case class
case class X(callId: String, oCallId: String,
             callTime: String, duration: String, calltype: String, swId: String)
Then map() over the RDD to create instances of the case class, and create the DataFrame with toDF():
scala> val df = rdd.map {
case Array(s0, s1, s2, s3, s4, s5) => X(s0, s1, s2, s3, s4, s5) }.toDF()
df: org.apache.spark.sql.DataFrame =
[callId: string, oCallId: string, callTime: string,
duration: string, calltype: string, swId: string]
This infers the schema from the case class.
Then you can proceed with:
scala> df.printSchema()
root
|-- callId: string (nullable = true)
|-- oCallId: string (nullable = true)
|-- callTime: string (nullable = true)
|-- duration: string (nullable = true)
|-- calltype: string (nullable = true)
|-- swId: string (nullable = true)
scala> df.show()
+----------+-------+-------------------+--------+--------+----+
| callId|oCallId| callTime|duration|calltype|swId|
+----------+-------+-------------------+--------+--------+----+
|4580056797| 0|2015-07-29 10:38:42| 0| 1| 1|
|4580056797| 0|2015-07-29 10:38:42| 0| 1| 1|
+----------+-------+-------------------+--------+--------+----+
If you want to use toDF() in a normal program (not in the spark-shell), make sure to (see the sketch after this list):
- import sqlContext.implicits._ right after creating the SQLContext
- define the case class outside of the method that calls toDF()
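A minimal standalone sketch (the object name RddToDf and the SparkContext wiring are my own illustration, assuming Spark 1.x):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// The case class lives OUTSIDE the method that calls toDF()
case class X(callId: String, oCallId: String,
             callTime: String, duration: String, calltype: String, swId: String)

object RddToDf {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddToDf"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // right after creating the SQLContext

    val rdd = sc.makeRDD(Array(
      Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1")))
    val df = rdd.map { case Array(s0, s1, s2, s3, s4, s5) =>
      X(s0, s1, s2, s3, s4, s5) }.toDF()
    df.show()
  }
}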
You first need to convert your Array into a Row and then define the schema. I assumed that most of your fields are Long:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

val rdd: RDD[Array[String]] = ???

// Convert each Array[String] into a Row, casting the numeric fields
val rows: RDD[Row] = rdd map {
  case Array(callId, oCallId, callTime, duration, calltype, swId) =>
    Row(callId.toLong, oCallId.toLong, callTime,
      duration.toLong, calltype.toLong, swId.toLong)
}

object schema {
  val callId = StructField("callId", LongType)
  val oCallId = StructField("oCallId", LongType)
  val callTime = StructField("callTime", StringType)
  val duration = StructField("duration", LongType)
  val calltype = StructField("calltype", LongType)
  val swId = StructField("swId", LongType)

  val struct = StructType(Array(callId, oCallId, callTime, duration, calltype, swId))
}

sqlContext.createDataFrame(rows, schema.struct)
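For illustration, the ??? could stand in for the question's sample data (a hypothetical stand-in; in practice it would be your real RDD):

// Hypothetical example input built from the question's sample rows
val rdd: RDD[Array[String]] = sc.parallelize(Array(
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1")))

With that in place, createDataFrame(rows, schema.struct) yields a DataFrame in which every column except callTime is a long rather than a string.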
Using Spark 1.6.1 and Scala 2.10 I got the same error:
error: overloaded method value createDataFrame with alternatives:
For me, the gotcha was the signature of createDataFrame: I was trying to use a val rdd: List[Row], but it failed because java.util.List[org.apache.spark.sql.Row] and scala.collection.immutable.List[org.apache.spark.sql.Row] are NOT the same.
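To make the mismatch concrete, here is a small sketch (assuming the Spark 1.6 SQLContext API; the two-column schema and rows are made up for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import scala.collection.JavaConverters._

val schema = StructType(Array(
  StructField("callId", StringType, true),
  StructField("oCallId", StringType, true)))
val scalaRows: List[Row] = List(Row("4580056797", "0"))

// Does not compile: no createDataFrame overload accepts a Scala List[Row]
// sqlContext.createDataFrame(scalaRows, schema)

// Works: convert the Scala List to an RDD ...
val df1 = sqlContext.createDataFrame(sc.parallelize(scalaRows), schema)
// ... or to a java.util.List, which matches the Java-facing overload
val df2 = sqlContext.createDataFrame(scalaRows.asJava, schema)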
The working solution I found was to convert the val rdd: Array[Array[String]] into an RDD[Row] via a List[Array[String]]. I find this is the closest to what's in the documentation:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val rdd_original: Array[Array[String]] = Array(
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"),
  Array("4580056797", "0", "2015-07-29 10:38:42", "0", "1", "1"))
val rdd: List[Array[String]] = rdd_original.toList

val schemaString = "callId oCallId callTime duration calltype swId"

// Generate the schema from the space-separated string of column names
val schema = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert the records to Rows
val rowRDD = rdd.map(p => Row(p: _*)) // using a splat is easier
// val rowRDD = rdd.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5))) // this also works

val df = sqlContext.createDataFrame(sc.parallelize(rowRDD), schema)
df.show()