How to create a DataFrame from a text file in Spark

You will not able to convert it into data frame until you use implicit conversion.

val sqlContext = new SqlContext(new SparkContext())

import sqlContext.implicits._

After this only you can convert this to data frame

case class Test(id:String,filed2:String)

val myFile = sc.textFile("file.txt")

val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()

Update - as of Spark 1.6, you can simply use the built-in csv data source:

spark: SparkSession = // create the Spark Session
val df = spark.read.csv("file.txt")

You can also use various options to control the CSV parsing, e.g.:

val df = spark.read.option("header", "false").csv("file.txt")

For Spark version < 1.6: The easiest way is to use spark-csv - include it in your dependencies and follow the README, it allows setting a custom delimiter (;), can read CSV headers (if you have them), and it can infer the schema types (with the cost of an extra scan of the data).

Alternatively, if you know the schema you can create a case-class that represents it and map your RDD elements into instances of this class before transforming into a DataFrame, e.g.:

case class Record(id: Int, name: String)

val myFile1 = myFile.map(x=>x.split(";")).map {
  case Array(id, name) => Record(id.toInt, name)
} 

myFile1.toDF() // DataFrame will have columns "id" and "name"

If you want to use the toDF method, you have to convert your RDD of Array[String] into a RDD of a case class. For example, you have to do:

case class Test(id:String,filed2:String)
val myFile = sc.textFile("file.txt")
val df= myFile.map( x => x.split(";") ).map( x=> Test(x(0),x(1)) ).toDF()

I have given different ways to create DataFrame from text file

val conf = new SparkConf().setAppName(appName).setMaster("local")
val sc = SparkContext(conf)

raw text file

val file = sc.textFile("C:\\vikas\\spark\\Interview\\text.txt")
val fileToDf = file.map(_.split(",")).map{case Array(a,b,c) => 
(a,b.toInt,c)}.toDF("name","age","city")
fileToDf.foreach(println(_))

spark session without schema

import org.apache.spark.sql.SparkSession
val sparkSess = 
SparkSession.builder().appName("SparkSessionZipsExample")
.config(conf).getOrCreate()

val df = sparkSess.read.option("header", 
"false").csv("C:\\vikas\\spark\\Interview\\text.txt")
df.show()

spark session with schema

import org.apache.spark.sql.types._
val schemaString = "name age city"
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, 
StringType, nullable=true))
val schema = StructType(fields)

val dfWithSchema = sparkSess.read.option("header", 
"false").schema(schema).csv("C:\\vikas\\spark\\Interview\\text.txt")
dfWithSchema.show()

using sql context

import org.apache.spark.sql.SQLContext

val fileRdd = 
sc.textFile("C:\\vikas\\spark\\Interview\\text.txt").map(_.split(",")).map{x 
=> org.apache.spark.sql.Row(x:_*)}
val sqlDf = sqlCtx.createDataFrame(fileRdd,schema)
sqlDf.show()

How to create a DataFrame from a text file in Spark

raw text file

spark session without schema

spark session with schema

using sql context

Tags:

Scala

Dataframe

Apache Spark

Rdd

Apache Spark Sql

Related

Recent Posts