How to import multiple CSV files in a single load?
Use a wildcard, e.g. replace 2008 with *:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("../Downloads/*.csv") // <-- note the star (*)
Spark 2.0
// these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
spark.read.option("header", "true").csv("../Downloads/*.csv")
Notes:
- Replace format("com.databricks.spark.csv") with format("csv"), or use the csv method instead; the com.databricks.spark.csv format has been integrated into Spark 2.0.
- Use spark instead of sqlContext.
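For completeness, here is a minimal sketch of how the spark entry point used in the examples below can be obtained in Spark 2.0 (the application name is just a placeholder):
import org.apache.spark.sql.SparkSession

// SparkSession replaces SQLContext as the entry point in Spark 2.0
val spark = SparkSession.builder()
  .appName("csv-loading-example") // placeholder name, adjust as needed
  .getOrCreate()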
Ex1:
Reading a single CSV file. Provide the complete file path:
val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\cars1.csv")
Ex2:
Reading multiple CSV files by passing their names:
val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\cars1.csv", "C:\\spark\\sample_data\\tmp\\cars2.csv")
Ex3:
Reading multiple CSV files by passing a list of names (the : _* syntax expands the list into varargs):
val paths = List("C:\\spark\\sample_data\\tmp\\cars1.csv", "C:\\spark\\sample_data\\tmp\\cars2.csv")
val df = spark.read.option("header", "true").csv(paths: _*)
Ex4:
Reading all CSV files in a folder, ignoring other files:
val df = spark.read.option("header", "true").csv("C:\\spark\\sample_data\\tmp\\*.csv")
Ex5:
Reading multiple CSV files from multiple folders:
val folders = List("C:\\spark\\sample_data\\tmp", "C:\\spark\\sample_data\\tmp1")
val df = spark.read.option("header", "true").csv(folders: _*)
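If you want to verify which files actually ended up in the combined DataFrame, one way (a sketch assuming the df from the examples above) is the built-in input_file_name function:
import org.apache.spark.sql.functions.input_file_name

// Tag each row with the file it was read from (column name is arbitrary),
// then list the distinct source files
val withSource = df.withColumn("source_file", input_file_name())
withSource.select("source_file").distinct().show(false)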