How to read csv without header and name them with names while reading in pyspark?
You can import the csv file into a dataframe with a predefined schema. You define the schema with the StructType and StructField objects. Assuming your data is all of IntegerType:
from pyspark.sql.types import StructType, StructField, IntegerType
schema = StructType([
    StructField("member_srl", IntegerType(), True),
    StructField("click_day", IntegerType(), True),
    StructField("productid", IntegerType(), True)])
df = spark.read.csv("user_click_seq.csv", header=False, schema=schema)
should work.
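If you prefer not to build the StructType by hand, recent Spark versions (2.3+) also accept a DDL-formatted schema string. A minimal sketch, reusing the same file path and column names as above:

# Sketch: same result with a DDL-style schema string instead of StructType
df = spark.read.csv(
    "user_click_seq.csv",
    header=False,
    schema="member_srl INT, click_day INT, productid INT",
)
df.printSchema()  # verify the column names and types were applied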
For those who would like to do this in Scala and may not want to add types:
val df = spark.read.format("csv")
  .option("header", "false")
  .load("hdfs_filepath")
  .toDF("var0", "var1", "var2", "var3")
You can read the data with header=False and then pass the column names with toDF, as below:
data = spark.read.csv('data.csv', header=False)
data = data.toDF('name1', 'name2', 'name3')
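Since toDF takes the names as varargs, you can also unpack them from a list if the column names are built elsewhere. A small sketch (the names here are just placeholders):

# Sketch: unpacking a list of column names into toDF
column_names = ['name1', 'name2', 'name3']
data = spark.read.csv('data.csv', header=False).toDF(*column_names)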