Null values from a csv on Scala and Apache Spark

The null values appear because the default "mode" of the csv API is PERMISSIVE:

mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes.
- PERMISSIVE : sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, an user can set a string type field named columnNameOfCorruptRecord in an user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When a length of parsed CSV tokens is shorter than an expected length of a schema, it sets null for extra fields.
- DROPMALFORMED : ignores the whole corrupted records.
- FAILFAST : throws an exception when it meets corrupted records

(Quoted from the Spark csv API documentation.)

If we load the file without a schema, every column is read as a string and all the values come through:

scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").load("data.csv")

df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]

scala> df.show
+----+-----+--------------------+------------+-----------+-----------+
|Rank|Grade|         Channelname|VideoUploads|Subscribers| Videoviews|
+----+-----+--------------------+------------+-----------+-----------+
| 1st| A++ |              Zee TV|       82757|   18752951|20869786591|
| 2nd| A++ |            T-Series|       12661|   61196302|47548839843|
| 3rd| A++ |Cocomelon - Nurse...|         373|   19238251| 9793305082|
| 4th| A++ |           SET India|       27323|   31180559|22675948293|
| 5th| A++ |                 WWE|       36756|   32852346|26273668433|
| 6th| A++ |          Movieclips|       30243|   17149705|16618094724|
| 7th| A++ |          netd müzik|        8500|   11373567|23898730764|
| 8th| A++ |ABS-CBN Entertain...|      100147|   12149206|17202609850|
| 9th| A++ |     Ryan ToysReview|        1140|   16082927|24518098041|
|10th| A++ |         Zee Marathi|       74607|    2841811| 2591830307|
|11th|  A+ |     5-Minute Crafts|        2085|   33492951| 8587520379|
|12th|  A+ |     Canal KondZilla|         822|   39409726|19291034467|
|13th|  A+ |    Like Nastya Vlog|         150|    7662886| 2540099931|
|14th|  A+ |               Ozuna|          50|   18824912| 8727783225|
|15th|  A+ |          Wave Music|       16119|   15899764|10989179147|
|16th|  A+ |         Ch3Thailand|       49239|   11569723| 9388600275|
|17th|  A+ |     WORLDSTARHIPHOP|        4778|   15830098|11102158475|
|18th|  A+ |     Vlad and Nikita|          53|        -- | 1428274554|
+----+-----+--------------------+------------+-----------+-----------+

If we apply your schema, every row comes back null:

scala> import org.apache.spark.sql.types._

scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",IntegerType,true),StructField("Videoviews",IntegerType,true)))

scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]

scala> df.show
+----+-----+-----------+-------------+----------+----------+
|Rank|Grade|Channelname|Video Uploads|Suscribers|Videoviews|
+----+-----+-----------+-------------+----------+----------+
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
|null| null|       null|         null|      null|      null|
+----+-----+-----------+-------------+----------+----------+
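One way to find out which lines Spark considers corrupt is to add the corrupt-record column described in the PERMISSIVE documentation above. The following is only a sketch: it assumes a spark-shell session (so `spark` already exists), Spark 2.0 or later with the built-in csv reader, and the same data.csv; the names `debugSchema`, `debugDf` and `_corrupt_record` are chosen here for illustration.

```scala
import org.apache.spark.sql.types._

// Same columns as the schema in question, plus a string column that
// receives the raw text of any line Spark could not parse.
val debugSchema = StructType(Array(
  StructField("Rank", StringType, true),
  StructField("Grade", StringType, true),
  StructField("Channelname", StringType, true),
  StructField("Video Uploads", IntegerType, true),
  StructField("Suscribers", IntegerType, true),
  StructField("Videoviews", IntegerType, true),
  StructField("_corrupt_record", StringType, true)
))

val debugDf = spark.read
  .format("csv")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(debugSchema)
  .load("data.csv")

// truncate = false so the full raw line is visible
debugDf.show(false)
```

Alternatively, setting the "mode" option to FAILFAST should make the read throw an exception on the first bad line instead of silently returning nulls.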

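Now if we look at your data, we see that Subscribers contains non-integer values ("--") and Videoviews contains values that exceed the Integer maximum (2,147,483,647). Every row shown fails on at least one of these two columns, so in PERMISSIVE mode every row is treated as a corrupt record and all of its fields are set to null.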
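Both problems can be reproduced in plain Scala, without Spark. This is only an illustration of the value ranges involved, not of Spark's actual parser; the two sample values are copied from the table above and the variable names are made up for the example.

```scala
import scala.util.Try

// Values copied from the table above.
val views = "20869786591"      // Videoviews of the first row
val dashes = "-- "             // Subscribers of the 18th row

// toInt / toLong throw NumberFormatException on bad input; Try captures that.
val fitsInt = Try(views.toInt).isSuccess
val fitsLong = Try(views.toLong).isSuccess
val dashesAreInt = Try(dashes.trim.toInt).isSuccess

println(s"Int.MaxValue = ${Int.MaxValue}")                     // Int.MaxValue = 2147483647
println(s"views fits Int: $fitsInt, fits Long: $fitsLong")     // views fits Int: false, fits Long: true
println(s"'--' parses as Int: $dashesAreInt")                  // '--' parses as Int: false
```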

If we change the schema to match the data (Suscribers as StringType, Videoviews as LongType), the rows load correctly:

scala> val schema = StructType(Array(StructField("Rank",StringType,true),StructField("Grade", StringType, true),StructField("Channelname",StringType,true),StructField("Video Uploads",IntegerType,true), StructField("Suscribers",StringType,true),StructField("Videoviews",LongType,true)))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(Rank,StringType,true), StructField(Grade,StringType,true), StructField(Channelname,StringType,true), StructField(Video Uploads,IntegerType,true), StructField(Suscribers,StringType,true), StructField(Videoviews,LongType,true))

scala> val df = spark.read.format("com.databricks.spark.csv").option("header","true").schema(schema).load("data.csv")
df: org.apache.spark.sql.DataFrame = [Rank: string, Grade: string ... 4 more fields]

scala> df.show
+----+-----+--------------------+-------------+----------+-----------+
|Rank|Grade|         Channelname|Video Uploads|Suscribers| Videoviews|
+----+-----+--------------------+-------------+----------+-----------+
| 1st| A++ |              Zee TV|        82757|  18752951|20869786591|
| 2nd| A++ |            T-Series|        12661|  61196302|47548839843|
| 3rd| A++ |Cocomelon - Nurse...|          373|  19238251| 9793305082|
| 4th| A++ |           SET India|        27323|  31180559|22675948293|
| 5th| A++ |                 WWE|        36756|  32852346|26273668433|
| 6th| A++ |          Movieclips|        30243|  17149705|16618094724|
| 7th| A++ |          netd müzik|         8500|  11373567|23898730764|
| 8th| A++ |ABS-CBN Entertain...|       100147|  12149206|17202609850|
| 9th| A++ |     Ryan ToysReview|         1140|  16082927|24518098041|
|10th| A++ |         Zee Marathi|        74607|   2841811| 2591830307|
|11th|  A+ |     5-Minute Crafts|         2085|  33492951| 8587520379|
|12th|  A+ |     Canal KondZilla|          822|  39409726|19291034467|
|13th|  A+ |    Like Nastya Vlog|          150|   7662886| 2540099931|
|14th|  A+ |               Ozuna|           50|  18824912| 8727783225|
|15th|  A+ |          Wave Music|        16119|  15899764|10989179147|
|16th|  A+ |         Ch3Thailand|        49239|  11569723| 9388600275|
|17th|  A+ |     WORLDSTARHIPHOP|         4778|  15830098|11102158475|
|18th|  A+ |     Vlad and Nikita|           53|       -- | 1428274554|
+----+-----+--------------------+-------------+----------+-----------+
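If you still need Suscribers as a number, one option is to convert it after loading. The sketch below assumes `df` is the DataFrame loaded just above and that "--" is the only non-numeric placeholder in that column; `subscribersCol` and `typedDf` are names introduced here for illustration.

```scala
import org.apache.spark.sql.functions.{col, lit, trim, when}
import org.apache.spark.sql.types.LongType

// `df` is the DataFrame loaded above, with Suscribers read as StringType.
// The values are trimmed first because the output shows "-- " with
// trailing whitespace.
val subscribersCol = trim(col("Suscribers"))

// The "--" placeholder becomes null; everything else is cast to a 64-bit integer.
val typedDf = df.withColumn(
  "Suscribers",
  when(subscribersCol === "--", lit(null).cast(LongType))
    .otherwise(subscribersCol.cast(LongType))
)

typedDf.printSchema()
```

Handling the placeholder explicitly keeps the conversion from depending on how the cast treats unparseable strings, which can differ between Spark configurations.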
