PySpark: How to specify column with comma as decimal

You won't be able to read it as a float because of the format of the data. You need to read it as a string, clean it up, and then cast it to float:

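from pyspark.sql.functions import regexp_replace

# "header" tells Spark the first row holds column names; revenue comes in as a string
df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
# Drop the thousands separator: "1.234,56" -> "1234,56"
df = df.withColumn('revenue', regexp_replace('revenue', '\\.', ''))
# Turn the decimal comma into a decimal point: "1234,56" -> "1234.56"
df = df.withColumn('revenue', regexp_replace('revenue', ',', '.'))
# The string now parses as a number
df = df.withColumn('revenue', df['revenue'].cast("float"))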

You can probably just chain these all together too:

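from pyspark.sql.functions import col, regexp_replace

df = spark.read.option("header", "true").option("inferSchema", "true").csv("my_csv.csv", sep=";")
df = (
    df
    .withColumn('revenue', regexp_replace('revenue', '\\.', ''))
    .withColumn('revenue', regexp_replace('revenue', ',', '.'))
    # col() refers to the cleaned column at this point in the chain,
    # whereas df['revenue'] would point back at the original, uncleaned one
    .withColumn('revenue', col('revenue').cast("float"))
)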

Please note that I haven't tested this, so there may be a typo or two in there.
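
If it helps to see what the two replacements do to a single value, here is a plain-Python sketch of the same clean-up, with no Spark involved. The helper name `parse_decimal_comma` is made up for illustration, and it assumes `.` is only ever a thousands separator and `,` the decimal mark, as in the answer above.

```python
import re


def parse_decimal_comma(text):
    """Convert a string such as '1.234,56' to a float (hypothetical helper)."""
    # Step 1: remove thousands separators, like regexp_replace(col, '\\.', '')
    without_thousands = re.sub(r"\.", "", text)
    # Step 2: swap the decimal comma for a point, like regexp_replace(col, ',', '.')
    normalized = without_thousands.replace(",", ".")
    # Step 3: the string is now in a form float() accepts, like .cast("float")
    return float(normalized)


samples = ["1.234,56", "0,5", "12"]
parsed = [parse_decimal_comma(s) for s in samples]
print(parsed)  # [1234.56, 0.5, 12.0]
```

The order matters: if the comma were swapped first, the next step would strip the new decimal point along with the thousands separators.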
