DataFrame columns names conflict with .(dot)

To access the column name with a period using pyspark, do this:

spark.sql("select person.name from person_table")

Note: person_table is a registerTempTable on df.

My answer provides a working code snippet that illustrates the problem of having dots in column names and explains how you can easily remove dots from column names.

Let's create a DataFrame with some sample data:

schema = StructType([
    StructField("person.name", StringType(), True),
    StructField("person", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)]))
])
data = [
    ("charles", Row("chuck", 42)),
    ("larry", Row("chipper", 48))
]
df = spark.createDataFrame(data, schema)
df.show()

+-----------+-------------+
|person.name|       person|
+-----------+-------------+
|    charles|  [chuck, 42]|
|      larry|[chipper, 48]|
+-----------+-------------+

Let's illustrate that selecting person.name will return different results depending on if backticks are used or not.

cols = ["person.name", "person", "person.name", "`person.name`"]
df.select(cols).show()

+-----+-----------+-----+-----------+
| name|     person| name|person.name|
+-----+-----------+-----+-----------+
|chuck|[chuck, 42]|chuck|    charles|
|larry|[larry, 73]|larry|   lawrence|
+-----+-----------+-----+-----------+

You definitely don't want to write or maintain code that changes results based on the presence of backticks. It's always better to replace all the dots with underscores when starting the analysis.

clean_df = df.toDF(*(c.replace('.', '_') for c in df.columns))
clean_df.select("person_name", "person.name", "person.age").show()

+-----------+-----+---+
|person_name| name|age|
+-----------+-----+---+
|    charles|chuck| 42|
|   lawrence|larry| 73|
+-----------+-----+---+

This post explains how and why to avoid dots in PySpark columns names in more detail.

For the column name that contains .(dot) you can use the ` character to enclose the column name

df.select("`person.name`")

This selects the outer String person.name: string (nullable = true)

And df.select("person.name")

This gets the person name which is struct

 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)

If you have a column name you can just prepend and append ` character for the column name as

"`" + columnName + "`"

I hope this was helpful!

DataFrame columns names conflict with .(dot)

Tags:

Scala

Apache Spark

Apache Spark Sql

Related

Recent Posts