DataFrame columns names conflict with .(dot)
To access the column name
with a period using pyspark, do this:
spark.sql("select person.name from person_table")
Note: person_table is a registerTempTable on df.
My answer provides a working code snippet that illustrates the problem of having dots in column names and explains how you can easily remove dots from column names.
Let's create a DataFrame with some sample data:
schema = StructType([
StructField("person.name", StringType(), True),
StructField("person", StructType([
StructField("name", StringType(), True),
StructField("age", IntegerType(), True)]))
])
data = [
("charles", Row("chuck", 42)),
("larry", Row("chipper", 48))
]
df = spark.createDataFrame(data, schema)
df.show()
+-----------+-------------+
|person.name| person|
+-----------+-------------+
| charles| [chuck, 42]|
| larry|[chipper, 48]|
+-----------+-------------+
Let's illustrate that selecting person.name
will return different results depending on if backticks are used or not.
cols = ["person.name", "person", "person.name", "`person.name`"]
df.select(cols).show()
+-----+-----------+-----+-----------+
| name| person| name|person.name|
+-----+-----------+-----+-----------+
|chuck|[chuck, 42]|chuck| charles|
|larry|[larry, 73]|larry| lawrence|
+-----+-----------+-----+-----------+
You definitely don't want to write or maintain code that changes results based on the presence of backticks. It's always better to replace all the dots with underscores when starting the analysis.
clean_df = df.toDF(*(c.replace('.', '_') for c in df.columns))
clean_df.select("person_name", "person.name", "person.age").show()
+-----------+-----+---+
|person_name| name|age|
+-----------+-----+---+
| charles|chuck| 42|
| lawrence|larry| 73|
+-----------+-----+---+
This post explains how and why to avoid dots in PySpark columns names in more detail.
For the column name that contains .(dot)
you can use the `
character to enclose the column name
df.select("`person.name`")
This selects the outer String person.name: string (nullable = true)
And
df.select("person.name")
This gets the person name which is struct
|-- person: struct (nullable = true)
| |-- age: long (nullable = true)
If you have a column name you can just prepend and append ` character for the column name as
"`" + columnName + "`"
I hope this was helpful!