Replacing whitespace in all column names in spark Dataframe
As best practice, you should prefer expressions and immutability.
You should use val
and not var
as much as possible.
Thus, it's preferable to use the foldLeft
operator, in this case :
val newDf = df.columns
.foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
In Python, this can be done by the following code:
# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col
# Building a simple dataframe:
schema = StructType([
StructField("id name", StringType(), True),
StructField("cities venezuela", StringType(), True)
])
column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']
# Dataframe:
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()
var newDf = df
for(col <- df.columns){
newDf = newDf.withColumnRenamed(col,col.replaceAll("\\s", "_"))
}
You can encapsulate it in some method so it won't be too much pollution.