Replacing whitespace in all column names in spark Dataframe

As best practice, you should prefer expressions and immutability. You should use val and not var as much as possible.

Thus, it's preferable to use the foldLeft operator, in this case :

val newDf = df.columns
              .foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))

In Python, this can be done by the following code:

# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col

# Building a simple dataframe:
schema = StructType([
             StructField("id name", StringType(), True),
             StructField("cities venezuela", StringType(), True)
         ])

column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']

# Dataframe:
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()

exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()

  var newDf = df
  for(col <- df.columns){
    newDf = newDf.withColumnRenamed(col,col.replaceAll("\\s", "_"))
  }

You can encapsulate it in some method so it won't be too much pollution.

Replacing whitespace in all column names in spark Dataframe

Tags:

Scala

Apache Spark

Apache Spark Sql

Spark Dataframe

Related

Recent Posts