Pyspark: Convert column to lowercase
If the column is an array of strings, you can use a combination of concat_ws and split:
from pyspark.sql.functions import *
df.withColumn('arr_str', lower(concat_ws('::','arr'))).withColumn('arr', split('arr_str','::')).drop('arr_str')
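For example, a minimal self-contained sketch, assuming a hypothetical DataFrame whose arr column is an array of strings:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, concat_ws, split
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(["Foo", "BAR"],)], ["arr"])
result = (df
    .withColumn("arr_str", lower(concat_ws("::", "arr")))  # join the array into one string, then lowercase it
    .withColumn("arr", split("arr_str", "::"))              # split back into an array
    .drop("arr_str"))
result.show(truncate=False)
# +----------+
# |arr       |
# +----------+
# |[foo, bar]|
# +----------+
The separator only needs to be a string that never appears inside the array elements, so the round trip is lossless.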
Import lower alongside col:
from pyspark.sql.functions import lower, col
Combine them using lower(col("bla")). In a complete query:
spark.table('bla').select(lower(col('bla')).alias('bla'))
which is equivalent to the SQL query
SELECT lower(bla) AS bla FROM bla
To keep the other columns, do
spark.table('foo').withColumn('bar', lower(col('bar')))
Needless to say, this approach is better than using a UDF, because a UDF has to ship each row out to a Python worker (a slow round trip, on top of Python's own overhead), and it is more elegant than writing the query in SQL.
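A minimal end-to-end sketch, assuming a hypothetical DataFrame with columns foo and bar; the UDF variant is shown only for contrast:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, col, udf
from pyspark.sql.types import StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("x", "HELLO"), ("y", "World")], ["foo", "bar"])
# Built-in function: the lowercasing runs entirely inside the JVM.
df.withColumn("bar", lower(col("bar"))).show()
# +---+-----+
# |foo|  bar|
# +---+-----+
# |  x|hello|
# |  y|world|
# +---+-----+
# UDF version (for comparison only): every row round-trips through a Python worker.
lower_udf = udf(lambda s: s.lower() if s is not None else None, StringType())
df.withColumn("bar", lower_udf(col("bar")))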
Another approach which may be a little cleaner:
import pyspark.sql.functions as f
df.select("*", f.lower("my_col"))
this returns a DataFrame with all the original columns, plus the lowercased column appended under a generated name (lower(my_col)); use .alias(...) if you want a different name.
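A short sketch of that behaviour, assuming a hypothetical column named my_col:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "MiXeD")], ["id", "my_col"])
df.select("*", f.lower("my_col").alias("my_col_lower")).show()
# +---+------+------------+
# | id|my_col|my_col_lower|
# +---+------+------------+
# |  1| MiXeD|       mixed|
# +---+------+------------+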