Use Map to replace column values in Spark

Instead of Map[Column, Column] you should use a Column containing a map literal:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.typedLit

val translationMap: Column = typedLit(Map(
  "foo" -> "bar",
  "baz" -> "bab"
))

The rest of your code can stay as-is:

df.select(
  col("mov"),
  translationMap(col("mov"))
).show

+---+---------------------------------------+
|mov|keys: [foo,baz], values: [bar,bab][mov]|
+---+---------------------------------------+
|foo|                                    bar|
|baz|                                    bab|
+---+---------------------------------------+
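If the goal is to replace values while leaving untranslated ones untouched, the lookup can be wrapped in coalesce. The following is a minimal sketch, not part of the original answer: the local SparkSession, the sample rows and the extra "qux" value are assumptions for illustration. It also assumes a Spark version and configuration in which looking up a missing key in a map column returns null rather than raising an error (ANSI mode can change this on some versions).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, typedLit}

// Hypothetical local session and sample data, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("translation-map").getOrCreate()
import spark.implicits._

val df = Seq("foo", "baz", "qux").toDF("mov")

// Map literal column, as in the answer above.
val translationMap = typedLit(Map("foo" -> "bar", "baz" -> "bab"))

// A missing key is assumed to yield null, so coalesce falls back
// to the original value of mov for rows without a translation.
val replaced = df.withColumn("mov", coalesce(translationMap(col("mov")), col("mov")))

replaced.show()
```

Because everything here is built from built-in column expressions, the lookup stays inside Spark's optimizer instead of going through a UDF.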

You cannot reference a Scala collection declared on the driver like this inside a distributed DataFrame. One alternative is to use a UDF, although it will not perform well on a large dataset, because Spark treats UDFs as opaque functions and cannot optimize them.

import org.apache.spark.sql.functions.{col, udf}

val translationMap = Map("foo" -> "bar", "baz" -> "bab")
// Returns null when the key has no translation.
val getTranslationValue = udf((x: String) => translationMap.get(x).orNull)
df.select(col("mov"), getTranslationValue(col("mov")).as("value")).show

//+---+-----+
//|mov|value|
//+---+-----+
//|foo|  bar|
//|baz|  bab|
//+---+-----+

Another solution would be to load the Map as a Dataset[(String, String)] and then join the two datasets, using mov as the key.
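A minimal sketch of the join approach follows. It is an illustration under assumptions rather than the original author's code: the local SparkSession, the sample rows, the "value" column name, the choice of a left join and the broadcast hint are all introduced here. The map is turned into a two-column DataFrame (via toDF rather than a typed Dataset, so the columns get readable names).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical local session and sample data, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("translation-join").getOrCreate()
import spark.implicits._

val df = Seq("foo", "baz", "qux").toDF("mov")

// Load the driver-side Map as a small lookup table with columns (mov, value).
val translations = Map("foo" -> "bar", "baz" -> "bab").toSeq.toDF("mov", "value")

// A left join keeps rows that have no translation (their value is null).
// broadcast() is a hint that the small lookup table can be copied to every executor.
val joined = df.join(broadcast(translations), Seq("mov"), "left")

joined.show()
```

An inner join would instead drop the rows without a translation; which one fits depends on whether unmapped values should survive.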

Tags: Scala, Apache Spark, Apache Spark SQL
