Spark Dataframe Random UUID changes after every transformation/action

This is a very old question, but here is what worked for me, in case it helps someone.

You can use the expr function, as below, to generate unique GUIDs that do not change across subsequent transformations.

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for createDataset; already in scope in spark-shell

// create dataframe
val df = spark.createDataset(Seq(("a", "1"), ("b", "2"), ("c", "3"))).toDF("col1", "col2")
df.createOrReplaceTempView("df")
df.show(false)

// generate UUID for new column
val dfWithUuid = df.withColumn("new_uuid", expr("uuid()"))
// show twice to check that the UUIDs stay the same across actions
dfWithUuid.show(false)
dfWithUuid.show(false)

// new transformation: col2 is a string, so col2 + 1 is computed as a double
val dfWithUuidWithNewCol = dfWithUuid.withColumn("col3", df.col("col2") + 1)
dfWithUuidWithNewCol.show(false)

The output is as follows:

+----+----+
|col1|col2|
+----+----+
|a   |1   |
|b   |2   |
|c   |3   |
+----+----+

+----+----+------------------------------------+
|col1|col2|new_uuid                            |
+----+----+------------------------------------+
|a   |1   |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b   |2   |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c   |3   |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+

+----+----+------------------------------------+
|col1|col2|new_uuid                            |
+----+----+------------------------------------+
|a   |1   |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|
|b   |2   |43882a79-8e7f-4002-9740-f22bc6b20db5|
|c   |3   |64bc741a-0d7c-430d-bfe2-a4838f10acd0|
+----+----+------------------------------------+

+----+----+------------------------------------+----+
|col1|col2|new_uuid                            |col3|
+----+----+------------------------------------+----+
|a   |1   |01c4ef0f-9e9b-458e-b803-5f66df1f7cee|2.0 |
|b   |2   |43882a79-8e7f-4002-9740-f22bc6b20db5|3.0 |
|c   |3   |64bc741a-0d7c-430d-bfe2-a4838f10acd0|4.0 |
+----+----+------------------------------------+----+

Here is a PySpark version:

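from pyspark.sql import functions as f

# dataDF is an existing DataFrame; add a UUID column via the SQL uuid() function
pdataDF = dataDF.withColumn("uuid_column", f.expr("uuid()"))
display(pdataDF)  # display() is available in Databricks notebooks; use pdataDF.show() elsewhere
# saving as a table stores the generated UUIDs
pdataDF.write.mode("overwrite").saveAsTable("tempUuidCheck")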

This is expected behavior. User-defined functions have to be deterministic:

The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.

If you want to include a non-deterministic function and preserve its output, you should write the intermediate data to persistent storage and read it back. Checkpointing or caching may work in some simple cases, but it won't be reliable in general.
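
A minimal PySpark sketch of the write-and-read-back approach described above. The helper name `materialize`, the Parquet format, and the overwrite mode are illustrative assumptions, not part of the original answer; any durable storage that Spark can read back should serve the same purpose.

```python
def materialize(spark, df, path):
    """Write df to durable storage and return a DataFrame read back from it.

    `spark` is an active SparkSession and `df` is a DataFrame that already
    contains the non-deterministic column (for example, a UUID column).
    """
    # Store the rows, including the generated values, as Parquet files.
    df.write.mode("overwrite").parquet(path)
    # Downstream work starts from the stored files, so it reads the saved
    # values instead of re-evaluating the generator.
    return spark.read.parquet(path)
```

For example, `stable = materialize(spark, df_with_uuid, "/tmp/df_with_uuid")`, where both the DataFrame name and the path are hypothetical; further transformations would then be applied to `stable`.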

If the upstream process is deterministic (for starters, there is no shuffle), you could try using the rand function with a seed, converting the result to a byte array, and passing it to UUID.nameUUIDFromBytes.
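
The core of this idea can be sketched in plain Python, under some assumptions: `name_uuid_from_bytes` is a hypothetical helper intended to mirror what Java's UUID.nameUUIDFromBytes does (an MD5-based, version 3 UUID), and encoding the double as 8 big-endian bytes is an arbitrary choice. In Spark, the double would come from rand with a fixed seed.

```python
import hashlib
import struct
import uuid


def name_uuid_from_bytes(data: bytes) -> uuid.UUID:
    # MD5 the input, then stamp the version 3 and RFC 4122 variant bits.
    return uuid.UUID(bytes=hashlib.md5(data).digest(), version=3)


def uuid_from_double(value: float) -> str:
    # Encode the double as 8 big-endian bytes (an assumed encoding),
    # then derive a name-based UUID from those bytes.
    return str(name_uuid_from_bytes(struct.pack(">d", value)))
```

The same input value always maps to the same UUID, so the result is only as stable as the seeded random column feeding it, and two rows that happen to receive the same double would receive the same UUID.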

See also: About how to add a new column to an existing DataFrame with random values in Scala

Note: SPARK-20586 introduced a deterministic flag, which can disable certain optimizations, but it is not clear how it behaves when data is persisted and an executor is lost.

