How to create a new column with random values in PySpark?
Just build an array of the possible values and pick one of them at random by index:
from pyspark.sql import functions as F

df.withColumn(
    "business_vertical",
    F.array(
        F.lit("Retail"),
        F.lit("SME"),
        F.lit("Cor"),
    ).getItem(
        # F.rand() is uniform in [0, 1), so this index is always 0, 1, or 2
        (F.rand() * 3).cast("int")
    )
)
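The same pattern generalizes to any number of choices by building the array from a Python list. A sketch, assuming a hypothetical choices list (the extra "Wholesale" value is purely illustrative):

choices = ["Retail", "SME", "Cor", "Wholesale"]
df.withColumn(
    "business_vertical",
    # F.array accepts a list of Columns; len(choices) keeps the index in range
    F.array([F.lit(c) for c in choices]).getItem(
        (F.rand() * len(choices)).cast("int")
    ),
)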
Here's how you can solve this with the array_choice function in quinn (a third-party helper library, installable with pip install quinn):
import quinn

df = spark.createDataFrame([('a',), ('b',), ('c',)], ['letter'])
cols = [F.lit(c) for c in ['Retail', 'SME', 'Cor']]
df.withColumn('business_vertical', quinn.array_choice(F.array(cols))).show()
+------+-----------------+
|letter|business_vertical|
+------+-----------------+
| a| SME|
| b| Retail|
| c| SME|
+------+-----------------+
array_choice is generic and can easily be used to select a random value from an existing ArrayType column. Suppose you have the following DataFrame.
+------------+
| letters|
+------------+
| [a, b, c]|
|[a, b, c, d]|
| [x]|
| []|
+------------+
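If you want to follow along, here's one way to build that DataFrame (a sketch; the explicit schema keeps the empty array from tripping up type inference):

df = spark.createDataFrame(
    [(['a', 'b', 'c'],), (['a', 'b', 'c', 'd'],), (['x'],), ([],)],
    'letters array<string>',
)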
Here's how you can grab a random letter.
actual_df = df.withColumn(
    "random_letter",
    quinn.array_choice(F.col("letters"))
)
actual_df.show()
+------------+-------------+
| letters|random_letter|
+------------+-------------+
| [a, b, c]| a|
|[a, b, c, d]| d|
| [x]| x|
| []| null|
+------------+-------------+
Here's the array_choice function definition:
def array_choice(col):
    # pick a uniformly random index in [0, size(col)); for an empty array the
    # index is out of range and col[index] returns null, as shown above
    index = (F.rand() * F.size(col)).cast("int")
    return col[index]
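If you need reproducible picks (in tests, say), a seeded variant is a one-line change. This is only a sketch, not part of quinn's documented API; array_choice_seeded is a hypothetical name, and it relies on F.rand accepting an optional seed:

def array_choice_seeded(col, seed):
    # same logic as array_choice, but seeding F.rand makes the picks deterministic
    index = (F.rand(seed) * F.size(col)).cast("int")
    return col[index]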
This post explains fetching random values from PySpark arrays in more detail.