How to create a new column with random values in PySpark?
Just build an array of the possible values and pick one of them at random by index:
from pyspark.sql import functions as F

df.withColumn(
    "business_vertical",
    F.array(
        F.lit("Retail"),
        F.lit("SME"),
        F.lit("Cor"),
    ).getItem(
        # F.rand() is uniform in [0, 1), so this index is always 0, 1, or 2
        (F.rand() * 3).cast("int")
    )
)
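The same pattern generalizes to any number of choices by building the array from a Python list. A sketch, assuming a hypothetical choices list (the extra "Wholesale" value is purely illustrative):

choices = ["Retail", "SME", "Cor", "Wholesale"]
df.withColumn(
    "business_vertical",
    # F.array accepts a list of Columns; len(choices) keeps the index in range
    F.array([F.lit(c) for c in choices]).getItem(
        (F.rand() * len(choices)).cast("int")
    ),
)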
Here's how you can solve this with the array_choice function in quinn (a third-party helper library, installable with pip install quinn):
import quinn

df = spark.createDataFrame([('a',), ('b',), ('c',)], ['letter'])
cols = [F.lit(c) for c in ['Retail', 'SME', 'Cor']]
df.withColumn('business_vertical', quinn.array_choice(F.array(cols))).show()
+------+-----------------+
|letter|business_vertical|
+------+-----------------+
| a| SME|
| b| Retail|
| c| SME|
+------+-----------------+
array_choice is generic and can easily be used to select a random value from an existing ArrayType column. Suppose you have the following DataFrame.
+------------+
| letters|
+------------+
| [a, b, c]|
|[a, b, c, d]|
| [x]|
| []|
+------------+
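If you want to follow along, here's one way to build that DataFrame (a sketch; the explicit schema keeps the empty array from tripping up type inference):

df = spark.createDataFrame(
    [(['a', 'b', 'c'],), (['a', 'b', 'c', 'd'],), (['x'],), ([],)],
    'letters array<string>',
)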
Here's how you can grab a random letter.
actual_df = df.withColumn(
    "random_letter",
    quinn.array_choice(F.col("letters"))
)
actual_df.show()
+------------+-------------+
| letters|random_letter|
+------------+-------------+
| [a, b, c]| a|
|[a, b, c, d]| d|
| [x]| x|
| []| null|
+------------+-------------+
Here's the array_choice function definition:
def array_choice(col):
    # pick a uniformly random index in [0, size(col)); for an empty array the
    # index is out of range and col[index] returns null, as shown above
    index = (F.rand() * F.size(col)).cast("int")
    return col[index]
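If you need reproducible picks (in tests, say), a seeded variant is a one-line change. This is only a sketch, not part of quinn's documented API; array_choice_seeded is a hypothetical name, and it relies on F.rand accepting an optional seed:

def array_choice_seeded(col, seed):
    # same logic as array_choice, but seeding F.rand makes the picks deterministic
    index = (F.rand(seed) * F.size(col)).cast("int")
    return col[index]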
This post explains fetching random values from PySpark arrays in more detail.