Filtering a Pyspark DataFrame with SQL-like IN clause

Reiterating what @zero323 mentioned in their answer: the same thing works with a list as well (not only a set), as shown below.

from pyspark.sql.functions import col

df.where(col("v").isin(["foo", "bar"])).count()

The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable, you'll have to do it explicitly using string formatting:

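df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
# Spark 1.x API; on Spark 2.0+ use df.createOrReplaceTempView("df") instead
df.registerTempTable("df")
# str.format renders the tuple as ('foo', 'bar'), which reads as a SQL IN list
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2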

Obviously this is not something you would use in a "real" SQL environment, since building queries with string formatting invites SQL injection, but it shouldn't matter here.
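One caveat with formatting a tuple directly: it relies on Python's tuple repr happening to look like SQL, and a one-element tuple renders as ('foo',), which is not standard IN-list syntax. Here is a minimal sketch of a helper that builds the list explicitly. The name sql_in_list is hypothetical (it is not part of PySpark), and it assumes plain string values; it refuses quotes and backslashes instead of guessing the escaping rules of your Spark version.

```python
def sql_in_list(values):
    """Render an iterable of plain strings as a SQL IN list, e.g. ('foo', 'bar').

    Hypothetical helper, not part of PySpark.
    """
    items = list(values)
    if not items:
        raise ValueError("an IN list needs at least one value")
    for item in items:
        if not isinstance(item, str):
            raise TypeError("only string values are handled in this sketch")
        # Conservative assumption: reject characters that would need escaping
        # rather than guess how the SQL parser treats them.
        if "'" in item or "\\" in item:
            raise ValueError("value needs escaping: {0!r}".format(item))
    return "(" + ", ".join("'{0}'".format(item) for item in items) + ")"


# Same query text as the tuple-formatting version, but safe for one value too.
query = "SELECT * FROM df WHERE v IN {0}".format(sql_in_list(["foo", "bar"]))
single = "SELECT * FROM df WHERE v IN {0}".format(sql_in_list(["foo"]))
```

The resulting string can be passed to sqlContext.sql(query) just like the formatted tuple.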

In practice, the DataFrame DSL is a much better choice when you want to create dynamic queries:

from pyspark.sql.functions import col

df.where(col("v").isin({"foo", "bar"})).count()
## 2

It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
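To illustrate the "build and compose" point, here is a rough sketch that turns a dict of column name to allowed values into a single filter by AND-ing one isin() test per column. The helper names clean_allowed and build_in_filter are made up for this example, and it assumes the values for each column are mutually comparable (they get sorted after de-duplication).

```python
from functools import reduce
from operator import and_


def clean_allowed(allowed):
    """Drop columns with no values and de-duplicate the rest (pure Python)."""
    return {name: sorted(set(values)) for name, values in allowed.items() if values}


def build_in_filter(allowed):
    """Combine one isin() test per column into a single Column with AND.

    Hypothetical helper; needs an active Spark session when called.
    """
    # Imported lazily so clean_allowed stays usable without Spark installed.
    from pyspark.sql.functions import col

    cleaned = clean_allowed(allowed)
    if not cleaned:
        raise ValueError("at least one column needs a non-empty value list")
    # Column objects support &, so the individual tests fold into one expression.
    conditions = [col(name).isin(values) for name, values in cleaned.items()]
    return reduce(and_, conditions)
```

With the df from the example above, df.where(build_in_filter({"v": ["foo", "bar"], "k": [1, 3]})) should keep only the rows with k = 1 and k = 3, with no SQL string assembled by hand.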


