Sparksql filtering (selecting with where clause) with multiple conditions
Your are using logical conjunction (AND). It means that all columns have to be different than 'null'
for row to be included. Lets illustrate that using filter
version as an example:
numeric = sqlContext.createDataFrame([
('3.5,', '5.0', 'null'), ('2.0', '14.0', 'null'), ('null', '38.0', 'null'),
('null', 'null', 'null'), ('1.0', 'null', '4.0')],
('low', 'high', 'normal'))
numeric_filtered_1 = numeric.where(numeric['LOW'] != 'null')
numeric_filtered_1.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
numeric_filtered_2 = numeric_filtered_1.where(
numeric_filtered_1['NORMAL'] != 'null')
numeric_filtered_2.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## |1.0|null| 4.0|
## +---+----+------+
numeric_filtered_3 = numeric_filtered_2.where(
numeric_filtered_2['HIGH'] != 'null')
numeric_filtered_3.show()
## +---+----+------+
## |low|high|normal|
## +---+----+------+
## +---+----+------+
All remaining methods you've tried follow exactly the same schema. What you need here is a logical disjunction (OR).
from pyspark.sql.functions import col
numeric_filtered = df.where(
(col('LOW') != 'null') |
(col('NORMAL') != 'null') |
(col('HIGH') != 'null'))
numeric_filtered.show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
or with raw SQL:
numeric.registerTempTable("numeric")
sqlContext.sql("""SELECT * FROM numeric
WHERE low != 'null' OR normal != 'null' OR high != 'null'"""
).show()
## +----+----+------+
## | low|high|normal|
## +----+----+------+
## |3.5,| 5.0| null|
## | 2.0|14.0| null|
## |null|38.0| null|
## | 1.0|null| 4.0|
## +----+----+------+
See also: Pyspark: multiple conditions in when clause