Hive partitioned table reads all the partitions despite having a Spark filter
This can happen when the metastore does not have the partition values registered for the partition column. Try running the following from Spark:
ALTER TABLE db.table RECOVER PARTITIONS
And then rerun the same query.
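A minimal sketch of that repair step from the spark-shell (`db.table` is the placeholder table name from above; the `SHOW PARTITIONS` call is just one way to confirm the metastore was updated):

```scala
// Register partitions that exist on the filesystem but are missing
// from the metastore (equivalent to Hive's MSCK REPAIR TABLE).
spark.sql("ALTER TABLE db.table RECOVER PARTITIONS")

// Confirm the metastore now lists the expected partition values.
spark.sql("SHOW PARTITIONS db.table").show(truncate = false)
```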
A Parquet Hive table in Spark can be read through one of the following two flows -
Hive flow - This will be used when spark.sql.hive.convertMetastoreParquet is set to false. For partition pruning to work in this case, you also have to set spark.sql.hive.metastorePartitionPruning=true (see the first sketch after this list). From the documentation of that setting: "When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information)."
Datasource flow - This will be used when spark.sql.hive.convertMetastoreParquet is true (the default). This flow has partition pruning turned on by default (see the second sketch below).
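If you need the Hive flow, something like this forces it and enables metastore-side pruning (a sketch; `db.table` and the `dt` partition column are assumptions carried over from above):

```scala
// Force the Hive read flow instead of the datasource conversion.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
// Push partition predicates down to the metastore so non-matching
// partitions are never listed or scanned.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

// dt is an assumed partition column; replace with your own.
val hiveRead = spark.sql("SELECT * FROM db.table WHERE dt = '2021-01-01'")
hiveRead.explain(true) // the scan should cover only the matching partition
```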
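And for the datasource flow, you can confirm pruning is happening by checking the physical plan for a PartitionFilters entry (same assumed table and column as above):

```scala
// Default behaviour: the Hive Parquet table is converted to a
// datasource relation, which prunes partitions automatically.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true") // default

val dsRead = spark.sql("SELECT * FROM db.table WHERE dt = '2021-01-01'")
// Look for a line like "PartitionFilters: [isnotnull(dt), (dt = 2021-01-01)]"
// in the file scan node of the physical plan.
dsRead.explain()
```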