Hive partitioned table reads all the partitions despite having a Spark filter
This can happen when the metastore does not have the partition values registered for the partition column. Try running the following from Spark:
ALTER TABLE db.table RECOVER PARTITIONS
And then rerun the same query.
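A minimal sketch of that repair step from the spark-shell (`db.table` is the placeholder table name from above; the `SHOW PARTITIONS` call is just one way to confirm the metastore was updated):

```scala
// Register partitions that exist on the filesystem but are missing
// from the metastore (equivalent to Hive's MSCK REPAIR TABLE).
spark.sql("ALTER TABLE db.table RECOVER PARTITIONS")

// Confirm the metastore now lists the expected partition values.
spark.sql("SHOW PARTITIONS db.table").show(truncate = false)
```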
A Parquet Hive table in Spark can be read through one of the following two flows -
Hive flow - This will be used when spark.sql.hive.convertMetastoreParquet is set to false. For partition pruning to work in this case, you also have to set spark.sql.hive.metastorePartitionPruning=true (see the first sketch after this list). From the documentation of that setting: "When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information)."
Datasource flow - This will be used when spark.sql.hive.convertMetastoreParquet is true (the default). This flow has partition pruning turned on by default (see the second sketch below).
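If you need the Hive flow, something like this forces it and enables metastore-side pruning (a sketch; `db.table` and the `dt` partition column are assumptions carried over from above):

```scala
// Force the Hive read flow instead of the datasource conversion.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
// Push partition predicates down to the metastore so non-matching
// partitions are never listed or scanned.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")

// dt is an assumed partition column; replace with your own.
val hiveRead = spark.sql("SELECT * FROM db.table WHERE dt = '2021-01-01'")
hiveRead.explain(true) // the scan should cover only the matching partition
```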
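And for the datasource flow, you can confirm pruning is happening by checking the physical plan for a PartitionFilters entry (same assumed table and column as above):

```scala
// Default behaviour: the Hive Parquet table is converted to a
// datasource relation, which prunes partitions automatically.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true") // default

val dsRead = spark.sql("SELECT * FROM db.table WHERE dt = '2021-01-01'")
// Look for a line like "PartitionFilters: [isnotnull(dt), (dt = 2021-01-01)]"
// in the file scan node of the physical plan.
dsRead.explain()
```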