Does Postgres optimize this JOIN with subquery?
To get an actual performance boost, LEFT JOIN
to the aggregating subquery, but repeat the (selective!) predicate of the outer query.
SELECT po.number
, SUM(grouped_items.total_quantity) AS total_quantity
FROM purchase_orders po
LEFT JOIN (
SELECT purchase_order_id AS id
, SUM(quantity) AS total_quantity
FROM items
WHERE purchase_order_id IN (1, 2, 3) -- repeat selective condition !
GROUP BY 1
) grouped_items USING (id)
WHERE po.id IN (1, 2, 3)
GROUP BY po.number;
Luckily, this is possible in your case: the predicate is applicable in the subquery. A bit more verbose, but typically delivers best performance regardless of what percentage of rows in items
is involved. My rule of thumb is "aggregate first and join later". See:
- Postgres CTE optimization with nested json_build_object
Other cases are not so lucky. Then you have to decide which way to go. Postgres is not smart enough to optimize much, as jjanes explained. If all or most rows from items
are involved, aggregating in a subquery is typically (much) faster. If only few rows are involved, correlated subqueries or an equivalent LATERAL
subquery is typically (much) faster. See:
- What is the difference between LATERAL and a subquery in PostgreSQL?
For only 3 rows from the outer query (WHERE po.id IN (1, 2, 3)
), a correlated subquery can't go wrong. But I assume that's just simplification for the demo.
I've been looking into this issue a bit myself lately, and my conclusion is that the planner is not smart enough to optimize this particular thing. The correlated subselect will be executed once for each row even if that is a huge number of rows, while the uncorrelated one will be executed to completion even if only a few rows from it are needed.
It does know that one will be faster than the other (assuming the estimated row counts are reasonably correct) but it lacks the capacity to recognize the two formulations are identical and so choose between execution plans based on estimated performance.
Although in your case, the queries would not be identical because they handle missing rows in "items" differently. The correlated subselect would be identical to the left join, not the inner join.