Exists subselect vs inner join?
The purpose of your subquery as I understand it:
Select rows where the latest related entry in billing_pricequotestatus has a qualifying name.
Incorrect 2nd query
You write: "It's not immediately clear to me why my modification didn't produce equivalent output."
The 1st query picks the latest row from billing_pricequotestatus and checks whether the name qualifies: name IN ('adjustment', 'payment', 'billable').
The 2nd query is backwards: it checks for any row with a qualifying name (not just the latest one). Also, it doesn't make sense to compute an aggregate in an EXISTS semi-join; EXISTS only cares whether a row exists at all. You don't want that, and it's not equivalent. Consequently, you get more rows from the 2nd query.
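To illustrate the difference, here's a minimal, self-contained demo; the table status_demo, quote 42 and its rows are invented for the example:

CREATE TEMP TABLE status_demo (quote_id int, name text, created_at timestamptz);
INSERT INTO status_demo VALUES
  (42, 'payment'  , '2016-02-01')
, (42, 'cancelled', '2016-02-05');  -- latest entry does NOT qualify

-- Logic of the 1st query: only the single latest row counts.
-- 'cancelled' fails the test, so quote 42 is (correctly) excluded:
SELECT 42 AS quote_id
WHERE (SELECT name FROM status_demo
       WHERE  quote_id = 42
       ORDER  BY created_at DESC
       LIMIT  1) IN ('adjustment', 'payment', 'billable');  -- 0 rows

-- Logic of the 2nd query: any qualifying row counts.
-- The older 'payment' row matches, so quote 42 slips in:
SELECT 42 AS quote_id
WHERE  EXISTS (SELECT 1 FROM status_demo
               WHERE  quote_id = 42
               AND    name IN ('adjustment', 'payment', 'billable'));  -- 1 row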
Incorrect time range
This predicate is a mess: inefficient and possibly incorrect - or at least a ticking time bomb:
WHERE pq.date_applied AT TIME ZONE 'PST'
BETWEEN '2016-02-02T00:00:00'::timestamp
AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
The column date_applied is of type timestamptz. The construct AT TIME ZONE 'PST' converts it to type timestamp and shifts it by the time offset hard-coded into the time zone abbreviation 'PST' - which is a bad move to begin with. It makes the expression non-sargable: that's more expensive and, more importantly, rules out the use of any plain index on date_applied.
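The type switch is easy to verify in any session:

SELECT pg_typeof(now())                     -- timestamp with time zone
     , pg_typeof(now() AT TIME ZONE 'PST'); -- timestamp without time zone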
Even worse, the time zone abbreviation 'PST' is not aware of DST or any historic shifts in time. If your time zone has (or had in the past) daylight saving time, and your set spans different DST periods, your current expression is most probably incorrect:
- "AT TIME ZONE" with zone name PostgreSQL bug?
- Ignoring timezones altogether in Rails and PostgreSQL
You would need to use the applicable time zone name instead of the abbreviation to get consistent local time - which is even more expensive.
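A quick demo, assuming 'PST' is meant to cover America/Los_Angeles: the same UTC instant, rendered through the fixed-offset abbreviation vs. the full zone name, diverges by an hour during DST:

SELECT timestamptz '2016-07-01 12:00+00' AT TIME ZONE 'PST'                  -- 2016-07-01 04:00:00
     , timestamptz '2016-07-01 12:00+00' AT TIME ZONE 'America/Los_Angeles'; -- 2016-07-01 05:00:00 (DST in effect)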
And there's yet another problem: while the column value is shifted by a hard-coded time offset ('PST'), your upper bound '2016-03-03T22:27:41.734102-08:00'::timestamptz is provided as timestamptz and silently coerced to match the data type timestamp. That cast converts to local time according to the timezone setting of the current session, so you can get different results depending on that setting. I can't think of a use case where this would make sense.
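To demonstrate - the same literal, cast under two different session settings:

SET timezone = 'America/Los_Angeles';
SELECT '2016-03-03T22:27:41.734102-08:00'::timestamptz::timestamp;  -- 2016-03-03 22:27:41.734102
SET timezone = 'UTC';
SELECT '2016-03-03T22:27:41.734102-08:00'::timestamptz::timestamp;  -- 2016-03-04 06:27:41.734102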
Don't do any of this. Don't translate the timestamptz column date_applied to local time at all, don't mix data types like you do, and don't mix different ways to cast. Instead, use the column as is and provide timestamptz parameters.
Query
SELECT i.quote_id, i.acct_id AS account_id, sum(i.delta_amount) AS amt
FROM   billing_pricequote pq
JOIN   LATERAL (
   SELECT name
   FROM   billing_pricequotestatus
   WHERE  quote_id = pq.id
   ORDER  BY created_at DESC
   LIMIT  1
   ) pqs ON pqs.name IN ('adjustment', 'payment', 'billable')
JOIN   billing_lineitem i ON i.quote_id = pq.id
WHERE  pq.date_applied BETWEEN (timestamp '2016-02-02T00:00:00' AT TIME ZONE 'PST')  -- !
                           AND timestamptz '2016-03-03T22:27:41.734102-08:00'
GROUP  BY 1, 2;
Note the LATERAL join: not a LEFT JOIN, but an INNER JOIN, to implement your predicate right away.
Or use the equivalent correlated subquery outlined by @ypercube. Not sure which is faster.
Also note that I base the LATERAL JOIN on billing_pricequote - before joining to the big table billing_lineitem. This way we can eliminate rows early, which should be cheaper.
Index
Currently, you get:
Seq Scan on billing_pricequote pq
Only 70k of your 1.5M rows are selected, which is around 5%. An index on date_applied might help a bit, but not much. However, this multicolumn index should help substantially if you can get index-only scans out of it:
CREATE INDEX foo ON billing_pricequotestatus (quote_id, created_at DESC, name);
Even more efficient with name_id instead of name, as suggested below.
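That variant would look like this, with the name_id column outlined in the table definition below:

CREATE INDEX foo ON billing_pricequotestatus (quote_id, created_at DESC, name_id);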
Statistics
Postgres is over-estimating the selectivity of your time range:
(cost=0.00..88,546.50 rows=7,313 width=4) (actual time=2.353..767.408 rows=70,623 loops=1)
It might help to increase the statistics target for just the column date_applied (sketched after the link). Details here:
- Configuring PostgreSQL for read performance
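A sketch of what that could look like; 1000 is just an example value (the default target is typically 100):

ALTER TABLE billing_pricequote ALTER COLUMN date_applied SET STATISTICS 1000;
ANALYZE billing_pricequote;  -- re-collect statistics so the new target takes effect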
Table definition
An example for billing_pricequotestatus:
name seems to be one of a handful of possible values. It would help performance to normalize some more and just use a 4-byte integer referencing a lookup table instead of a varchar(20) repeated over and over in 3.3M rows. Also, reordering columns (if possible) like I demonstrate would help some more:
   Column   |           Type           |                Modifiers
------------+--------------------------+------------------------------------------
 id         | integer                  | not null default nextval('...
 quote_id   | integer                  | not null
 created_at | timestamp with time zone | not null
 updated_at | timestamp with time zone | not null
 name_id    | integer                  | not null REFERENCES name_table(name_id)
 notes      | text                     | not null
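A sketch of the lookup table implied by the REFERENCES clause; name_table and its columns are placeholders, not existing objects:

CREATE TABLE name_table (
   name_id serial PRIMARY KEY
 , name    varchar(20) UNIQUE NOT NULL  -- 'adjustment', 'payment', 'billable', ...
);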
See link above about alignment and padding. To measure row size:
- Measure the size of a PostgreSQL table row
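For a quick approximation, pg_column_size() over whole rows works (results differ slightly from on-disk tuple size; see the linked answer for exact math):

SELECT avg(pg_column_size(t)) AS avg_row_bytes
FROM   billing_pricequotestatus t;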
And "name" is not a good identifier. I would use something descriptive instead.
I think that the EXISTS subquery:
AND EXISTS(
SELECT s1.quote_id
FROM billing_pricequotestatus s1
INNER JOIN
( SELECT DISTINCT ON (quote_id) quote_id, MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
WHERE quote_id=i.quote_id
GROUP BY quote_id, created_at
ORDER BY quote_id, created_at DESC
) AS s2
ON s1.quote_id = s2.quote_id
AND s1.created_at = s2.max_created_at
WHERE s1.name IN ('adjustment','payment','billable')
)
can be simplified to a correlated subquery:
AND ( SELECT name
FROM billing_pricequotestatus
WHERE quote_id = i.quote_id
ORDER BY created_at DESC
LIMIT 1
) IN ('adjustment', 'payment', 'billable')
An index on (quote_id, created_at DESC, name) would help a lot.
If your version of Postgres is 9.3 or later, it can also be written with a LATERAL join, which may be more efficient - a sketch follows.
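A sketch of that variant - the same logic moved into the FROM list, replacing the AND (...) IN (...) predicate; the alias last_status is just a placeholder:

FROM   billing_lineitem i
CROSS JOIN LATERAL (
   SELECT name
   FROM   billing_pricequotestatus
   WHERE  quote_id = i.quote_id
   ORDER  BY created_at DESC
   LIMIT  1
   ) last_status
WHERE  last_status.name IN ('adjustment', 'payment', 'billable')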
The first question you posted at SO did not include the WHERE quote_id = i.quote_id, so it was not possible for others to know that the subquery was correlated. The answer you got there is correct for that case.