Exists subselect vs inner join?
The purpose of your subquery as I understand it:
Select rows where the latest related entry in billing_pricequotestatus has a qualifying name.
Incorrect 2nd query
You write: "It's not immediately clear to me why my modification didn't produce equivalent output."
The 1st query picks the latest row from billing_pricequotestatus and checks whether the name qualifies: name IN ('adjustment', 'payment', 'billable').
The 2nd query is backwards: it checks for any row with a qualifying name (not just the latest one). Also, it doesn't make sense to compute an aggregate in an EXISTS semi-join; EXISTS only cares whether a row exists at all. You don't want that, and it's not equivalent. Consequently, you get more rows from the 2nd query.
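To illustrate the difference, here's a minimal, self-contained demo; the table status_demo, quote 42 and its rows are invented for the example:

CREATE TEMP TABLE status_demo (quote_id int, name text, created_at timestamptz);
INSERT INTO status_demo VALUES
  (42, 'payment'  , '2016-02-01')
, (42, 'cancelled', '2016-02-05');  -- latest entry does NOT qualify

-- Logic of the 1st query: only the single latest row counts.
-- 'cancelled' fails the test, so quote 42 is (correctly) excluded:
SELECT 42 AS quote_id
WHERE (SELECT name FROM status_demo
       WHERE  quote_id = 42
       ORDER  BY created_at DESC
       LIMIT  1) IN ('adjustment', 'payment', 'billable');  -- 0 rows

-- Logic of the 2nd query: any qualifying row counts.
-- The older 'payment' row matches, so quote 42 slips in:
SELECT 42 AS quote_id
WHERE  EXISTS (SELECT 1 FROM status_demo
               WHERE  quote_id = 42
               AND    name IN ('adjustment', 'payment', 'billable'));  -- 1 row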
Incorrect time range
This predicate is a mess: inefficient and possibly incorrect - or at least a ticking time bomb:
WHERE pq.date_applied AT TIME ZONE 'PST'
BETWEEN '2016-02-02T00:00:00'::timestamp
AND '2016-03-03T22:27:41.734102-08:00'::timestamptz
The column date_applied is of type timestamptz. The construct AT TIME ZONE 'PST' converts it to type timestamp and shifts it by the time offset hard-coded into the time zone abbreviation 'PST' - which is a bad move to begin with. It makes the expression non-sargable: that's more expensive and, more importantly, rules out the use of any plain index on date_applied.
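The type switch is easy to verify in any session:

SELECT pg_typeof(now())                     -- timestamp with time zone
     , pg_typeof(now() AT TIME ZONE 'PST'); -- timestamp without time zone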
Even worse, the time zone abbreviation 'PST' is not aware of DST or any historic shifts in time. If your time zone has (or had in the past) daylight saving time, and your set spans different DST periods, your current expression is most probably incorrect:
- "AT TIME ZONE" with zone name PostgreSQL bug?
- Ignoring timezones altogether in Rails and PostgreSQL
You would need to use the applicable time zone name instead of the abbreviation to get consistent local time - which is even more expensive.
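A quick demo, assuming 'PST' is meant to cover America/Los_Angeles: the same UTC instant, rendered through the fixed-offset abbreviation vs. the full zone name, diverges by an hour during DST:

SELECT timestamptz '2016-07-01 12:00+00' AT TIME ZONE 'PST'                  -- 2016-07-01 04:00:00
     , timestamptz '2016-07-01 12:00+00' AT TIME ZONE 'America/Los_Angeles'; -- 2016-07-01 05:00:00 (DST in effect)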
And there's yet another problem: while the column value is shifted by a hard-coded time offset ('PST'), your upper bound '2016-03-03T22:27:41.734102-08:00'::timestamptz is provided as timestamptz and silently coerced to match the data type timestamp. That cast converts to local time according to the timezone setting of the current session, so you can get different results depending on that setting. I can't think of a use case where this would make sense.
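To demonstrate - the same literal, cast under two different session settings:

SET timezone = 'America/Los_Angeles';
SELECT '2016-03-03T22:27:41.734102-08:00'::timestamptz::timestamp;  -- 2016-03-03 22:27:41.734102
SET timezone = 'UTC';
SELECT '2016-03-03T22:27:41.734102-08:00'::timestamptz::timestamp;  -- 2016-03-04 06:27:41.734102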
Don't do any of this. Don't translate the timestamptz column date_applied to local time at all, don't mix data types like you do, and don't mix different ways to cast. Instead, use the column as is and provide timestamptz parameters.
Query
SELECT i.quote_id, i.acct_id AS account_id, sum(i.delta_amount) AS amt
FROM   billing_pricequote pq
JOIN   LATERAL (
   SELECT name
   FROM   billing_pricequotestatus
   WHERE  quote_id = pq.id
   ORDER  BY created_at DESC
   LIMIT  1
   ) pqs ON pqs.name IN ('adjustment', 'payment', 'billable')
JOIN   billing_lineitem i ON i.quote_id = pq.id
WHERE  pq.date_applied BETWEEN (timestamp '2016-02-02T00:00:00' AT TIME ZONE 'PST')  -- !
                           AND timestamptz '2016-03-03T22:27:41.734102-08:00'
GROUP  BY 1, 2;
Note the LATERAL join: not a LEFT JOIN, but an INNER JOIN, to implement your predicate right away.
Or use the equivalent correlated subquery outlined by @ypercube. Not sure which is faster.
Also note that I base the LATERAL JOIN on billing_pricequote - before joining to the big table billing_lineitem. This way we can eliminate rows early, which should be cheaper.
Index
Currently, you get:
Seq Scan on billing_pricequote pq
Only 70k of your 1.5M rows are selected, which is around 5%. An index on date_applied might help a bit, but not much. However, this multicolumn index should help substantially if you can get index-only scans out of it:
CREATE INDEX foo ON billing_pricequotestatus (quote_id, created_at DESC, name);
Even more efficient with name_id instead of name, as suggested below.
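That variant would look like this, with the name_id column outlined in the table definition below:

CREATE INDEX foo ON billing_pricequotestatus (quote_id, created_at DESC, name_id);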
Statistics
Postgres is over-estimating the selectivity of your time range:
(cost=0.00..88,546.50 rows=7,313 width=4) (actual time=2.353..767.408 rows=70,623 loops=1)
It might help to increase the statistics target for just the column date_applied (sketched after the link). Details here:
- Configuring PostgreSQL for read performance
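A sketch of what that could look like; 1000 is just an example value (the default target is typically 100):

ALTER TABLE billing_pricequote ALTER COLUMN date_applied SET STATISTICS 1000;
ANALYZE billing_pricequote;  -- re-collect statistics so the new target takes effect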
Table definition
An example for billing_pricequotestatus:
name seems to be one of a handful of possible values. It would help performance to normalize some more and just use a 4-byte integer referencing a lookup table instead of a varchar(20) repeated over and over in 3.3M rows. Also, reordering columns (if possible) like I demonstrate would help some more:
   Column   |           Type           |                Modifiers
------------+--------------------------+------------------------------------------
 id         | integer                  | not null default nextval('...
 quote_id   | integer                  | not null
 created_at | timestamp with time zone | not null
 updated_at | timestamp with time zone | not null
 name_id    | integer                  | not null REFERENCES name_table(name_id)
 notes      | text                     | not null
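A sketch of the lookup table implied by the REFERENCES clause; name_table and its columns are placeholders, not existing objects:

CREATE TABLE name_table (
   name_id serial PRIMARY KEY
 , name    varchar(20) UNIQUE NOT NULL  -- 'adjustment', 'payment', 'billable', ...
);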
See link above about alignment and padding. To measure row size:
- Measure the size of a PostgreSQL table row
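For a quick approximation, pg_column_size() over whole rows works (results differ slightly from on-disk tuple size; see the linked answer for exact math):

SELECT avg(pg_column_size(t)) AS avg_row_bytes
FROM   billing_pricequotestatus t;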
And "name" is not a good identifier. I would use something descriptive instead.
I think that the EXISTS subquery:
AND EXISTS(
SELECT s1.quote_id
FROM billing_pricequotestatus s1
INNER JOIN
( SELECT DISTINCT ON (quote_id) quote_id, MAX(created_at) AS max_created_at
FROM billing_pricequotestatus
WHERE quote_id=i.quote_id
GROUP BY quote_id, created_at
ORDER BY quote_id, created_at DESC
) AS s2
ON s1.quote_id = s2.quote_id
AND s1.created_at = s2.max_created_at
WHERE s1.name IN ('adjustment','payment','billable')
)
can be simplified to a correlated subquery:
AND ( SELECT name
FROM billing_pricequotestatus
WHERE quote_id = i.quote_id
ORDER BY created_at DESC
LIMIT 1
) IN ('adjustment', 'payment', 'billable')
An index on (quote_id, created_at DESC, name) would help a lot.
If your version of Postgres is 9.3 or later, it can also be written with a LATERAL join, which may be more efficient - a sketch follows.
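A sketch of that variant - the same logic moved into the FROM list, replacing the AND (...) IN (...) predicate; the alias last_status is just a placeholder:

FROM   billing_lineitem i
CROSS JOIN LATERAL (
   SELECT name
   FROM   billing_pricequotestatus
   WHERE  quote_id = i.quote_id
   ORDER  BY created_at DESC
   LIMIT  1
   ) last_status
WHERE  last_status.name IN ('adjustment', 'payment', 'billable')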
The first question you posted at SO did not include the WHERE quote_id = i.quote_id, so it was not possible for others to know that the subquery was correlated. The answer you got there is correct for that case.