RANK used in 'where' returns invalid column, but exists in results set

You cannot use aliases in WHERE clauses, nor can you use Window functions.

To demonstrate, consider this example (fiddle).

CREATE TABLE payment (amount INTEGER, pay_date DATE);

INSERT INTO payment VALUES (54,  '2019-09-01'), (56,  '2019-09-01'), (154, '2019-09-02'), 
(156, '2019-09-02'), (254, '2019-09-03'), (256, '2019-09-03');

Then run the two following queries:

Click to copy

SELECT 
  *, 
  SUM(amount) OVER (PARTITION BY pay_date) AS the_sums
FROM payment
WHERE the_sums > 200; -- ERROR:  column "the_sums" does not exist

and

Click to copy

SELECT 
  *, 
  SUM(amount) OVER (PARTITION BY pay_date) AS the_sums
FROM payment
WHERE SUM(amount) OVER (PARTITION BY pay_date) > 200;
-- ERROR:  window functions are not allowed in WHERE

You didn't post your error message or PostgreSQL version, but your problem is that timestamp_d1_rank and timestamp_d2_rank are aliases.

You have two options:

First Option: use the aliases/Window functions within a subquery (also in fiddle)

Click to copy

SELECT * FROM
(
SELECT 
  *, 
  SUM(amount) OVER (PARTITION BY pay_date) AS the_sums
FROM payment
) AS tab
WHERE the_sums < 200;

Result:

Click to copy

amount    pay_date  the_sums
    54  2019-09-01       110
    56  2019-09-01       110

Second Option: you can use a CTE (aka the WITH clause as outlined here or as in @ypercubeᵀᴹ 's answer above or also see the fiddle.

For a good explanation of this whole area, see here.

You cannot use window/ranking functions or their aliases in the WHERE clause because they are evaluated in SELECT, after the WHERE has been evaluated.

You can use a subquery (derived table or CTE) to process a second WHERE clause, after the window functions have been evaluated:

Click to copy

SELECT d1, d2,
       "DID",
       t1, t2,
       coords_centroid_a,
       coords_centroid_b
FROM
  (
    SELECT  a.timestamp_intersecting_date d1, b.timestamp_intersecting_date d2,   
            a."DID", 
            a.timestamp_intersecting_max t1, b.timestamp_intersecting_min t2,
            RANK () OVER ( 
              PARTITION BY a.timestamp_intersecting_date
              ORDER BY a.timestamp_intersecting_max DESC
           ) timestamp_d1_rank ,
            RANK () OVER ( 
              PARTITION BY b.timestamp_intersecting_date
              ORDER BY b.timestamp_intersecting_max ASC
           ) timestamp_d2_rank,
            a.coords_centroid AS coords_centroid_a, 
            b.coords_centroid AS coords_centroid_b
    FROM
        signals a
    INNER JOIN signals b ON (a."DID" = b."DID")
    WHERE (b.timestamp_intersecting_date = a.timestamp_intersecting_date + INTERVAL '1 DAY')
    AND a."DID" = b."DID"
  ) AS t
WHERE t.timestamp_d1_rank = 1
  AND t.timestamp_d2_rank = 1
ORDER BY "DID", t1 DESC, t2 ASC ;

The confusion with syntax rules concerning input and output column names has been addressed properly by existing answers. This is about a lurking performance problem.

There is an transient CROSS JOIN in your query that scales terribly with growing number of rows per ("DID", date_col). Example: If a "DID" has 100 coordinates per day, the query ends up processing 10.000 (!) combinations with the next day.

Based on this assumed table definition:

Click to copy

CREATE TABLE signals (
   "DID" serial PRIMARY KEY
 , date_col date NOT NULL  -- ?
 , time_col time NOT NULL  -- ?
 , coords_centroid geography  -- ?
);

And assuming there are only few rows per ("DID", date_col), this query should be much faster already:

Click to copy

SELECT a."DID", a.date_col, t1, coords1, t2, coords2
FROM  (  -- first row per ("DID", date_col)
   SELECT DISTINCT ON ("DID", date_col)
          "DID", date_col, time_col AS t2, coords_centroid AS coords2
   FROM   signals
   ORDER  BY "DID", date_col, time_col
   ) b
CROSS JOIN LATERAL (  -- corresponding last row from previous day
   SELECT "DID", date_col, time_col AS t1, coords_centroid AS coords1
   FROM   signals a
   WHERE  a."DID" = b."DID"
   AND    a.date_col = b.date_col - 1
   ORDER  BY time_col DESC
   ) a;

Needs an index on ("DID", date_col, time_col).

More might be done here, depending on your actual setup and actual data distribution. I'll elaborate if you are interested and provide missing information ...

RANK used in 'where' returns invalid column, but exists in results set

Tags:

Postgresql

Related

Recent Posts