Count rows in partition with Order By
The "why" has already been explained by others. Sometimes you have an ordered window but still need a count over the whole partition despite the ORDER BY.
To do so, use an unbounded frame with RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:
create table search_log
(
    id bigint not null primary key,
    query varchar(255) not null,
    stemmed_query varchar(255) not null,
    created timestamp not null
);
SELECT query,
       created AS seen_on,
       first_value(created) OVER query_window AS last_seen,
       row_number() OVER query_window AS rn,
       count(*) OVER query_window AS occurrences
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
                        RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
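To see the effect of the frame clause in isolation, here is a minimal sketch with made-up inline rows (the sample CTE and its values are hypothetical, not taken from the table above). It puts the default frame and the explicit unbounded frame side by side:

```sql
-- Hypothetical sample data: three rows for 'cat', one for 'dog'.
with sample (stemmed_query, created) as (
  values
    ('cat', timestamp '2020-01-01 10:00'),
    ('cat', timestamp '2020-01-02 10:00'),
    ('cat', timestamp '2020-01-03 10:00'),
    ('dog', timestamp '2020-01-01 12:00')
)
select stemmed_query,
       created,
       -- default frame with ORDER BY: rows up to the current row (running count)
       count(*) over (partition by stemmed_query
                      order by created desc) as running_count,
       -- explicit unbounded frame: every row of the partition
       count(*) over (partition by stemmed_query
                      order by created desc
                      range between unbounded preceding
                                and unbounded following) as partition_count
from sample
order by stemmed_query, created desc;
```

For the three 'cat' rows, running_count is 1, 2, 3 while partition_count is 3 on every row; the ordering stays available for functions such as first_value() and row_number().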
When you add an ORDER BY to an aggregate used as a window function, that aggregate turns into a "running count" (or a running version of whatever aggregate you use). The default frame then ends at the current row, so count(*) returns the number of rows up to the "current one" (including any rows that tie with it) based on the order specified.
The following query shows the different results for aggregates used with an ORDER BY. With sum() instead of count() it's a bit easier to see (in my opinion).
with test (id, num, x) as (
  values
    (1, 4, 1),
    (2, 4, 1),
    (3, 5, 2),
    (4, 6, 2)
)
select id,
       num,
       x,
       count(*) over () as total_rows,
       count(*) over (order by id) as rows_upto,
       count(*) over (partition by x order by id) as rows_per_x,
       sum(num) over (partition by x) as total_for_x,
       sum(num) over (order by id) as sum_upto,
       sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
will result in:
id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
 1 |   4 | 1 |          4 |         1 |          1 |           8 |        4 |              4
 2 |   4 | 1 |          4 |         2 |          2 |           8 |        8 |              8
 3 |   5 | 2 |          4 |         3 |          1 |          11 |       13 |              5
 4 |   6 | 2 |          4 |         4 |          2 |          11 |       19 |             11
There are more examples in the Postgres manual.
Your two expressions are:
COUNT(id) OVER (PARTITION BY num)
COUNT(id) OVER (PARTITION BY num ORDER BY id)
Why would you expect these to return the same values? The syntax is different for a reason.
The first returns the overall count for each num, essentially joining the aggregated value back to every row.
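The "joining back" reading can be made concrete. A sketch, assuming a hypothetical table t(id, num) with made-up rows (both the name and the data are illustrative only): the window version and the aggregate-then-join version produce the same count on every row.

```sql
-- Hypothetical sample data: two rows with num = 10, one with num = 20.
with t (id, num) as (
  values (1, 10), (2, 10), (3, 20)
),
per_num as (
  -- plain aggregate: one row per num
  select num, count(id) as cnt
  from t
  group by num
)
select t.id,
       t.num,
       count(t.id) over (partition by t.num) as cnt_window,  -- window version
       p.cnt as cnt_joined                                   -- aggregate joined back
from t
join per_num p on p.num = t.num
order by t.id;
```

Both count columns are 2 for ids 1 and 2, and 1 for id 3.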
The second does a cumulative count: for each row, it computes COUNT() over all rows in the partition up to and including that row's id value.
Note that such cumulative counts would normally be implemented using RANK() (or related functions).
The cumulative count is subtly different from RANK(). The cumulative count implements:
COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.
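A small sketch of that difference, using made-up rows in which two rows share the same ORDER BY key (the scores CTE and its columns are hypothetical):

```sql
-- Hypothetical sample data: players 'a' and 'b' tie on score.
with scores (player, score) as (
  values ('a', 10), ('b', 10), ('c', 20)
)
select player,
       score,
       -- rank(): 1 + number of rows strictly before the current row's peer group
       rank() over (order by score) as rnk,
       -- cumulative count: the RANGE frame includes all peers of the current row
       count(*) over (order by score
                      range between unbounded preceding
                                and current row) as cum_count
from scores
order by score, player;
```

The tied rows get rnk = 1 but cum_count = 2, while the last row gets 3 in both columns: RANK() assigns tied rows the lowest position of the group, the cumulative count assigns them the highest.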