redshift: count distinct customers over window partition
While Redshift doesn't support DISTINCT aggregates in its window functions, it does support DISTINCT inside its listagg
window function. So you can do this:
regexp_count(
  listagg(distinct customer_id, ',') over (partition by field2),
  ','
) + 1
Of course, if you have commas naturally occurring in your customer_id strings, you'll have to find a safe delimiter.
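The delimiter-counting arithmetic is easy to sanity-check outside the database; this plain-Python sketch (sample data invented) mimics what the listagg/regexp_count pair computes for one partition:

```python
# Simulate the listagg(distinct ...) + regexp_count trick in plain Python.
customer_ids = ["c1", "c2", "c1", "c3", "c2"]

# What listagg(distinct customer_id, ',') would hand back for one partition:
distinct_list = ",".join(sorted(set(customer_ids)))  # "c1,c2,c3"

# regexp_count(..., ',') + 1 -- delimiters are one fewer than items joined.
distinct_count = distinct_list.count(",") + 1
print(distinct_count)  # 3
```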
A blog post from 2016 calls out this problem and provides a rudimentary workaround, so thank you Mark D. Adams. Strangely, there is very little else I could find anywhere on the web, so I'm sharing my (tested) solution.
The key insight is that dense_rank(), ordered by the item in question, gives identical items the same rank, so the highest rank is also the count of unique items. It turns into a horrible mess, though, when you swap in one of the following for each partition you want:
dense_rank() over(partition by order_month, traffic_channel order by customer_id)
Since you need the highest rank, you have to subquery everything and select the max value of each ranking. It's important to match the partitions in the outer query to the corresponding partitions in the subquery.
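Before the full query, the core dense_rank trick can be verified outside Redshift: SQLite (3.25+) implements the same window function, so a small Python sketch (table and sample data invented for the demo) shows that the max rank equals the distinct count:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (customer_id text)")
conn.executemany("insert into orders values (?)",
                 [("a",), ("b",), ("a",), ("c",), ("b",)])

# Identical customer_ids share a rank, so max(rank) = count of uniques.
max_rank = conn.execute("""
    select max(rnk) from (
        select dense_rank() over (order by customer_id) as rnk
        from orders
    )
""").fetchone()[0]
print(max_rank)  # 3, i.e. count(distinct customer_id)
```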
/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
    order_month
    , traffic_channel
    , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
    , max(tc_rnk) over(partition by traffic_channel) ytd_customers_by_channel
    , max(mth_rnk) over(partition by order_month) monthly_customers_all_channels
    , max(cust_rnk) over() ytd_total_customers
from (
    select order_month
        , traffic_channel
        , dense_rank() over(partition by order_month, traffic_channel order by customer_id) tc_mth_rnk
        , dense_rank() over(partition by traffic_channel order by customer_id) tc_rnk
        , dense_rank() over(partition by order_month order by customer_id) mth_rnk
        , dense_rank() over(order by customer_id) cust_rnk
    from orders_traffic_channels
    where to_char(order_month, 'YYYY') = '2017'
) ranks
order by order_month, traffic_channel
;
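A scaled-down version of this query (two grains instead of four, invented sample data, no date filter) can be checked against the expected distinct counts in SQLite, which supports the same dense_rank and max window functions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table orders_traffic_channels
                (order_month text, traffic_channel text, customer_id text)""")
conn.executemany("insert into orders_traffic_channels values (?,?,?)", [
    ("2017-01", "email", "a"), ("2017-01", "email", "b"),
    ("2017-01", "email", "a"),                      # repeat customer in grain
    ("2017-01", "paid",  "a"), ("2017-02", "email", "c"),
])

# One dense_rank per grain inside, one matching max() over() outside.
result = conn.execute("""
    select distinct
        order_month
        , traffic_channel
        , max(tc_mth_rnk) over (partition by order_month, traffic_channel) as customers_by_channel_and_month
        , max(cust_rnk) over () as total_customers
    from (
        select order_month
            , traffic_channel
            , dense_rank() over (partition by order_month, traffic_channel order by customer_id) as tc_mth_rnk
            , dense_rank() over (order by customer_id) as cust_rnk
        from orders_traffic_channels
    ) ranks
    order by order_month, traffic_channel
""").fetchall()
print(result)
# [('2017-01', 'email', 2, 3), ('2017-01', 'paid', 1, 3), ('2017-02', 'email', 1, 3)]
```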
notes
- partitions of max() and dense_rank() must match
- dense_rank() will rank null values (all at the same rank, the max). If you don't want to count null values you need a case when customer_id is not null then dense_rank() ...etc..., or you can subtract one from the max() if you know there are nulls.
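The null caveat is easy to demonstrate. One behavioral difference to note: SQLite sorts nulls first while Redshift sorts them last by default, but either way all nulls occupy exactly one dense rank, so the subtract-one correction holds (sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table orders (customer_id text)")
conn.executemany("insert into orders values (?)",
                 [("a",), ("b",), (None,), (None,)])

# All null customer_ids share one rank, so max(dense_rank) overcounts by one.
max_rank = conn.execute("""
    select max(rnk) from (
        select dense_rank() over (order by customer_id) as rnk from orders
    )
""").fetchone()[0]

# count(distinct ...) ignores nulls, which is usually what you want.
distinct_non_null = conn.execute(
    "select count(distinct customer_id) from orders").fetchone()[0]

print(max_rank, distinct_non_null)  # 3 2
```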
Update 2022
Count distinct over partitions in Redshift is still not implemented.
I've concluded that this workaround is reasonable if you take care when incorporating it into production pipelines, with these points in mind:
- It creates a lot of code, which can hurt readability and maintenance.
- Isolate this process of counting by groups into one transform stage rather than mixing it with other logical concepts in the same query.
- Using subqueries and non-partitioned groups with count(distinct ..) to get each of your distinct counts is even messier and less readable.
However, the better way is to use dataframe languages that support grouped rollups, like Spark or Pandas. Spark rollups by group are compact and readable; the tradeoff is bringing another execution environment and language into your flows.
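For comparison, the same distinct counts in Pandas are one nunique() call per grain, no rank/max gymnastics (column names and sample data invented):

```python
import pandas as pd

df = pd.DataFrame({
    "order_month":     ["2017-01", "2017-01", "2017-01", "2017-02"],
    "traffic_channel": ["email",   "email",   "paid",    "email"],
    "customer_id":     ["a",       "b",       "a",       "c"],
})

# One nunique() per grain replaces one dense_rank/max pair per grain in SQL.
by_channel_and_month = (df.groupby(["order_month", "traffic_channel"])
                          ["customer_id"].nunique())
total = df["customer_id"].nunique()

print(by_channel_and_month.to_dict())
# {('2017-01', 'email'): 2, ('2017-01', 'paid'): 1, ('2017-02', 'email'): 1}
print(total)  # 3
```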
Another approach is to use row_number(). In the first select:
row_number() over (partition by customer_id, order_month, traffic_channel) as row_n_month_channel
and in the next select:
sum(case when row_n_month_channel = 1 then 1 else 0 end)
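This variant can also be checked in SQLite: each distinct (customer, month, channel) combination gets exactly one row numbered 1, so summing those flags per group counts distinct customers (sample data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table orders_traffic_channels
                (order_month text, traffic_channel text, customer_id text)""")
conn.executemany("insert into orders_traffic_channels values (?,?,?)", [
    ("2017-01", "email", "a"), ("2017-01", "email", "a"),  # repeat customer
    ("2017-01", "email", "b"), ("2017-01", "paid",  "a"),
])

# Flag the first row per (customer, month, channel), then sum flags per group.
result = conn.execute("""
    select order_month, traffic_channel,
           sum(case when row_n_month_channel = 1 then 1 else 0 end) as distinct_customers
    from (
        select order_month, traffic_channel,
               row_number() over (partition by customer_id, order_month, traffic_channel)
                   as row_n_month_channel
        from orders_traffic_channels
    )
    group by order_month, traffic_channel
    order by order_month, traffic_channel
""").fetchall()
print(result)  # [('2017-01', 'email', 2), ('2017-01', 'paid', 1)]
```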