Calculate 40-day moving average with respect to a field
It is not really clear from the question what the role of the call_type_id column is, so I will ignore it until you clarify.
Without window functions
Here is a simple variant that doesn't use window functions at all.
Make sure that there is an index on (call_dt_key, aes_raw).
CTE_Dates returns a list of all distinct dates in the table and calculates the average for each day. This average_current_day will be needed for the first day. The server will scan the whole index anyway, so calculating this average is cheap.
Then, for each distinct day, I use a self-join to calculate the average over the 40 previous days. This returns NULL for the first day, which is replaced with average_current_day in the main query.
You don't have to use a CTE here; it just makes the query easier to read.
WITH
CTE_Dates
AS
(
SELECT
call_dt_key
,call_dt_key - INTERVAL '41 day' AS dt_from
,call_dt_key - INTERVAL '1 day' AS dt_to
,AVG(test_aes.aes_raw) AS average_current_day
FROM test_aes
GROUP BY call_dt_key
)
SELECT
CTE_Dates.call_dt_key
,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40
FROM
CTE_Dates
LEFT JOIN LATERAL
(
SELECT AVG(test_aes.aes_raw) AS average_40
FROM test_aes
WHERE
test_aes.call_dt_key >= CTE_Dates.dt_from
AND test_aes.call_dt_key <= CTE_Dates.dt_to
) AS prev40 ON true
ORDER BY call_dt_key;
Result
| call_dt_key | average_40 |
|----------------------------|--------------------|
| January, 01 2016 00:00:00 | 15 |
| January, 05 2016 00:00:00 | 15 |
| January, 10 2016 00:00:00 | 15.333333333333334 |
| January, 15 2016 00:00:00 | 15.294117647058824 |
| January, 16 2016 00:00:00 | 15.5 |
| January, 20 2016 00:00:00 | 15.652173913043478 |
| January, 21 2016 00:00:00 | 15.6 |
| January, 31 2016 00:00:00 | 15.555555555555555 |
| February, 01 2016 00:00:00 | 15.517241379310345 |
| February, 10 2016 00:00:00 | 15.483870967741936 |
| February, 15 2016 00:00:00 | 15.652173913043478 |
| February, 26 2016 00:00:00 | 15.333333333333334 |
| March, 04 2016 00:00:00 | 15 |
| March, 18 2016 00:00:00 | 15 |
Here is SQL Fiddle.
With the recommended index this solution should not be too bad.
There is a similar question, but for SQL Server (Date range rolling sum using window functions). Postgres seems to support RANGE with a window of a specified size, while SQL Server doesn't at this moment, so the solution for Postgres is likely to be a bit simpler.
The key part would be:
AVG(...) OVER (ORDER BY call_dt_key RANGE BETWEEN 41 PRECEDING AND 1 PRECEDING)
To calculate the moving average using these window functions you'd likely have to fill the gaps in the dates first, so that the table has at least one row for each day (with NULL values for aes_raw in these dummy rows).
...
As Erwin Brandstetter correctly pointed out in his answer, at the moment (as of Postgres 9.5) the RANGE clause in Postgres still has limitations similar to SQL Server. The docs say:
The value PRECEDING and value FOLLOWING cases are currently only allowed in ROWS mode.
So, the method with RANGE above would not work for you even if you used Postgres 9.5.
Using window functions
You can use the approaches outlined in the question for SQL Server above. For example, group your data into daily sums, add rows for missing days, calculate the moving SUM and COUNT using OVER with ROWS, and then calculate the moving average.
Something along these lines:
WITH
CTE_Dates
AS
(
SELECT
call_dt_key
,SUM(test_aes.aes_raw) AS sum_daily
,COUNT(*) AS cnt_daily
,AVG(test_aes.aes_raw) AS avg_daily
,LEAD(call_dt_key) OVER(ORDER BY call_dt_key) - INTERVAL '1 day' AS next_date
FROM test_aes
GROUP BY call_dt_key
)
,CTE_AllDates
AS
(
SELECT
CASE WHEN call_dt_key = dt THEN call_dt_key ELSE NULL END AS final_dt
,avg_daily
,SUM(CASE WHEN call_dt_key = dt THEN sum_daily ELSE NULL END)
OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING)
/SUM(CASE WHEN call_dt_key = dt THEN cnt_daily ELSE NULL END)
OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) AS avg_40
FROM
CTE_Dates
INNER JOIN LATERAL
generate_series(call_dt_key, COALESCE(next_date, call_dt_key), '1 day')
AS all_dates(dt) ON true
)
SELECT
final_dt
,COALESCE(avg_40, avg_daily) AS final_avg
FROM CTE_AllDates
WHERE final_dt IS NOT NULL
ORDER BY final_dt;
Result is the same as in the first variant. See SQL Fiddle.
Again, this could be written with inlined sub-queries without CTEs.
It is worth checking the performance of the different variants on real data.
The big bounty makes the currently accepted answer seem exemplary, but I am not entirely happy with several details. Hence, I added this answer.
Table definition
You should have provided an actual table definition to make this easier.
Judging from the sample data, call_dt_tm is type timestamp with time zone (timestamptz).
The column call_dt_key is not completely functionally dependent, since the matching date depends on the time zone. But if you define that (an actual time zone name, not just an offset; beware of DST!), the date can easily and reliably be derived from a timestamptz and should not be stored redundantly. To get it right, use an expression like:
(call_dt_tm AT TIME ZONE 'Asia/Hong_Kong')::date -- use your time zone name
Details:
- Ignoring timezones altogether in Rails and PostgreSQL
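As a minimal sketch of how that expression could be used in practice (the index name is hypothetical, and 'Asia/Hong_Kong' is just the example zone from above), you can either derive the date at query time or index the expression so filters on the derived date can use an index scan:

```sql
-- derive the date from the timestamptz at query time
SELECT (call_dt_tm AT TIME ZONE 'Asia/Hong_Kong')::date AS call_day
FROM   test_aes;

-- optional: an expression index on the derived date
CREATE INDEX test_aes_call_day_idx
ON test_aes (((call_dt_tm AT TIME ZONE 'Asia/Hong_Kong')::date));
```

Note the extra set of parentheses: Postgres requires expressions in an index definition to be enclosed in parentheses.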
You might add a MATERIALIZED VIEW with the derived date column for ease of use ...
For the purpose of this question I'll stick to your given table.
40 days
Question and answer both count 41 days instead of 40 as per the requirement: both the lower and the upper bound are included, resulting in a (rather common) off-by-one error.
Consequently, I get different results in two rows below.
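To illustrate the off-by-one with quick date arithmetic (any sample day works; date - date yields an integer in Postgres): the inclusive range used above spans 41 days, while the intended window spans 40.

```sql
SELECT (DATE '2016-02-15' - 1) - (DATE '2016-02-15' - 41) + 1 AS days_in_range_used  -- 41
     , (DATE '2016-02-15' - 1) - (DATE '2016-02-15' - 40) + 1 AS days_intended;      -- 40
```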
date, interval, timestamp
Subtracting an interval from a date produces a timestamp (like in call_dt_key - INTERVAL '41 day'). For the purpose of this query it is more efficient to subtract an integer, producing another date (like call_dt_key - 41).
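A quick way to verify the result types in psql (pg_typeof reports the type of its argument):

```sql
SELECT pg_typeof(DATE '2016-03-01' - INTERVAL '41 day')  -- timestamp without time zone
     , pg_typeof(DATE '2016-03-01' - 41);                -- date
```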
Not possible with a RANGE clause
@Vladimir suggested (now fixed) a solution with the RANGE clause in the frame definition of window functions in Postgres 9.5.
In fact, nothing has changed between Postgres 9.4 and 9.5 in this respect, not even the text in the manual. Frame definitions of window functions only allow RANGE UNBOUNDED PRECEDING and RANGE UNBOUNDED FOLLOWING, not value offsets.
Answer
Of course, you can use a CTE to compute daily sum / count / avg on the fly. But your table ...
stores the information about user calls in a call center
This kind of information does not change later. So compute daily aggregates once in a materialized view and build on that.
CREATE MATERIALIZED VIEW mv_test_aes AS
SELECT call_dt_key AS day
, sum(aes_raw)::int AS day_sum
, count(*)::int AS day_ct
FROM test_aes
WHERE call_dt_key < (now() AT TIME ZONE 'Asia/Hong_Kong')::date -- see above
GROUP BY call_dt_key
ORDER BY call_dt_key;
The current day is always missing, but that's a feature. Results would be incorrect before the day is over.
The MV needs to be refreshed once per day, before you run your query or the latest day(s) are missing.
An index on the underlying table, like the one recommended in the other answer, is not necessary for this, since the whole table is read anyway:
CREATE INDEX test_aes_day_val ON test_aes (call_dt_key, aes_raw);
You might build a smarter materialized view manually and only incrementally add new days instead of recreating everything with standard MVs. But that's beyond the scope of the question ...
I strongly suggest an index on the MV, though:
CREATE INDEX foo ON mv_test_aes (day, day_sum, day_ct);
I only appended day_sum and day_ct hoping for index-only scans. If you don't see those used in your queries, you don't need these columns in the index.
SELECT t.day
, round(COALESCE(sum(t1.day_sum) * 1.0 / sum(t1.day_ct) -- * 1.0 to avoid int division
, t.day_sum * 1.0 / t.day_ct), 4) AS avg_40days
FROM mv_test_aes t
LEFT JOIN mv_test_aes t1 ON t1.day < t.day
AND t1.day >= t.day - 40 -- not 41
GROUP BY t.day, t.day_sum, t.day_ct
ORDER BY t.day;
Result:
day        | avg_40days
-----------+-----------
2016-01-01 |    15.0000
2016-01-05 |    15.0000
2016-01-10 |    15.3333
2016-01-15 |    15.2941
2016-01-16 |    15.5000
2016-01-20 |    15.6522
2016-01-21 |    15.6000
2016-01-31 |    15.5556
2016-02-01 |    15.5172
2016-02-10 |    15.4839
2016-02-15 |    15.5556  -- correct results
2016-02-26 |    15.0000
2016-03-04 |    15.0000
2016-03-18 |    15.0000
SQL Fiddle.
If you run this often, I would wrap the whole shebang into a MV to avoid repeated computation.
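A minimal sketch of that wrapper (the MV name mv_avg_40days is made up; it simply persists the query above, and must be refreshed after mv_test_aes):

```sql
CREATE MATERIALIZED VIEW mv_avg_40days AS
SELECT t.day
     , round(COALESCE(sum(t1.day_sum) * 1.0 / sum(t1.day_ct)
           , t.day_sum * 1.0 / t.day_ct), 4) AS avg_40days
FROM   mv_test_aes t
LEFT   JOIN mv_test_aes t1 ON t1.day <  t.day
                          AND t1.day >= t.day - 40
GROUP  BY t.day, t.day_sum, t.day_ct
ORDER  BY t.day;

-- once per day, after refreshing mv_test_aes:
REFRESH MATERIALIZED VIEW mv_avg_40days;
```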
A solution with window functions and a frame clause ROWS BETWEEN ... would be possible, too. But your example data suggests that you don't have values for most of the days in the range (many more gaps than islands), so I don't expect it to be faster.
Related:
- Rolling sum / count / average over date interval