Median with GROUP BY
This query could answer your question: median value and group by
SELECT tag, AVG(val) as median
FROM
(
SELECT tag, val,
(SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
seq,
(SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta
FROM (SELECT tag, val, @rownum := @rownum + 1 as seq
FROM (SELECT * FROM median ORDER BY tag, val) t1
ORDER BY tag, seq
) t3 CROSS JOIN (SELECT @rownum := 0) x
HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1)
or (ct%2 <> 0 and seq-delta = (ct+1)/2)
) T
GROUP BY tag
ORDER BY tag;
I tried it on this dataset (mainly from here):
+------+------+
| tag | val |
+------+------+
| 1 | 3 |
| 1 | 13 |
... (see explanation below)
| 3 | 12 |
| 3 | 43 |
| 3 | 15 |
+------+------+
and the result was:
+------+---------+
| tag | median |
+------+---------+
| 1 | 23.0000 |
| 2 | 22.0000 |
| 3 | 15.0000 |
+------+---------+
Explanation
Inner subqueries will be computed first: sequence is (1)(2)(3)(4).
-- (4) compute the average (of 2 lines or 1 line)
SELECT tag, AVG(val) as median
FROM
(
-- (3) get lines to compute the median value
SELECT tag, val,
(SELECT count(*) FROM median t2 -- +number of lines for the current tag value as ct
WHERE t2.tag = t3.tag) as ct,
seq,
(SELECT count(*) FROM median t2 -- +number of lines before the current tag value as delta
WHERE t2.tag < t3.tag) as delta -- to compute the starting line number of a tag
FROM (
-- (2) sort dataset by tag and sequence
SELECT tag, val,
@rownum := @rownum + 1 as seq -- +@rownum enable to create a sequence from 0 by 1
FROM (
-- (1) sort dataset by tag and value
SELECT * FROM median
ORDER BY tag, val) t1
-- (2) continue here
ORDER BY tag, seq
) t3 CROSS JOIN (SELECT @rownum := 0) x -- +use to set @rownum to 0 (no data)
-- (3) continue here
HAVING (ct%2 = 0 -- +when ct is even, select the two lines around the middle
and seq-delta between floor((ct+1)/2)
and floor((ct+1)/2) +1)
or (ct%2 <> 0 -- +when ct is odd, select the one line in the middle
and seq-delta = (floor(ct+1)/2))
) T
-- (4) continue here
GROUP BY tag
ORDER BY tag;
Dataset:
after (1) after (2) processing (3)
+------+------+
| tag | val | ct delta seq seq-delta
+------+------+
| 1 | 3 | 15 0 1 1 ct : odd ct%2 <> 0
| 1 | 5 | 15 0 2 2 floor((ct+1)/2) : 8
| 1 | 7 | 15 0 3 3
| 1 | 12 | 15 0 4 4
| 1 | 13 | 15 0 5 5
| 1 | 14 | 15 0 6 6
| 1 | 21 | 15 0 7 7
| 1 | 23 | 15 0 8 8 ---> keep this line
| 1 | 23 | 15 0 9 9
| 1 | 23 | 15 0 10 10
| 1 | 23 | 15 0 11 11
| 1 | 29 | 15 0 12 12
| 1 | 39 | 15 0 13 13
| 1 | 40 | 15 0 14 14
| 1 | 56 | 15 0 15 15
| 2 | 3 | 14 15 16 1 ct : even (ct%2 = 0 )
| 2 | 5 | 14 15 17 2 floor((ct+1)/2) : 7
| 2 | 7 | 14 15 18 3 floor((ct+1)/2)+1 : 8
| 2 | 12 | 14 15 19 4
| 2 | 13 | 14 15 20 5
| 2 | 14 | 14 15 21 6
| 2 | 21 | 14 15 22 7 ---> keep this line
| 2 | 23 | 14 15 23 8 ---> keep this line
| 2 | 23 | 14 15 24 9
| 2 | 23 | 14 15 25 10
| 2 | 23 | 14 15 26 11
| 2 | 29 | 14 15 27 12
| 2 | 40 | 14 15 28 13
| 2 | 56 | 14 15 29 14
| 3 | 12 | 3 29 30 1 ct : odd ct%2 <> 0
| 3 | 15 | 3 29 31 2 ---> keep floor((ct+1)/2) : 2
| 3 | 43 | 3 29 32 3
+------+------+
Dataset after (3)
+------+------+------+------+-------+
| tag | val | ct | seq | delta |
+------+------+------+------+-------+
| 1 | 23 | 15 | 8 | 0 |
| 2 | 21 | 14 | 22 | 15 |
| 2 | 23 | 14 | 23 | 15 |
| 3 | 15 | 3 | 31 | 29 |
+------+------+------+------+-------+
Outer query will compute the avg(val) group by tag value.
Hope this helps.
But what about median computing when there are null values? See EDIT2 below
Alternative: using a function
DELIMITER //
CREATE FUNCTION median(pTag int)
RETURNS real
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE r real; -- result
SELECT AVG(val) INTO r
FROM
(
SELECT val,
(SELECT count(*) FROM median WHERE tag = pTag) as ct,
seq
FROM (SELECT val, @rownum := @rownum + 1 as seq
FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1
ORDER BY seq
) t3
CROSS JOIN (SELECT @rownum := 0) x
HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
or (ct%2 <> 0 and seq = (ct+1)/2)
) T;
return r;
END//
DELIMITER ;
But the function will be called for each row:
SELECT tag, median(tag) FROM median; -- my test table is 'median' too...
This query will be "better":
select tag, median(tag)
from (select distinct tag from median) t;
That's all I can do! Hope it helps!
EDIT2 : about null values in data (column val in the example)
null values show be omited from the source data using a WHERE clause : WHERE val IS NOT NULL
, in both 2 subqueriés that count lines and the subquery that gets data.
EDIT3 (LAST EDIT) : change the initialisation of the @rownum position
It should by put at the deepest level : so that it declared the soonest in the excution of the query.
DELIMITER //
CREATE FUNCTION median(pTag int)
RETURNS real
READS SQL DATA
DETERMINISTIC
BEGIN
DECLARE r real; -- result
SELECT AVG(val) INTO r
FROM
(
SELECT val,
(SELECT count(*) FROM median WHERE tag = pTag and val is not null) as ct,
seq
FROM (SELECT val, @rownum := @rownum + 1 as seq
FROM (SELECT * FROM median
CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here
WHERE tag = pTag and val is not null ORDER BY val
) t1
ORDER BY seq
) t3
HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
or (ct%2 <> 0 and seq = (ct+1)/2)
) T;
return r;
END//
DELIMITER ;
That is the same for the query.
Test with 2 data sets more :
| 4 | NULL |
| 4 | 10 |
| 4 | 15 |
| 4 | 20 |
| 5 | NULL |
| 5 | NULL |
| 5 | NULL |
+------+------+
39 rows in set (0.00 sec)
+------+--------------+
| tag | median2(tag) |
+------+--------------+
| 1 | 23 |
| 2 | 22 |
| 3 | 15 |
| 4 | 15 |
| 5 | NULL |
+------+--------------+
5 rows in set (0.08 sec)
Best solutions are found here.
Best approaches for grouped median
It says that there is a new PERCENTILE_CONT() function introduced in SQL Server 2012.
SELECT tag, MAX(Median) AS Median
FROM
(
SELECT tag,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) OVER (PARTITION BY tag) AS Median
FROM t1
)
AS x
GROUP BY tag;