Median with GROUP BY

This query could answer your question: median value and group by

            SELECT tag, AVG(val) as median
            FROM 
            (
              SELECT tag, val,
                  (SELECT count(*) FROM median t2 WHERE t2.tag = t3.tag) as ct,
                  seq,
                  (SELECT count(*) FROM median t2 WHERE t2.tag < t3.tag) as delta
                FROM (SELECT tag, val, @rownum := @rownum + 1 as seq
                      FROM (SELECT * FROM median ORDER BY tag, val) t1 
                      ORDER BY tag, seq
                    ) t3 CROSS JOIN (SELECT @rownum := 0) x
                HAVING (ct%2 = 0 and seq-delta between floor((ct+1)/2) and floor((ct+1)/2) +1)
                  or (ct%2 <> 0 and seq-delta = (ct+1)/2)
            ) T
            GROUP BY tag
            ORDER BY tag;

I tried it on this dataset (mainly from here):

Click to copy

            +------+------+
            | tag  | val  |
            +------+------+
            |    1 |    3 |
            |    1 |   13 |

... (see explanation below)

Click to copy

            |    3 |   12 |
            |    3 |   43 |
            |    3 |   15 |
            +------+------+

and the result was:

Click to copy

            +------+---------+
            | tag  | median  |
            +------+---------+
            |    1 | 23.0000 |
            |    2 | 22.0000 |
            |    3 | 15.0000 |
            +------+---------+

Explanation

Inner subqueries will be computed first: sequence is (1)(2)(3)(4).

-- (4) compute the average (of 2 lines or 1 line)

Click to copy

    SELECT tag, AVG(val) as median                          
      FROM 
      (

-- (3) get lines to compute the median value

Click to copy

        SELECT tag, val,                                       
           (SELECT count(*) FROM median t2                    -- +number of lines for the current tag value as ct
              WHERE t2.tag = t3.tag) as ct,
           seq,
           (SELECT count(*) FROM median t2                    -- +number of lines before the current tag value as delta
              WHERE t2.tag < t3.tag) as delta                --     to compute the starting line number of a tag
         FROM (

-- (2) sort dataset by tag and sequence

Click to copy

                SELECT tag, val,                            
                    @rownum := @rownum + 1 as seq       -- +@rownum enable to create a sequence from 0 by 1
              FROM (

-- (1) sort dataset by tag and value

Click to copy

                    SELECT * FROM median           
                    ORDER BY tag, val) t1

-- (2) continue here

Click to copy

              ORDER BY tag, seq
            ) t3 CROSS JOIN (SELECT @rownum := 0) x            -- +use to set @rownum to 0 (no data)

-- (3) continue here

Click to copy

         HAVING (ct%2 = 0                                      -- +when ct is even, select the two lines around the middle
                  and seq-delta between floor((ct+1)/2) 
                                and floor((ct+1)/2) +1)
           or (ct%2 <> 0                                       -- +when ct is odd, select the one line in the middle
                  and seq-delta = (floor(ct+1)/2))
      ) T

-- (4) continue here

Click to copy

      GROUP BY tag
      ORDER BY tag;

Dataset:

Click to copy

        after (1)     after (2)           processing (3)   
    +------+------+                   
    | tag  | val  |  ct  delta  seq       seq-delta
    +------+------+                   
    |    1 |    3 |  15    0     1        1         ct : odd ct%2 <> 0  
    |    1 |    5 |  15    0     2        2         floor((ct+1)/2) : 8
    |    1 |    7 |  15    0     3        3         
    |    1 |   12 |  15    0     4        4         
    |    1 |   13 |  15    0     5        5
    |    1 |   14 |  15    0     6        6
    |    1 |   21 |  15    0     7        7
    |    1 |   23 |  15    0     8        8 ---> keep this line
    |    1 |   23 |  15    0     9        9 
    |    1 |   23 |  15    0     10       10
    |    1 |   23 |  15    0     11       11
    |    1 |   29 |  15    0     12       12
    |    1 |   39 |  15    0     13       13
    |    1 |   40 |  15    0     14       14
    |    1 |   56 |  15    0     15       15

    |    2 |    3 |  14    15    16        1         ct : even (ct%2 = 0  )
    |    2 |    5 |  14    15    17        2         floor((ct+1)/2) : 7
    |    2 |    7 |  14    15    18        3         floor((ct+1)/2)+1 : 8
    |    2 |   12 |  14    15    19        4
    |    2 |   13 |  14    15    20        5
    |    2 |   14 |  14    15    21        6
    |    2 |   21 |  14    15    22        7 ---> keep this line
    |    2 |   23 |  14    15    23        8 ---> keep this line
    |    2 |   23 |  14    15    24        9
    |    2 |   23 |  14    15    25        10
    |    2 |   23 |  14    15    26        11
    |    2 |   29 |  14    15    27        12
    |    2 |   40 |  14    15    28        13
    |    2 |   56 |  14    15    29        14

    |    3 |   12 |  3     29    30        1                  ct : odd ct%2 <> 0 
    |    3 |   15 |  3     29    31        2 ---> keep        floor((ct+1)/2) : 2
    |    3 |   43 |  3     29    32        3
    +------+------+

Dataset after (3)

Click to copy

    +------+------+------+------+-------+
    | tag  | val  | ct   | seq  | delta |
    +------+------+------+------+-------+
    |    1 |   23 |   15 |    8 |     0 |
    |    2 |   21 |   14 |   22 |    15 |
    |    2 |   23 |   14 |   23 |    15 |
    |    3 |   15 |    3 |   31 |    29 |
    +------+------+------+------+-------+

Outer query will compute the avg(val) group by tag value.

Hope this helps.

But what about median computing when there are null values? See EDIT2 below

Alternative: using a function

Click to copy

    DELIMITER //
    CREATE FUNCTION median(pTag int)
        RETURNS real
           READS SQL DATA
           DETERMINISTIC
           BEGIN
              DECLARE r real; -- result
    SELECT AVG(val) INTO r
    FROM 
    (
      SELECT val,
           (SELECT count(*) FROM median WHERE tag = pTag) as ct,
           seq
        FROM (SELECT val, @rownum := @rownum + 1 as seq
              FROM (SELECT * FROM median WHERE tag = pTag ORDER BY val ) t1 
              ORDER BY seq
            ) t3 
            CROSS JOIN (SELECT @rownum := 0) x
        HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
          or (ct%2 <> 0 and seq = (ct+1)/2)
    ) T;
    return r;
    END//
    DELIMITER ;

But the function will be called for each row:

Click to copy

SELECT tag, median(tag) FROM median; -- my test table is 'median' too...

This query will be "better":

Click to copy

select tag, median(tag) 
  from (select distinct tag from median) t;

That's all I can do! Hope it helps!

EDIT2 : about null values in data (column val in the example)

null values show be omited from the source data using a WHERE clause : WHERE val IS NOT NULL, in both 2 subqueriés that count lines and the subquery that gets data.

EDIT3 (LAST EDIT) : change the initialisation of the @rownum position

It should by put at the deepest level : so that it declared the soonest in the excution of the query.

Click to copy

DELIMITER //
CREATE FUNCTION median(pTag int)
    RETURNS real
       READS SQL DATA
       DETERMINISTIC
       BEGIN
          DECLARE r real; -- result
SELECT AVG(val) INTO r
FROM 
(
  SELECT val,
       (SELECT count(*) FROM median WHERE tag = pTag and val is not null)     as ct,
       seq
    FROM (SELECT val, @rownum := @rownum + 1 as seq
          FROM (SELECT * FROM median 
                       CROSS JOIN (SELECT @rownum := 0) x -- INIT @rownum here
             WHERE tag = pTag and val is not null ORDER BY val 
          ) t1 
          ORDER BY seq
       ) t3     
    HAVING (ct%2 = 0 and seq between floor((ct+1)/2) and floor((ct+1)/2) +1)
      or (ct%2 <> 0 and seq = (ct+1)/2)
) T;
return r;
END//
DELIMITER ;

That is the same for the query.

Test with 2 data sets more :

Click to copy

|    4 | NULL |
|    4 |   10 |
|    4 |   15 |
|    4 |   20 |
|    5 | NULL |
|    5 | NULL |
|    5 | NULL |
+------+------+

39 rows in set (0.00 sec)

Click to copy

+------+--------------+
| tag  | median2(tag) |
+------+--------------+
|    1 |           23 |
|    2 |           22 |
|    3 |           15 |
|    4 |           15 |
|    5 |         NULL |
+------+--------------+
5 rows in set (0.08 sec)

Best solutions are found here.

Best approaches for grouped median

It says that there is a new PERCENTILE_CONT() function introduced in SQL Server 2012.

Click to copy

SELECT tag, MAX(Median) AS Median
FROM
(
   SELECT tag,
   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) OVER (PARTITION BY tag) AS Median
   FROM t1
) 
AS x
GROUP BY tag;

Median with GROUP BY

Explanation

Alternative: using a function

EDIT2 : about null values in data (column val in the example)

EDIT3 (LAST EDIT) : change the initialisation of the @rownum position

Tags:

Mysql

Group By

Aggregate

Related

Recent Posts