SELECT DISTINCT on multiple columns

Update: Tested all 5 queries in SQLfiddle with 100K rows (and 2 separate cases, one with few (25) distinct values and another with lots (around 25K values).

A very simple query would be to use UNION DISTINCT. ~~I think it would be most efficient if there is a separate index on each of the four columns~~ It would be efficient with a separate index on each of the four columns, if Postgres had implemented Loose Index Scan optimization, which it hasn't. So this query will not be efficient as it requires 4 scans of the table (and no index is used):

Click to copy

-- Query 1. (334 ms, 368ms) 
SELECT a AS abcd FROM tablename 
UNION                           -- means UNION DISTINCT
SELECT b FROM tablename 
UNION 
SELECT c FROM tablename 
UNION 
SELECT d FROM tablename ;

Another would be to first UNION ALL and then use DISTINCT. This will also require 4 table scans (and no use of indexes). Not bad efficiency when the values are few, and with more values becomes the fastest in my (not extensive) test:

Click to copy

-- Query 2. (87 ms, 117 ms)
SELECT DISTINCT a AS abcd
FROM
  ( SELECT a FROM tablename 
    UNION ALL 
    SELECT b FROM tablename 
    UNION ALL
    SELECT c FROM tablename 
    UNION ALL
    SELECT d FROM tablename 
  ) AS x ;

The other answers have provided with more options using array functions or the LATERAL syntax. Jack's query (187 ms, 261 ms) has reasonable performance but AndriyM's query seems more efficient (125 ms, 155 ms). Both of them do one sequential scan of the table and do not use any index.

Actually Jack's query results are a bit better than shown above (if we remove the order by) and can be further improved by removing the 4 internal distinct and leaving only the external one.

Finally, if - and only if - the distinct values of the 4 columns are relatively few, you can use the WITH RECURSIVE hack/optimization described in the above Loose Index Scan page and use all 4 indexes, with remarkably fast result! Tested with the same 100K rows and approximately 25 distinct values spread across the 4 columns (runs in only 2 ms!) while with 25K distinct values it's the slowest with 368 ms:

Click to copy

-- Query 3.  (2 ms, 368ms)
WITH RECURSIVE 
    da AS (
       SELECT min(a) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(a) FROM observations
               WHERE  a > s.n)
       FROM   da AS s  WHERE s.n IS NOT NULL  ),
    db AS (
       SELECT min(b) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(b) FROM observations
               WHERE  b > s.n)
       FROM   db AS s  WHERE s.n IS NOT NULL  ),
   dc AS (
       SELECT min(c) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(c) FROM observations
               WHERE  c > s.n)
       FROM   dc AS s  WHERE s.n IS NOT NULL  ),
   dd AS (
       SELECT min(d) AS n  FROM observations
       UNION ALL
       SELECT (SELECT min(d) FROM observations
               WHERE  d > s.n)
       FROM   db AS s  WHERE s.n IS NOT NULL  )
SELECT n 
FROM 
( TABLE da  UNION 
  TABLE db  UNION 
  TABLE dc  UNION 
  TABLE dd
) AS x 
WHERE n IS NOT NULL ;

SQLfiddle

To summarize, when the distinct values are few, the recursive query is the absolute winner while with lots of values, my 2nd one, Jack's (improved version below) and AndriyM's queries are the best performers.

Late additions, a variation on the 1st query which despite the extra distinct operations, performs much better than the original 1st and only slightly worse than the 2nd:

Click to copy

-- Query 1b.  (85 ms, 149 ms)
SELECT DISTINCT a AS n FROM observations 
UNION 
SELECT DISTINCT b FROM observations 
UNION 
SELECT DISTINCT c FROM observations 
UNION 
SELECT DISTINCT d FROM observations ;

and Jack's improved:

Click to copy

-- Query 4b.  (104 ms, 128 ms)
select distinct unnest( array_agg(a)||
                        array_agg(b)||
                        array_agg(c)||
                        array_agg(d) )
from t ;

You could use LATERAL, like in this query:

Click to copy

SELECT DISTINCT
  x.n
FROM
  atable
  CROSS JOIN LATERAL (
    VALUES (a), (b), (c), (d)
  ) AS x (n)
;

The LATERAL keyword allows the right side of the join to reference objects from the left side. In this case, the right side is a VALUES constructor that builds a single-column subset out of the column values you want to put into a single column. The main query simply references the new column, also applying DISTINCT to it.

To be clear, I'd use union as ypercube suggests, but it is also possible with arrays:

Click to copy

select distinct unnest( array_agg(distinct a)||
                        array_agg(distinct b)||
                        array_agg(distinct c)||
                        array_agg(distinct d) )
from t
order by 1;

Click to copy

| unnest |
| :----- |
| 0      |
| 1      |
| 2      |
| 3      |
| 5      |
| 6      |
| 8      |
| 9      |

dbfiddle here

SELECT DISTINCT on multiple columns

Tags:

Performance

Postgresql

Distinct

Postgresql Performance

Postgresql 9.4

Related

Recent Posts