SELECT DISTINCT on multiple columns
Update: Tested all 5 queries in SQLfiddle with 100K rows (and 2 separate cases, one with few (25) distinct values and another with lots (around 25K values).
A very simple query would be to use UNION DISTINCT
. I think it would be most efficient if there is a separate index on each of the four columns It would be efficient with a separate index on each of the four columns, if Postgres had implemented Loose Index Scan optimization, which it hasn't. So this query will not be efficient as it requires 4 scans of the table (and no index is used):
-- Query 1. (334 ms, 368ms)
SELECT a AS abcd FROM tablename
UNION -- means UNION DISTINCT
SELECT b FROM tablename
UNION
SELECT c FROM tablename
UNION
SELECT d FROM tablename ;
Another would be to first UNION ALL
and then use DISTINCT
. This will also require 4 table scans (and no use of indexes). Not bad efficiency when the values are few, and with more values becomes the fastest in my (not extensive) test:
-- Query 2. (87 ms, 117 ms)
SELECT DISTINCT a AS abcd
FROM
( SELECT a FROM tablename
UNION ALL
SELECT b FROM tablename
UNION ALL
SELECT c FROM tablename
UNION ALL
SELECT d FROM tablename
) AS x ;
The other answers have provided with more options using array functions or the LATERAL
syntax. Jack's query (187 ms, 261 ms
) has reasonable performance but AndriyM's query seems more efficient (125 ms, 155 ms
). Both of them do one sequential scan of the table and do not use any index.
Actually Jack's query results are a bit better than shown above (if we remove the order by
) and can be further improved by removing the 4 internal distinct
and leaving only the external one.
Finally, if - and only if - the distinct values of the 4 columns are relatively few, you can use the WITH RECURSIVE
hack/optimization described in the above Loose Index Scan page and use all 4 indexes, with remarkably fast result! Tested with the same 100K rows and approximately 25 distinct values spread across the 4 columns (runs in only 2 ms!) while with 25K distinct values it's the slowest with 368 ms:
-- Query 3. (2 ms, 368ms)
WITH RECURSIVE
da AS (
SELECT min(a) AS n FROM observations
UNION ALL
SELECT (SELECT min(a) FROM observations
WHERE a > s.n)
FROM da AS s WHERE s.n IS NOT NULL ),
db AS (
SELECT min(b) AS n FROM observations
UNION ALL
SELECT (SELECT min(b) FROM observations
WHERE b > s.n)
FROM db AS s WHERE s.n IS NOT NULL ),
dc AS (
SELECT min(c) AS n FROM observations
UNION ALL
SELECT (SELECT min(c) FROM observations
WHERE c > s.n)
FROM dc AS s WHERE s.n IS NOT NULL ),
dd AS (
SELECT min(d) AS n FROM observations
UNION ALL
SELECT (SELECT min(d) FROM observations
WHERE d > s.n)
FROM db AS s WHERE s.n IS NOT NULL )
SELECT n
FROM
( TABLE da UNION
TABLE db UNION
TABLE dc UNION
TABLE dd
) AS x
WHERE n IS NOT NULL ;
SQLfiddle
To summarize, when the distinct values are few, the recursive query is the absolute winner while with lots of values, my 2nd one, Jack's (improved version below) and AndriyM's queries are the best performers.
Late additions, a variation on the 1st query which despite the extra distinct operations, performs much better than the original 1st and only slightly worse than the 2nd:
-- Query 1b. (85 ms, 149 ms)
SELECT DISTINCT a AS n FROM observations
UNION
SELECT DISTINCT b FROM observations
UNION
SELECT DISTINCT c FROM observations
UNION
SELECT DISTINCT d FROM observations ;
and Jack's improved:
-- Query 4b. (104 ms, 128 ms)
select distinct unnest( array_agg(a)||
array_agg(b)||
array_agg(c)||
array_agg(d) )
from t ;
You could use LATERAL, like in this query:
SELECT DISTINCT
x.n
FROM
atable
CROSS JOIN LATERAL (
VALUES (a), (b), (c), (d)
) AS x (n)
;
The LATERAL keyword allows the right side of the join to reference objects from the left side. In this case, the right side is a VALUES constructor that builds a single-column subset out of the column values you want to put into a single column. The main query simply references the new column, also applying DISTINCT to it.
To be clear, I'd use union
as ypercube suggests, but it is also possible with arrays:
select distinct unnest( array_agg(distinct a)|| array_agg(distinct b)|| array_agg(distinct c)|| array_agg(distinct d) ) from t order by 1;
| unnest | | :----- | | 0 | | 1 | | 2 | | 3 | | 5 | | 6 | | 8 | | 9 |
dbfiddle here