Efficient merging (removing duplicates) of arrays

Correct results?

First off: correctness. You want to produce an array of unique elements? Your current query does not do that. The function uniq() from the intarray module only promises to:

remove adjacent duplicates

As instructed in the manual, you need to sort the array first:

SELECT l.d + r.d, uniq(sort(array_agg_mult(r.arr)))
FROM   ...

This also gives you sorted arrays, assuming you want that; you did not clarify.

I see you have sort() in your fiddle, so this may just be a typo in your question.
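To see the difference between the two calls, here is a minimal sketch with a made-up sample array (the values are purely illustrative). It assumes the intarray extension can be installed in your database:

```sql
-- Assumption: you are allowed to install the intarray extension.
CREATE EXTENSION IF NOT EXISTS intarray;

SELECT uniq('{1,2,2,3,1,2}'::int[])       AS adjacent_only  -- {1,2,3,1,2}
     , uniq(sort('{1,2,2,3,1,2}'::int[])) AS fully_unique;  -- {1,2,3}
```

Without sort(), the trailing 1 and 2 survive because they are not adjacent to their earlier copies.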

Postgres 9.5 or later

Either way, since Postgres 9.5, array_agg() has the capabilities of my array_agg_mult() built in, and it is much faster, too:

Selecting data into a Postgres array
Is there something like a zip() function in PostgreSQL that combines two arrays?

There have also been other performance improvements for array handling.
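A quick sketch of the built-in behavior, using made-up values. Note that array_agg() over array input returns an array with one more dimension, so all input arrays must have matching dimensions:

```sql
-- Requires Postgres 9.5 or later; the sample values are hypothetical.
SELECT array_agg(arr ORDER BY arr) AS arr_2d   -- {{1,2},{3,4}}
FROM  (VALUES ('{1,2}'::int[]), ('{3,4}')) t(arr);
```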

Query

The main purpose of array_agg_mult() is to aggregate multi-dimensional arrays, but you only produce 1-dimensional arrays anyway. So I would at least try this alternative query:

SELECT l.d + r.d AS d_sum, array_agg(DISTINCT elem) AS result_arr
FROM   left2  l
JOIN   right2 r USING (t1)
     , unnest(r.arr) elem
GROUP  BY 1
ORDER  BY 1;

Which also addresses your question:

Can the aggregate function remove duplicates directly?

Yes, it can, with DISTINCT. But for integer arrays that's not faster than uniq(), which is optimized for that case, while DISTINCT is generic for all qualifying data types.

The query above doesn't require the intarray module. However, the result is not necessarily sorted. Postgres uses varying algorithms for DISTINCT. Big sets are typically hashed, which leaves the result unsorted unless you add an explicit ORDER BY. If you need sorted arrays, you could add ORDER BY to the aggregate function directly:

array_agg(DISTINCT elem ORDER BY elem)

But that's typically slower than feeding pre-sorted data to array_agg() (one big sort versus many small sorts). So I would sort in a subquery and then aggregate:

SELECT d_sum, uniq(array_agg(elem)) AS result_arr
FROM  (
   SELECT l.d + r.d AS d_sum, elem
   FROM   left2  l
   JOIN   right2 r USING (t1)
        , unnest(r.arr) elem
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;

This was the fastest variant in my cursory test on Postgres 9.4.

SQL Fiddle based on the one you provided.
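If the fiddle is not at hand, the subquery variant can be tried against a small self-contained setup. This is only a sketch: the table and column names follow the queries above, but the column types and the sample rows are assumptions made for illustration.

```sql
-- Hypothetical sample tables; the data types are an assumption.
CREATE EXTENSION IF NOT EXISTS intarray;  -- needed for uniq()

CREATE TEMP TABLE left2  (t1 int, d int);
CREATE TEMP TABLE right2 (t1 int, d int, arr int[]);

INSERT INTO left2  VALUES (1, 10), (2, 20);
INSERT INTO right2 VALUES (1, 1, '{1,2,3}')
                        , (1, 1, '{2,3,4}')
                        , (2, 5, '{7,7,8}');

-- Sort once in the subquery, then aggregate and drop adjacent duplicates.
SELECT d_sum, uniq(array_agg(elem)) AS result_arr
FROM  (
   SELECT l.d + r.d AS d_sum, elem
   FROM   left2  l
   JOIN   right2 r USING (t1)
        , unnest(r.arr) elem
   ORDER  BY 1, 2
   ) sub
GROUP  BY 1
ORDER  BY 1;

-- Expected result with the sample rows above:
--  d_sum | result_arr
--     11 | {1,2,3,4}
--     25 | {7,8}
```

Because uniq() only removes adjacent duplicates, this variant depends on the subquery delivering elements in sorted order to array_agg().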

Index

I don't see much potential for any index here. The only option would be:

CREATE INDEX ON right2 (t1, arr);

This only makes sense if you get index-only scans out of it, which will happen if the underlying table right2 is substantially wider than just these two columns and your setup qualifies for index-only scans. Details in the Postgres Wiki.
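One way to check whether the index is actually used that way is to inspect the query plan. A rough sketch, assuming t1 is an integer column; the filter value is hypothetical:

```sql
-- Index-only scans rely on the visibility map, so vacuum first.
VACUUM ANALYZE right2;

-- Look for "Index Only Scan" in the output.
EXPLAIN
SELECT t1, arr
FROM   right2
WHERE  t1 = 1;  -- hypothetical filter value
```

On a very small table the planner will likely prefer a sequential scan regardless, so test with realistic data volumes.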
