Optimizing a Postgres query with a large IN
Instead of using a huge `IN`-list, join on a `VALUES` expression, or, if the list is large enough, use a temp table, index it, then join on it.
It would be nice if PostgreSQL could do this internally and automatically, but at this point the planner doesn't know how.
Similar topics:
- https://stackoverflow.com/q/24647503/398670
- https://stackoverflow.com/q/17813492/398670
There are actually two different variants of the `IN` construct in Postgres. One works with a subquery expression (returning a set), the other with a list of values, which is just shorthand for

```sql
expression = value1
OR expression = value2
OR ...
```

You are using the second form, which is fine for a short list but much slower for long lists. Provide your list of values as a subquery expression instead. I was recently made aware of this variant:
```sql
WHERE "Post"."userId" IN (VALUES (201486), (1825186), (998608), ... )
```
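The same list of values can also be joined explicitly, as suggested at the top; a minimal sketch, assuming the same table and column names:

```sql
SELECT "Post".*
FROM  (VALUES (201486), (1825186), (998608)) AS v("userId")
JOIN  "Posts" "Post" USING ("userId");
```

Aliasing the `VALUES` expression with a column name lets the planner treat it like any other joined relation.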
I like to pass an array, unnest and join to it. Similar performance, but the syntax is shorter:
```sql
...
FROM   unnest('{201486,1825186,998608, ...}'::int[]) "userId"
JOIN   "Posts" "Post" USING ("userId")
```
This is equivalent as long as there are no duplicates in the provided set / array. Otherwise the second form with a `JOIN` returns duplicate rows, while the first form with `IN` only returns a single instance. This subtle difference causes different query plans, too.
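To illustrate the difference, here is a hypothetical minimal setup with a duplicate value in the list:

```sql
-- hypothetical minimal setup
CREATE TEMP TABLE posts (user_id int);
INSERT INTO posts VALUES (1), (2);

-- IN is an existence test: the row for user_id = 1 is returned once
SELECT * FROM posts WHERE user_id IN (VALUES (1), (1));

-- JOIN multiplies matches: the row for user_id = 1 is returned twice
SELECT p.*
FROM   unnest('{1,1}'::int[]) AS u(user_id)
JOIN   posts p USING (user_id);
```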
Obviously, you need an index on `"Posts"."userId"`.
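If such an index does not exist yet, it can be created like this (the index name is an assumption):

```sql
CREATE INDEX posts_userid_idx ON "Posts" ("userId");
```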
For very long lists (thousands), go with an indexed temp table like @Craig suggested. This allows combined bitmap index scans over both tables, which is typically faster as soon as there are multiple tuples per data page to fetch from disk.
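The temp-table approach mentioned at the top can be sketched like this; the table name is an assumption, and the `PRIMARY KEY` clause provides the index:

```sql
CREATE TEMP TABLE target_users (user_id int PRIMARY KEY);

-- fill it, e.g. with COPY or a batched INSERT
INSERT INTO target_users
SELECT unnest('{201486,1825186,998608}'::int[]);

ANALYZE target_users;  -- give the planner statistics on the temp table

SELECT "Post".*
FROM   target_users t
JOIN   "Posts" "Post" ON "Post"."userId" = t.user_id;
```

Running `ANALYZE` on the temp table matters: temp tables are not covered by autovacuum, so without it the planner may misestimate row counts.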
Related:
- How to use ANY instead of IN in a WHERE clause with Rails?
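The `= ANY` variant from that related answer takes the same array directly in the `WHERE` clause and, like `IN`, never returns duplicate rows; a sketch with the same values:

```sql
SELECT *
FROM   "Posts"
WHERE  "userId" = ANY('{201486,1825186,998608}'::int[]);
```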
Aside: your naming convention is not very helpful; it makes your code verbose and hard to read. Use legal, lower-case, unquoted identifiers instead.