Postgres is performing sequential scan instead of index scan
This is a known issue regarding Postgres optimization. If there are only a few distinct values - as in your case - and you are on version 8.4 or later, a very fast workaround using a recursive query is described here: Loose Indexscan.
Your query could be rewritten like this (LATERAL requires version 9.3+):
WITH RECURSIVE pa AS
(  ( SELECT labelDate FROM pages ORDER BY labelDate LIMIT 1 )
   UNION ALL
   SELECT n.labelDate
   FROM   pa AS p
        , LATERAL
          ( SELECT labelDate
            FROM   pages
            WHERE  labelDate > p.labelDate
            ORDER  BY labelDate
            LIMIT  1
          ) AS n
)
SELECT labelDate
FROM   pa ;
Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user
The best query very much depends on data distribution.
You have many rows per date; that's been established. Since your case boils down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(For more distinct values the case would get more interesting.)
There is no need to involve pageid
at all (like you commented).
Index
All you need is a simple btree index on "labelDate".
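For instance, a minimal sketch (the index name here is my own choice):
-- plain btree index; btree is the default access method, so it need not be spelled out
CREATE INDEX pages_labeldate_idx ON pages ("labelDate");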
With more than a few NULL values in the column, a partial index helps some more (and is smaller):
CREATE INDEX pages_labeldate_nonull_idx ON pages ("labelDate")
WHERE "labelDate" IS NOT NULL;
You later clarified:
0% NULL but only after fixing things up when importing.
The partial index may still make sense to rule out intermediate states of rows with NULL values. It would also avoid needless updates to the index (with the resulting bloat).
Query
Based on a provisional range
If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date
to our advantage. There's only a finite, countable number of values between two given values. If the gaps are few, this will be fastest:
SELECT d."labelDate"
FROM  (
   SELECT generate_series(min("labelDate")::timestamp
                        , max("labelDate")::timestamp
                        , interval '1 day')::date AS "labelDate"
   FROM   pages
   ) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Why the cast to timestamp in generate_series()? See:
- Generating time series between two dates in PostgreSQL
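In short: without the cast, the date arguments are promoted to timestamptz (time-zone dependent and a bit slower), while the cast keeps the series in plain timestamp. A quick sketch to see the difference (the example dates are arbitrary):
-- date arguments are resolved to the timestamptz variant of generate_series():
SELECT pg_typeof(g) FROM generate_series(date '2011-01-01', date '2011-01-03', interval '1 day') g;
-- returns: timestamp with time zone
-- casting to timestamp picks the time-zone independent variant:
SELECT pg_typeof(g) FROM generate_series(timestamp '2011-01-01', timestamp '2011-01-03', interval '1 day') g;
-- returns: timestamp without time zone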
Min and max can be picked from the index cheaply. If you know the minimum and/or maximum possible date, it gets a bit cheaper still. Example:
SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM   generate_series(0, now()::date - date '2011-01-01' - 1) g) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Or, for an immutable interval:
SELECT d."labelDate"
FROM  (SELECT date '2011-01-01' + g AS "labelDate"
       FROM   generate_series(0, 363) g) d
WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
Loose index scan
This performs very well with any distribution of dates (as long as we have many rows per date). Basically what @ypercube already provided. But there are some fine points and we need to make sure our favorite index can be used everywhere.
WITH RECURSIVE p AS (
   ( -- parentheses required for LIMIT
   SELECT "labelDate"
   FROM   pages
   WHERE  "labelDate" IS NOT NULL
   ORDER  BY "labelDate"
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT "labelDate"
           FROM   pages
           WHERE  "labelDate" > p."labelDate"
           ORDER  BY "labelDate"
           LIMIT  1)
   FROM   p
   WHERE  "labelDate" IS NOT NULL
   )
SELECT "labelDate"
FROM   p
WHERE  "labelDate" IS NOT NULL;
The non-recursive term of the CTE p is effectively the same as
SELECT min("labelDate") FROM pages
but the verbose form makes sure our partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, correlated subqueries in the recursive term of the rCTE should be a bit faster. This requires excluding rows that result in NULL for "labelDate". See:
- Optimize GROUP BY query to retrieve latest record per user
Asides
Unquoted, legal, lower case identifiers make your life easier.
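A quick illustration of what quoting a mixed-case identifier costs you (the table names here are made up):
-- quoted, mixed case: every later reference must repeat the exact quoting
CREATE TABLE quoting_demo ("labelDate" date);
SELECT "labelDate" FROM quoting_demo;   -- works
-- SELECT labelDate FROM quoting_demo;  -- ERROR: column "labeldate" does not exist
-- unquoted, legal, lower case: any spelling works, since it is folded to lower case
CREATE TABLE quoting_demo2 (label_date date);
SELECT Label_Date FROM quoting_demo2;   -- works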
Order columns in your table definition favorably to save some disk space:
- Calculating and saving space in PostgreSQL
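A minimal sketch of the idea (hypothetical tables, not from your schema): on typical 64-bit builds, 8-byte columns need 8-byte alignment, so interleaving them with 4-byte columns wastes padding bytes.
-- 4 + (4 padding) + 8 + 4 + (4 padding) + 8 = 32 bytes of column data per row
CREATE TABLE pad_demo_bad  (a int, b bigint, c int, d bigint);
-- 8 + 8 + 4 + 4 = 24 bytes of column data per row
CREATE TABLE pad_demo_good (b bigint, d bigint, a int, c int);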