How to speed up "select count(*)" with "group by" and "where"?

The first thing to try is a composite index on the columns in the GROUP BY clause. A query such as this can potentially be answered using only the index data, avoiding the need to scan the table at all. Since the records in the index are sorted, the DBMS should not need to perform a separate sort as part of the group processing. However, the index will slow down updates to the table, so be cautious if your table experiences heavy updates.
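For example, using the table and column names that come up later in this thread (the 'owns' literal is just an illustration):

CREATE INDEX ix_grouping ON relations (relation_title, object_title);

-- The composite index covers both the WHERE and the GROUP BY columns,
-- so the grouped count can be answered from the index alone, in index order.
SELECT object_title, COUNT(*)
FROM relations
WHERE relation_title = 'owns'
GROUP BY object_title;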

If you use InnoDB for the table storage, the table's rows will be physically clustered by the primary key index. If that (or a leading portion of it) happens to match your GROUP BY key, that should speed up a query such as this because related records will be retrieved together. Again, this avoids having to perform a separate sort.
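As a rough sketch (a hypothetical table definition, not necessarily your schema), a primary key whose leading columns match the GROUP BY lets InnoDB read the groups straight out of the clustered index:

-- Rows are physically ordered by the primary key, so grouping on its
-- leading column needs no separate sort.
CREATE TABLE relations (
    relation_title VARCHAR(64) NOT NULL,
    object_title   VARCHAR(64) NOT NULL,
    id             INT NOT NULL,
    PRIMARY KEY (relation_title, object_title, id)
) ENGINE=InnoDB;

SELECT relation_title, COUNT(*)
FROM relations
GROUP BY relation_title;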

In general, bitmap indexes would be another effective alternative, but MySQL does not currently support these, as far as I know.

A materialized view would be another possible approach, but again this is not supported directly in MySQL. However, if you did not require the COUNT statistics to be completely up-to-date, you could periodically run a CREATE TABLE ... AS SELECT ... statement to manually cache the results. This is a bit ugly as it is not transparent, but may be acceptable in your case.
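A minimal sketch of that periodic cache, with illustrative table names (you would run this from cron or a scheduled event):

-- Rebuild the cached counts, then swap the new table into place.
DROP TABLE IF EXISTS relation_counts_new;
CREATE TABLE relation_counts_new AS
SELECT relation_title, object_title, COUNT(*) AS cnt
FROM relations
GROUP BY relation_title, object_title;

DROP TABLE IF EXISTS relation_counts;
RENAME TABLE relation_counts_new TO relation_counts;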

You could also maintain a logical-level cache table using triggers. This table would have a column for each column in your GROUP BY clause, with a count column storing the number of rows for that particular grouping key value. Every time a row is added to, updated in, or deleted from the base table, insert or increment/decrement the counter row in the summary table for that grouping key. This may be better than the fake materialized view approach, as the cached summary will always be up-to-date, and each update is done incrementally and should have less of a resource impact. I think you would have to watch out for lock contention on the cache table, however.
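Here is a rough sketch of such a trigger-maintained summary table (names are illustrative; if the grouping columns can change you would also need an UPDATE trigger):

-- Summary table keyed on the grouping columns.
CREATE TABLE relation_counts (
    relation_title VARCHAR(64) NOT NULL,
    object_title   VARCHAR(64) NOT NULL,
    cnt            INT NOT NULL DEFAULT 0,
    PRIMARY KEY (relation_title, object_title)
);

DELIMITER //
CREATE TRIGGER relations_ai AFTER INSERT ON relations
FOR EACH ROW
BEGIN
    -- Create the counter row if it is missing, otherwise bump it.
    INSERT INTO relation_counts (relation_title, object_title, cnt)
    VALUES (NEW.relation_title, NEW.object_title, 1)
    ON DUPLICATE KEY UPDATE cnt = cnt + 1;
END//

CREATE TRIGGER relations_ad AFTER DELETE ON relations
FOR EACH ROW
BEGIN
    UPDATE relation_counts
    SET cnt = cnt - 1
    WHERE relation_title = OLD.relation_title
      AND object_title   = OLD.object_title;
END//
DELIMITER ;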


If you use InnoDB, count(*) and any other aggregate function will do a table scan (InnoDB does not keep an exact row count in its metadata). I see a few solutions here:

  1. Use triggers and store aggregates in a separate table. Pros: integrity. Cons: slow updates.
  2. Use processing queues. Pros: fast updates. Cons: old state can persist until the queue is processed so the user may feel a lack of integrity.
  3. Fully separate the storage access layer and store aggregates in a separate table. The storage layer knows the data structure and can apply deltas instead of doing full counts: for example, if it exposes an "addObject" operation, it knows when an object has been added and which aggregate is affected, so it only has to run an UPDATE ... SET count = count + 1 (see the sketch after this list). Pros: fast updates, integrity (though you may want a lock in case several clients can alter the same record). Cons: you couple a bit of business logic into the storage layer.
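For option 3, the "addObject" path in the storage layer might boil down to something like this (table names and values are illustrative, with both statements in one transaction so the counter stays consistent with the data):

START TRANSACTION;

INSERT INTO relations (relation_title, object_title)
VALUES ('owns', 'car');

-- Apply a delta instead of recounting the whole group.
INSERT INTO relation_counts (relation_title, object_title, cnt)
VALUES ('owns', 'car', 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;

COMMIT;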

Here are several things I'd try, in order of increasing difficulty:

(easier) - Make sure you have the right covering index

CREATE INDEX ix_temp ON relations (relation_title, object_title);

This should maximize perf given your existing schema, since (unless your version of MySQL's optimizer is really dumb!) it will minimize the number of I/Os needed to satisfy your query (unlike an index in the reverse column order, where the whole index would have to be scanned), and it will cover the query so you won't have to touch the clustered index.
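You can check that the index really is covering with EXPLAIN; assuming your query looks roughly like this (the 'owns' literal is made up), you want to see "Using index" in the Extra column:

EXPLAIN
SELECT object_title, COUNT(*)
FROM relations
WHERE relation_title = 'owns'
GROUP BY object_title;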

(a little harder) - Make sure your varchar fields are as small as possible

One of the perf challenges with varchar indexes on MySQL is that, when a query needs an in-memory temporary table or sort buffer, space is allocated for the full declared size of the field. So if you have a varchar(256) but are only using 4 chars, you're still paying the 256-byte RAM usage while the query is being processed. Ouch! So if you can shrink your varchar limits easily, this should speed up your queries.
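Shrinking the limits is just an ALTER TABLE; the sizes below are made up, so check your actual maximum lengths first:

-- Find the longest values actually stored before picking new limits.
SELECT MAX(CHAR_LENGTH(relation_title)), MAX(CHAR_LENGTH(object_title))
FROM relations;

ALTER TABLE relations
    MODIFY relation_title VARCHAR(64) NOT NULL,
    MODIFY object_title   VARCHAR(100) NOT NULL;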

(harder) - Normalize

30% of your rows having a single string value is a clear cry for normalizing into another table so you're not duplicating strings millions of times. Consider normalizing into three tables and using integer IDs to join them.
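A rough sketch of that normalization, with illustrative table names:

-- Each distinct string is stored once; the fact table holds only integer IDs.
CREATE TABLE relation_titles (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE object_titles (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE relations_normalized (
    relation_title_id INT NOT NULL,
    object_title_id   INT NOT NULL,
    KEY ix_rel_obj (relation_title_id, object_title_id)
);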

In some cases, you can normalize under the covers and hide the normalization with views which match the name of the current table... then you only need to make your INSERT/UPDATE/DELETE queries aware of the normalization but can leave your SELECTs alone.
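For instance, a view like this (again, illustrative names; it assumes the original relations table has been renamed or replaced) keeps existing SELECTs working unchanged:

CREATE VIEW relations AS
SELECT rt.title AS relation_title, ot.title AS object_title
FROM relations_normalized rn
JOIN relation_titles rt ON rt.id = rn.relation_title_id
JOIN object_titles   ot ON ot.id = rn.object_title_id;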

(hardest) - Hash your string columns and index the hashes

If normalizing means changing too much code, but you can change your schema a little bit, you may want to consider creating 128-bit hashes for your string columns (using the MD5 function). In this case (unlike normalization) you don't have to change all your queries, only the INSERTs and some of the SELECTs. Anyway, you'll want to hash your string fields, and then create an index on the hashes, e.g.

CREATE INDEX ix_temp ON relations (relation_title_hash, object_title_hash);

Note that you'll need to play around with the SELECT to make sure you are doing the computation via the hash index and not pulling in the clustered index (required to resolve the actual text value of object_title in order to satisfy the query).
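A sketch of the hash columns and a SELECT that stays on the hash index (the BINARY(16) type and column names are assumptions; MD5 returns 128 bits, which UNHEX packs into 16 bytes):

-- Add 16-byte binary hash columns and backfill them, then build the
-- ix_temp index shown above on the two hash columns.
ALTER TABLE relations
    ADD COLUMN relation_title_hash BINARY(16),
    ADD COLUMN object_title_hash   BINARY(16);

UPDATE relations
SET relation_title_hash = UNHEX(MD5(relation_title)),
    object_title_hash   = UNHEX(MD5(object_title));

-- Filtering and grouping on the hashes keeps the work in the hash index;
-- selecting the actual object_title text would force a clustered-index lookup.
SELECT object_title_hash, COUNT(*)
FROM relations
WHERE relation_title_hash = UNHEX(MD5('owns'))
GROUP BY object_title_hash;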

Also, if relation_title has a small varchar size but object_title has a long one, then you can potentially hash only object_title and create the index on (relation_title, object_title_hash).

Note that this solution only helps if one or both of these fields is very long relative to the size of the hashes.

Also note that there are interesting case-sensitivity/collation impacts from hashing, since the hash of a lowercase string is not the same as the hash of an uppercase one. So you'll need to canonicalize the strings before hashing them; in other words, hash only lowercased values if you're in a case-insensitive DB. You may also want to trim leading or trailing spaces, depending on how your DB handles them.
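Concretely, the canonicalization might look like this if your collation is case-insensitive (the 'Some Title' literal is just an illustration):

-- Canonicalize before hashing so 'Foo ' and 'foo' produce the same hash.
UPDATE relations
SET object_title_hash = UNHEX(MD5(LOWER(TRIM(object_title))));

-- Apply the same canonicalization when probing the index.
SELECT COUNT(*)
FROM relations
WHERE object_title_hash = UNHEX(MD5(LOWER(TRIM('Some Title'))));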