mysql slow count in join query
I downloaded your data and tried a few experiments. I'm running MySQL 5.6.12 on a CentOS virtual machine on a Macbook Pro. The times I observed can be used for comparison, but your system may have different performance.
Base case
First I tried without the USE INDEX clauses, because I avoid optimizer overrides where possible. In most cases, a simple query like this should use the correct index if it's available. Hard-coding the index choice in a query makes it harder to use a better index later.
I also use correlation names (table aliases) to make the query more clear.
mysql> EXPLAIN SELECT COUNT(*) FROM content_en AS e
INNER JOIN content_page_categories AS c ON c.itemid = e.id
WHERE c.catid = 1 AND e.active = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: combo,combo2
key: combo
key_len: 4
ref: const
rows: 71198
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: e
type: eq_ref
possible_keys: PRIMARY,combo2,combo
key: PRIMARY
key_len: 4
ref: test.c.itemid
rows: 1
Extra: Using where
- This executed in 0.36 seconds.
Covering index
I'd like to get "Using index" on the second table as well, so I need an index on (active, id) in that order. I had to USE INDEX in this case to persuade the optimizer not to use the primary key.
mysql> ALTER TABLE content_en ADD KEY combo2 (active, id);
mysql> explain SELECT COUNT(*) FROM content_en AS e USE INDEX (combo2)
INNER JOIN content_page_categories AS c ON c.itemid = e.id
WHERE c.catid = 1 AND e.active = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: c
type: ref
possible_keys: combo,combo2
key: combo
key_len: 4
ref: const
rows: 71198
Extra: Using index
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: e
type: ref
possible_keys: combo2
key: combo2
key_len: 6
ref: const,test.c.itemid
rows: 1
Extra: Using where; Using index
The rows
reported by EXPLAIN is an important indicator of how much work it's going to take to execute the query. Notice the rows
in the above EXPLAIN is only 71k, much smaller than the 125k rows you got when you scanned the content_en table first.
- This executed in 0.44 seconds. This is unexpected, because usually a query using a covering index is an improvement.
Convert tables to InnoDB
I tried the same covering index solution as above, but with InnoDB as the storage engine.
mysql> ALTER TABLE content_en ENGINE=InnoDB;
mysql> ALTER TABLE content_page_categories ENGINE=InnoDB;
This had the same EXPLAIN report. It took 1 or 2 iterations to warm the buffer pool, but then the performance of the query tripled.
This executed in 0.16 seconds.
I also tried removing the USE INDEX, and the time increased slightly, to 0.17 seconds.
@Matthew's solution with STRAIGHT_JOIN
mysql> SELECT straight_join count(*)
FROM content_en
INNER JOIN content_page_categories use index (combo)
ON (id = itemid)
WHERE catid = 1 AND active = 1;
- This executed in 0.20 - 0.22 seconds.
@bobwienholt's solution, denormalization
I tried the solution proposed by @bobwienholt, using denormalization to copy the active
attribute to the content_page_categories
table.
mysql> ALTER TABLE content_page_categories ADD COLUMN active TINYINT(1);
mysql> UPDATE content_en JOIN content_page_categories ON id = itemid
SET content_page_categories.active = content_en.active;
mysql> ALTER TABLE content_page_categories ADD KEY combo3 (catid,active);
mysql> SELECT COUNT(*) FROM content_page_categories WHERE catid = 1 and active = 1;
This executed in 0.037 - 0.044 seconds. So this is better, if you can maintain the redundant active
column in sync with the value in the content_en
table.
@Quassnoi's solution, summary table
I tried the solution proposed by @Quassnoi, to maintain a table with precomputed counts per catid and active. The table should have very few rows, and looking up the counts you need are primary key lookups and require no JOINs.
mysql> CREATE TABLE page_active_category (
active INT NOT NULL,
catid INT NOT NULL,
cnt BIGINT NOT NULL,
PRIMARY KEY (active, catid)
) ENGINE=InnoDB;
mysql> INSERT INTO page_active_category
SELECT e.active, c.catid, COUNT(*)
FROM content_en AS e
JOIN content_page_categories AS c ON c.itemid = e.id
GROUP BY e.active, c.catid
mysql> SELECT cnt FROM page_active_category WHERE active = 1 AND catid = 1
This executed in 0.0007 - 0.0017 seconds. So this is the best solution by an order of magnitude, if you can maintain the table with aggregate counts.
You can see from this that different types of denormalization (including a summary table) is an extremely powerful tool for the sake of performance, though it has drawbacks because maintaining the redundant data can be inconvenient and makes your application more complex.
The problem is the "active" column in content_en. Obviously, if you just needed to know how many content records were related to a particular category (active or not) all you would have to do is:
SELECT count(1)
FROM content_page_categories
WHERE catid = 1;
Having to join back to every content_en record just to read the "active" flag is really what is slowing this query down.
I recommend adding "active" to content_page_categories and making it a copy of the related value in content_en... you can keep this column up to date using triggers or in your code. Then you can change the combo index to be:
KEY combo (catid,active,itemid)
and rewrite your query to:
SELECT count(1)
FROM content_page_categories USE INDEX (combo)
WHERE catid = 1 AND active = 1;
Also, you may have much better luck using InnoDB tables instead of MyISAM. Just be sure to tune your InnoDB settings: http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/
There are too many records to count.
If you want a faster solution, you'll have to store aggregate data.
MySQL does not support materialized views (or indexed views in SQL Server's terms) so you would need to create and maintain them yourself.
Create a table:
CREATE TABLE
page_active_category
(
active INT NOT NULL,
catid INT NOT NULL,
cnt BIGINT NOT NULL,
PRIMARY KEY
(active, catid)
) ENGINE=InnoDB;
then populate it:
INSERT
INTO page_active_category
SELECT active, catid, COUNT(*)
FROM content_en
JOIN content_page_categories
ON itemid = id
GROUP BY
active, catid
Now, each time you insert, delete or update a record in either content_en
or content_page_categories
, you should update the appropriate record in page_active_category
.
This is doable with two simple triggers on both content_en
and content_page_categories
.
This way, your original query may be rewritten as mere:
SELECT cnt
FROM page_active_category
WHERE active = 1
AND catid = 1
which is a single primary key lookup and hence instant.