How to optimize very slow SELECT with LEFT JOINs over big tables
Pick a few attributes to include in person
. Index them in a few combinations -- use composite indexes, not single-column indexes.
That is essentially the only way out of EAV-sucks-at-performance, which is where you are.
Here is more discussion: http://mysql.rjweb.org/doc.php/eav including a suggestion of using JSON instead of the key-value table.
I hope I found a sufficient solution. It's inspired by this article.
Short answer:
- I've created 1 table with all the attributes. One column for one attribute. Plus primary key column.
- Attribute values are stored in text cells (for full-text searching) in CSV-like format.
- Created full-text indexes. Before that it's important to set
ft_min_word_len=1
(for MyISAM) in[mysqld]
section andinnodb_ft_min_token_size=1
(for InnoDb) inmy.cnf
file, restart mysql service. - Searching example:
SELECT * FROM person_index WHERE MATCH(attribute_1) AGAINST("123 456 789" IN BOOLEAN MODE) LIMIT 1000
where123
,456
a789
are IDs which persons should have associated inattribute_1
. This query took under 1 sec.
Detailed answer:
Step 1. Creating table with fulltext indexes. InnoDb supports fulltext indexes from MySQL 5.7 so if you use 5.5 or 5.6, you should use MyISAM. It's sometimes even faster for FT searching than InnoDb.
CREATE TABLE `person_attribute_ft` (
`person_id` int(11) NOT NULL,
`attr_1` text,
`attr_2` text,
`attr_3` text,
`attr_4` text,
PRIMARY KEY (`person_id`),
FULLTEXT KEY `attr_1` (`attr_1`),
FULLTEXT KEY `attr_2` (`attr_2`),
FULLTEXT KEY `attr_3` (`attr_3`),
FULLTEXT KEY `attr_4` (`attr_4`),
FULLTEXT KEY `attr_12` (`attr_1`,`attr_2`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
Step 2. Insert data from EAV (entity-attribute-value) table. For example stated in question it can be done with 1 simple SQL:
INSERT IGNORE INTO `person_attribute_ft`
SELECT
p.person_id,
(SELECT GROUP_CONCAT(a.attribute_value SEPARATOR ' ') FROM attribute a WHERE a.attribute_type_id = 1 AND a.person_id = p.person_id LIMIT 10) attr_1,
(SELECT GROUP_CONCAT(a.attribute_value SEPARATOR ' ') FROM attribute a WHERE a.attribute_type_id = 2 AND a.person_id = p.person_id LIMIT 10) attr_2,
(SELECT GROUP_CONCAT(a.attribute_value SEPARATOR ' ') FROM attribute a WHERE a.attribute_type_id = 3 AND a.person_id = p.person_id LIMIT 10) attr_3,
(SELECT GROUP_CONCAT(a.attribute_value SEPARATOR ' ') FROM attribute a WHERE a.attribute_type_id = 4 AND a.person_id = p.person_id LIMIT 10) attr_4
FROM person p
Result should be something like this:
mysql> select * from person_attribute_ft limit 10;
+-----------+--------+--------+--------+--------+
| person_id | attr_1 | attr_2 | attr_3 | attr_4 |
+-----------+--------+--------+--------+--------+
| 1 | 541 | 2 | 1927 | 3 |
| 2 | 2862 | 2 | 1939 | 4 |
| 3 | 6573 | 2 | 1904 | 2 |
| 4 | 2432 | 1 | 2005 | 2 |
| 5 | 2208 | 1 | 1995 | 4 |
| 6 | 8388 | 2 | 1973 | 1 |
| 7 | 107 | 2 | 1909 | 4 |
| 8 | 5161 | 1 | 2005 | 1 |
| 9 | 8022 | 2 | 1953 | 4 |
| 10 | 4801 | 2 | 1900 | 3 |
+-----------+--------+--------+--------+--------+
10 rows in set (0.00 sec)
Step 3. Select from table with query like this:
mysql> SELECT SQL_NO_CACHE *
-> FROM `person_attribute_ft`
-> WHERE 1 AND MATCH(attr_1) AGAINST ("3000 3001 3002 3003 3004 3005 3006 3007" IN BOOLEAN MODE)
-> AND MATCH(attr_2) AGAINST ("1" IN BOOLEAN MODE)
-> AND MATCH(attr_3) AGAINST ("1980 1981 1982 1983 1984" IN BOOLEAN MODE)
-> AND MATCH(attr_4) AGAINST ("2,3" IN BOOLEAN MODE)
-> LIMIT 10000;
+-----------+--------+--------+--------+--------+
| person_id | attr_1 | attr_2 | attr_3 | attr_4 |
+-----------+--------+--------+--------+--------+
| 12131 | 3002 | 1 | 1982 | 2 |
| 51315 | 3007 | 1 | 1984 | 2 |
| 147283 | 3001 | 1 | 1984 | 2 |
| 350086 | 3005 | 1 | 1982 | 3 |
| 423907 | 3004 | 1 | 1982 | 3 |
... many rows ...
| 9423907 | 3004 | 1 | 1982 | 3 |
| 9461892 | 3007 | 1 | 1982 | 2 |
| 9516361 | 3006 | 1 | 1980 | 2 |
| 9813933 | 3005 | 1 | 1982 | 2 |
| 9986892 | 3003 | 1 | 1981 | 2 |
+-----------+--------+--------+--------+--------+
90 rows in set (0.17 sec)
The query selects all the rows:
- matching at least one of these IDs in
attr_1
:3000, 3001, 3002, 3003, 3004, 3005, 3006 or 3007
- AND at the same time matching
1
inattr_2
(this column represents gender so if this solution were customized, it should besmallint(1)
with simple index, etc...) - AND at the same time matching at least one of
1980, 1981, 1982, 1983 or 1984
inattr_3
- AND at the same time matching
2
or3
inattr_4
Conclusion:
I know this solution is not perfect and ideal for many situations but can be used as good alternative for EAV table design.
I hope it'll help someone.
Add indeces to attribute
for:
(person_id, attribute_type_id, attribute_value)
and(attribute_type_id, attribute_value, person_id)
Explanation
With your current design EXPLAIN
expects your query to examine 1,265,229 * 4 * 4 * 4 = 80,974,656
rows in attribute
. You can reduce this number by adding a composite index on attribute
for (person_id, attribute_type_id)
. Using this index your query will only examine 1 instead of 4 rows for each of location
, eyecolor
and gender
.
You could extend that index to include attribute_type_value
as well: (person_id, attribute_type_id, attribute_value)
. This would turn this index into a covering index for this query, which should improve performance as well.
Furthermore adding an index on (attribute_type_id, attribute_value, person_id)
(again a covering index by including person_id
) should improve performance over just using an index on attribute_value
where more rows would have to be examined. In this case it will fasten the first step in your explain: selecting a range from bornyear
.
Using those two indeces lowered the execution time of your query on my system from ~2.0 s to ~0.2 s with the explain output looking like this:
+----+-------------+----------+--------+-------------------------------------+-------------------+---------+--------------------------------+---------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+--------+-------------------------------------+-------------------+---------+--------------------------------+---------+----------+--------------------------+
| 1 | SIMPLE | bornyear | range | person_type_value,type_value_person | type_value_person | 9 | | 1861881 | 100.00 | Using where; Using index |
| 1 | SIMPLE | location | ref | person_type_value,type_value_person | person_type_value | 8 | bornyear.person_id,const | 1 | 100.00 | Using where; Using index |
| 1 | SIMPLE | eyecolor | ref | person_type_value,type_value_person | person_type_value | 8 | bornyear.person_id,const | 1 | 100.00 | Using where; Using index |
| 1 | SIMPLE | gender | ref | person_type_value,type_value_person | person_type_value | 13 | bornyear.person_id,const,const | 1 | 100.00 | Using index |
| 1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | bornyear.person_id | 1 | 100.00 | Using index |
+----+-------------+----------+--------+-------------------------------------+-------------------+---------+--------------------------------+---------+----------+--------------------------+