How can I further optimize a derived table query which performs better than the JOINed equivalent?
Well, I found a solution. It took a lot of experimentation, and I think a good bit of blind luck, but here it is:
CREATE TABLE magic ENGINE=MEMORY
SELECT
s.shop_id AS shop_id,
s.id AS shift_id,
st.dow AS dow,
st.start AS start,
st.end AS end,
su.user_id AS manager_id
FROM shifts s
JOIN shift_times st ON s.id = st.shift_id
JOIN shifts_users su ON s.id = su.shift_id
JOIN shift_positions sp ON su.shift_position_id = sp.id AND sp.level = 1
ALTER TABLE magic ADD INDEX (shop_id, dow);
CREATE TABLE tickets_extra ENGINE=MyISAM
SELECT
t.id AS ticket_id,
(
SELECT m.manager_id
FROM magic m
WHERE DAYOFWEEK(t.created) = m.dow
AND TIME(t.created) BETWEEN m.start AND m.end
AND m.shop_id = t.shop_id
) AS manager_created,
(
SELECT m.manager_id
FROM magic m
WHERE DAYOFWEEK(t.resolved) = m.dow
AND TIME(t.resolved) BETWEEN m.start AND m.end
AND m.shop_id = t.shop_id
) AS manager_resolved
FROM tickets t;
DROP TABLE magic;
Lengthy Explanation
Now, I'll explain why this works, and my relative though process and steps to get here.
First, I knew the query I was trying was suffering because of the huge derived table, and the subsequent JOINs onto this. I was taking my well-indexed tickets table and joining all the shift_times data onto it, then letting MySQL chew on that while it attempts to join the shifts and shift_positions table. This derived behemoth would be up to a 2 million row unindexed mess.
Now, I knew this was happening. The reason I was going down this road though was because the "proper" way to do this, using strictly JOINs was taking an even longer amount of time. This is due to the nasty bit of chaos required to determine who the manager of a given shift is. I have to join down to shift_times to find out what the correct shift even is, while simultaneously joining down to shift_positions to figure out the user's level. I don't think the MySQL optimizer handles this very well, and ends up creating a HUGE monstrosity of a temporary table of the joins, then filtering out what doesn't apply.
So, as the derived table seemed to be the "way to go" I stubbornly persisted in this for a while. I tried punting it down into a JOIN clause, no improvement. I tried creating a temporary table with the derived table in it, but again it was too slow as the temp table was unindexed.
I came to realize that I had to handle this calculation of shift, times, positions sanely. I thought, maybe a VIEW would be the way to go. What if I created a VIEW that contained this information: (shop_id, shift_id, dow, start, end, manager_id). Then, I would simply have to join the tickets table by shop_id and the whole DAYOFWEEK/TIME calculation, and I'd be in business. Of course, I failed to remember that MySQL handles VIEWs rather assily. It doesn't materialize them at all, it simply runs the query you would have used to get the view for you. So by joining tickets onto this, I was essentially running my original query - no improvement.
So, instead of a VIEW I decided to use a TEMPORARY TABLE. This worked well if I only fetched one of the managers (created or resolved) at a time, but it was still pretty slow. Also, I found out that with MySQL you can't refer to the same table twice in the same query (I would have to join my temporary table twice to be able to differentiate between manager_created and manager_resolved). This is a big WTF, as I can do it as long as I don't specify "TEMPORARY" - this is where the CREATE TABLE magic ENGINE=MEMORY came into play.
With this pseudo temporary table in hand, I tried my JOIN for just manager_created again. It performed well, but still rather slow. Yet, when I JOINed again to get manager_resolved in the same query the query time ticked back up into the stratosphere. Looking at the EXPLAIN showed the full table scan of tickets (rows ~2mln), as expected, and the JOINs onto the magic table at ~2,087 each. Again, I seemed to be running into fail.
I now began to think about how to avoid the JOINs altogether and that's when I found some obscure ancient message board post where someone suggested using subselects (can't find the link in my history). This is what led to the second SELECT query shown above (the tickets_extra creation one). In the case of selecting just a single manager field, it performed well, but again with both it was crap. I looked at the EXPLAIN and saw this:
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: t
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 173825
Extra:
*************************** 2. row ***************************
id: 3
select_type: DEPENDENT SUBQUERY
table: m
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2037
Extra: Using where
*************************** 3. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: m
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 2037
Extra: Using where
3 rows in set (0.00 sec)
Ack, the dreaded DEPENDENT SUBQUERY. It's often suggested to avoid these, as MySQL will usually execute them in an outside-in fashion, executing the inner query for every row of the outer. I ignored this, and wondered: "Well... what if I just indexed this stupid magic table?". Thus, the ADD index (shop_id, dow) was born.
Check this out:
mysql> CREATE TABLE magic ENGINE=MEMORY
<snip>
Query OK, 3220 rows affected (0.40 sec)
mysql> ALTER TABLE magic ADD INDEX (shop_id, dow);
Query OK, 3220 rows affected (0.02 sec)
mysql> CREATE TABLE tickets_extra ENGINE=MyISAM
<snip>
Query OK, 1933769 rows affected (24.18 sec)
mysql> drop table magic;
Query OK, 0 rows affected (0.00 sec)
Now THAT'S what I'm talkin' about!
Conclusion
This is definitely the first time I've created a non-TEMPORARY table on the fly, and INDEXed it on the fly, simply to do a single query efficiently. I guess I always assumed that adding an index on the fly is a prohibitively expensive operation. (Adding an index on my tickets table of 2mln rows can take over an hour). Yet, for a mere 3,000 rows this is a cakewalk.
Don't be afraid of DEPENDENT SUBQUERIES, creating TEMPORARY tables that really aren't, indexing on the fly, or aliens. They can all be good things in the right situation.
Thanks for all the help StackOverflow. :-D
You should have used Postgres, lol. A simple query like this should not take more than some tens of seconds provided you have enough RAM to avoid disk thrashing.
Anyway.
=> Is the problem in the SELECT or the INSERT ?
(run the SELECT alone on a test server and time it).
=> Is your query disk bound or CPU bound ?
Launch it on a test server and check vmstat output. If it is CPU bound, skip this. If it is disk bound, check the working set size (ie the size of your database). If the working set is smaller than your RAM, it should not be disk bound. You can force loading of a table in the OS cache prior to executing a query by launching a dummy select like SELECT sum( some column ) FROM table. This can be useful if a query selects many rows in random order from a table which is not cached in RAM... you trigger a sequential scan of the table, which loads it in cache, then random access is much faster. With some trickery you can also cache indexes (or just tar your database directory to >/dev/null, lol).
Of course, adding more RAM could help (but you need to check if the query is killing the disk or the CPU first). Or telling MySQL to use more of your RAM in the configuration (key_buffer, etc).
If you are making millions of random HDD seeks you are in PAIN.
=> OK now the query
FIRST, ANALYZE your tables.
LEFT JOIN shift_positions ON su.shift_position_id = shift_positions.id WHERE shift_positions.level = 1
WHY do you LEFT JOIN and then add a WHERE on it ? The LEFT makes no sense. If there is no row in shift_positions, LEFT JOIN will generate a NULL, and the WHERE will reject it.
Solution : use JOIN instead of LEFT JOIN and move (level=1) in the JOIN ON() condition.
While you're at it, also get rid of the other LEFT JOIN (replace by JOIN) unless you are really interested in all those NULLs ? (I guess you are not).
Now you probably can get rid of the subselect.
Next.
WHERE TIME(t.created) BETWEEN shift_times.start AND shift_times.end)
This is not indexable, cause you have a function TIME() in the condition (use Postgres, lol). Lets look at it :
JOIN shift_times ON (shifts.id = shift_times.shift_id AND shift_times.dow = DAYOFWEEK(t.created) AND TIME(t.created) BETWEEN shift_times.start AND shift_times.end)
Ideally you would like to have a multicolumn index on shift_times(shift_id, DAYOFWEEK(t.created),TIME(t.created)) so this JOIN can be indexed.
Solution : add columns 'day','time' to shift_times, containing DAYOFWEEK(t.created),TIME(t.created), filled with correct values using a trigger firing on INSERT or UPDATE.
Now create multicolumn index on (shift_id,day,time)