MySQL indexes - what are the best practices according to this table and queries
One important general note as to why your query isn't getting any faster despite your attempts is that DESC
on indexes is not currently supported on MySQL. See this SO thread, and the source from which it comes.
In this case, your largest problem is in the sheer size of your record. If the engine decides it wouldn't really be faster to use an index, then it won't.
You have a few options, and all are actually pretty decent and can probably help you see significant improvement.
A note on SQL
First, I want to make a quick note about indexing in SQL. While I don't think it's the solution for your woes, it was your main question, and can help.
It usually helps me to think about indexing in three different buckets. The absolutely, the maybe, and the never. You certainly don't have anything in your indexing that's in the never column, but there are some I would consider "maybe" indexes.
absolutely: This is your primary key and any foreign keys. It is also any key you will reference on a very regular basis to pull a small set of data from the massive data you have.
maybe: These are columns which, while you may reference them regularly, are not really referenced by themselves. In fact, through analysis and using EXPLAIN
as @Machavity recommends in his answer, you may find that by the time these columns are used to strip out fields, there aren't that many fields anyway. An example of a column that would solidly be in this pile for me would be the published
column. Keep in mind that every INDEX
adds to the work your queries need to do.
Also: Composite keys are a good choice when you're regularly searching for data based on two different columns. More on that later.
Options, options, options...
There are a number of options to consider, and each one has some drawbacks. Ultimately I would consider each of these on a case-by-case basis as I don't see any of these to be a silver bullet. Ideally, you'd test a few different solutions against your current setting and see which one runs the fastest using a nice scientific test.
- Split your SQL table into two or more separate tables.
This is one of the few times where, despite the number of columns in your table, I wouldn't rush to try to split your table into smaller chunks. If you decided to split it into smaller chunks, however, I'd argue that your [action]edon
, [action]edby_id
, and [action]ed
could easily be put into another table, actions
:
+-----------+-------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| action_id | int(11) | NO | | NULL | |
| action | varchar(45) | NO | | NULL | |
| date | datetime | NO | | CURRENT_TIMESTAMP | |
| user_id | int(11) | NO | | NULL | |
+-----------+-------------+------+-----+-------------------+----------------+
The downside to this is that it does not allow you to ensure there is only one creation date without a TRIGGER
. The upside is that when you don't have to sort as many columns with as many indexes when you're sorting by date. Also, it allows you to sort not only be created
, but also by all of your other actions.
Edit: As requested, here is a sample sorting query
SELECT * FROM listings
INNER JOIN actions ON actions.listing_id = listings.id
WHERE (actions.action = 'published')
AND (listings.published = 1)
AND (listings.cat_id in(1,2,3,4,5))
AND (listings.source_id in(1,2,3,4,5))
AND (actions.actiondate between 1441105258 AND 1443614458)
ORDER BY listings.views DESC
Theoretically, it should cut down on the number of rows you're sorting against because it's only pulling relevant data. I don't have a dataset like yours so I can't test it right now!
If you put a composite key on actiondate
and listings.id
, this should help to increase speed.
As I said, I don't think this is the best solution for you right now because I'm not convinced it's going to give you the maximum optimization. This leads me to my next suggestion:
- Create a month field
I used this nifty tool to confirm what I thought I understood of your question: You are sorting by month here. Your example is specifically looking between September 1st and September 30th, inclusive.
So another option is for you to split your integer function into a month
, day
, and year
field. You can still have your timestamp, but timestamps aren't all that great for searching. Run an EXPLAIN
on even a simple query and you'll see for yourself.
That way, you can just index the month and year fields and do a query like this:
SELECT * FROM listings
WHERE (publishedmonth = 9)
AND (publishedyear = 2015)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
Slap an EXPLAIN
in front and you should see massive improvements.
Because you're planning on referring to a month and a day, you may want to add a composite key against month and year, rather than a key on both separately, for added gains.
Note: I want to be clear, this is not the "correct" way to do things. It is convenient, but denormalized. If you want the correct way to do things, you'd adapt something like this link but I think that would require you to seriously reconsider your table, and I haven't tried anything like this, having lacked the need, and, frankly, will, to brush up on my geometry. I think it's a little overkill for what you're trying to do.
- Do your heavy sorting elsewhere
This was hard for me to come to terms with because I like to do things the "SQL" way wherever possible, but that is not always the best solution. Heavy computing, for example, is best done using your programming language, leaving SQL to handle relationships.
The former CTO of Digg sorted using PHP instead of MySQL and received a 4,000% performance increase. You're probably not scaling out to this level, of course, so the performance trade-offs won't be clearcut unless you test it out yourself. Still, the concept is sound: the database is the bottleneck, and computer memory is dirt cheap by comparison.
There are doubtless a lot more tweaks that can be done. Each of these has a drawback and requires some investment. The best answer is to test two or more of these and see which one helps you get the most improvement.
If I were you, I'd at least INDEX
the fields in question individually. You're building multi-column indices but it's clear you're pulling a lot of disparate records as well. Having the columns indexed individually can't hurt.
Something you should do is use EXPLAIN which lets you look under the hood of how MySQL is pulling the data. It could further point to what is slowing your query down.
EXPLAIN SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
This query:
SELECT *
FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) AND
(published = 1) AND
(cat_id in (1,2,3,4,5)) AND
(source_id in (1,2,3,4,5));
Is hard to optimize with only indexes. The best index is one that starts with published
and then has the other columns -- it is not clear what their order should be. The reason is because all but published
are not using =
.
Because your performance problem is on a sort, that suggests that lots of rows are being returned. Typically, an index is used to satisfy the WHERE
clause before the index can be used for the ORDER BY
. That makes this hard to optimize.
Suggestions . . . None are that great:
- If you are going to access the data by month, then you might consider partitioning the data by month. That will make the query without the
ORDER BY
faster, but won't help theORDER BY
. - Try various orders of columns after
published
in the index. You might find the most selective column(s). But, once again, this speeds the query before the sorting. - Think about ways that you can structure the query to have more equality conditions in the
WHERE
clause or to return a smaller set of data. - (Not really recommended) Put an index on
published
and the ordering column. Then use a subquery to fetch the data. Put the inequality conditions (IN
and so on) in the outer query. The subquery will use the index for sorting and then filter the results.
The reason the last is not recommended is because SQL (and MySQL) do not guarantee the ordering of results from a subquery. However, because MySQL materializes subqueries, the results really are in order. I don't like using undocumented side effects, which can change from version to version.