Does the order of columns in a PK index matter?
Yes it does.
By default, the primary key constraint is enforced in SQL Server by a unique clustered index. The clustered index defines the logical order of rows in the table. There may be a number of extra index pages added to represent the upper levels of the b-tree index, but the lowest (leaf) level of a clustered index is simply the logical order of the data itself.
To be clear about it, rows on a page are not necessarily physically stored in clustered index key order. There is a separate indirection structure within the page that stores a pointer to each row. This structure is sorted by the clustered index keys. Also, each page has a pointer to the previous and next page at the same level in clustered index key order.
With a clustered primary key of `(RowNumber, DataDate)`, the rows are logically sorted first by `RowNumber` and then by `DataDate` - so all rows where `RowNumber = 1` are logically grouped together, then rows where `RowNumber = 2`, and so on.
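To make the arrangement concrete, a minimal sketch of the current design might look like this (the table, constraint, and extra column names are assumptions; only the key columns `RowNumber` and `DataDate` come from the question):

```sql
-- Hypothetical table illustrating the existing key order.
-- Table, constraint, and DataValue column names are assumptions;
-- only (RowNumber, DataDate) comes from the question.
CREATE TABLE dbo.TimeSeriesData
(
    RowNumber  integer       NOT NULL,
    DataDate   datetime2(0)  NOT NULL,
    DataValue  decimal(18,4) NOT NULL,

    CONSTRAINT PK_TimeSeriesData
        PRIMARY KEY CLUSTERED (RowNumber, DataDate)
);
```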
When you add new data (with RowNumbers from 1 to n) the new rows logically belong inside the existing pages, so SQL Server will likely have to do a lot of work splitting pages to make room. All this activity generates a lot of extra work (including logging the changes) for no gain.
Split pages also start off about 50% empty, so excessive splitting can result in low page density (fewer rows than optimal per page) as well. Not only is this bad news for reading from disk (lower density = more pages to read), the lower-density pages also take up more room in memory when cached.
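If you want to measure that effect on your own table, one way (sketched here against the hypothetical table above) is to look at average page fullness alongside fragmentation with `sys.dm_db_index_physical_stats`; note that `avg_page_space_used_in_percent` is only returned in `SAMPLED` or `DETAILED` mode:

```sql
-- Page density and logical fragmentation for all indexes on the table.
-- 'SAMPLED' (or 'DETAILED') is needed to get avg_page_space_used_in_percent.
SELECT
    index_id,
    index_level,
    avg_fragmentation_in_percent,
    avg_page_space_used_in_percent,
    page_count
FROM sys.dm_db_index_physical_stats
    (DB_ID(), OBJECT_ID(N'dbo.TimeSeriesData'), NULL, NULL, 'SAMPLED');
```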
Changing the clustered index to `(DataDate, RowNumber)` means that new data (with, presumably, higher `DataDate` values than currently stored) is appended to the logical end of the clustered index on fresh pages. This will remove the unnecessary overheads of splitting pages and result in faster load times. Less fragmented data also means that read-ahead activity (reading pages from disk just before they are needed for an in-progress query) can be more efficient.
If nothing else, your queries are much more likely to search on `DataDate` than `RowNumber`. A clustered index on `(DataDate, RowNumber)` supports index seeks on `DataDate` (and then `RowNumber`). The existing arrangement only supports seeks on `RowNumber` (and only then, perhaps, on `DataDate`). You might well be able to drop the existing nonclustered index on `DataDate` once the primary key is changed. The clustered index will be wider than the nonclustered index it replaces, so you should test to ensure that performance remains acceptable.
When importing new data with `bcp`, you may get higher performance if the data within the import file is sorted by the clustered index keys (ideally `(DataDate, RowNumber)`) and you specify the `bcp` option:
-h "ORDER(DataDate,RowNumber), TABLOCK"
For best data loading performance, you might try to achieve minimally-logged inserts; a sketch follows the links below. For more information, see:
- SQL Server Index Basics by Robert Sheldon
- Effective Clustered Indexes by Michelle Ufford
- Bulk Inserts Via TSQL by Robert Sheldon
- Minimal Logging with INSERT…SELECT into Empty Clustered Tables by me
- Minimal Logging with INSERT…SELECT and Fast Load Context by me
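As a rough sketch of that pattern (not a complete recipe; the preconditions for minimal logging, such as the recovery model and the `TABLOCK` hint, are covered in the links above, and the staging table here is an assumption):

```sql
-- TABLOCK on the target is one of the preconditions for minimal logging.
-- Presorting by the clustered key keeps the insert append-only.
INSERT INTO dbo.TimeSeriesData WITH (TABLOCK)
    (DataDate, RowNumber, DataValue)
SELECT
    s.DataDate,
    s.RowNumber,
    s.DataValue
FROM dbo.StagingData AS s
ORDER BY
    s.DataDate,
    s.RowNumber;
```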
Yes, the order is critical. I highly doubt you ever query by RowNumber (e.g. `WHERE RowNumber = 1`). Overwhelmingly, time series are queried by date (`WHERE DataDate BETWEEN @start AND @end`) and such queries would require a clustered organization by `DataDate`.
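For example, a typical range query like the sketch below (table name and parameter values are assumptions) becomes a single range seek on a clustered index leading on `DataDate`, whereas with `(RowNumber, DataDate)` it would have to scan:

```sql
DECLARE @start datetime2(0) = '2024-01-01',
        @end   datetime2(0) = '2024-01-31';

-- Single range seek when the clustered key leads on DataDate.
SELECT RowNumber, DataDate, DataValue
FROM dbo.TimeSeriesData
WHERE DataDate BETWEEN @start AND @end
ORDER BY DataDate, RowNumber;
```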
Fragmentation in general is a red herring. Reducing fragmentation should not be your goal here; having an organization that matches your queries should be. Getting reduced fragmentation in addition is a good thing to have, but it is not a goal on its own. If you have a properly organized data model that matches your workload (your queries are properly covered) and you have measurements that show fragmentation impacting performance, then we can talk about it.