Why don't databases create their own indexes automatically?
Update
This is now implemented in Azure SQL Database. It generates recommendations,
and index management can be configured to be automatic.
Enable automatic index management
You can set the SQL Database Advisor to implement recommendations automatically. As recommendations become available, they will automatically be applied. As with all index operations managed by the service, if the performance impact is negative, the recommendation will be reverted.
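At the time of writing the same options can also be set with T-SQL. A minimal sketch, assuming an Azure SQL Database and sufficient permissions:

```sql
-- Enable automatic index creation and removal for the current database.
-- CREATE_INDEX and DROP_INDEX are the automatic tuning options that
-- correspond to the index recommendations described above.
ALTER DATABASE CURRENT
SET AUTOMATIC_TUNING (CREATE_INDEX = ON, DROP_INDEX = ON);

-- Check the desired versus actual state of each tuning option.
SELECT name, desired_state_desc, actual_state_desc
FROM sys.database_automatic_tuning_options;
```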
Original Answer
Some databases do already (kind of) create indexes automatically.
In SQL Server the execution plan can sometimes include an Index Spool operator, where the RDBMS dynamically creates an indexed copy of the data. However, this spool is not a persistent part of the database kept in sync with the source data, and it cannot be shared between query executions, meaning execution of such plans may end up repeatedly creating and dropping temporary indexes on the same data.
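A hypothetical repro sketch (whether a spool actually appears depends on the optimizer's costing, the statistics, and the product version):

```sql
-- Two heaps with no useful index on #Orders(CustomerId): the optimizer may
-- satisfy the repeated inner-side lookups with an Eager Index Spool,
-- building a temporary index that is thrown away when the plan finishes.
CREATE TABLE #Customers (CustomerId int NOT NULL);
CREATE TABLE #Orders (OrderId int NOT NULL, CustomerId int NOT NULL);

INSERT INTO #Customers (CustomerId)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns;

INSERT INTO #Orders (OrderId, CustomerId)
SELECT TOP (100000)
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
       ABS(CHECKSUM(NEWID())) % 1000 + 1
FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b;

-- Examine the actual execution plan of this query for an Index Spool.
SELECT c.CustomerId,
       (SELECT COUNT(*)
        FROM #Orders AS o
        WHERE o.CustomerId = c.CustomerId) AS order_count
FROM #Customers AS c;
```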
Perhaps in the future RDBMSs will have the capacity to dynamically drop and create persistent indexes according to workload.
The process of index optimisation is in the end just a cost-benefit analysis. Whilst it is true that humans may have more information about the relative importance of queries in a workload, in principle there is no reason why this information could not be made available to the optimiser. SQL Server already has a resource governor that allows sessions to be classified into different workload groups with different resource allocations according to priority.
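For illustration, a minimal Resource Governor classification might look like the following sketch (the group names and the login used by the classifier are hypothetical; this runs in master and needs an edition that supports Resource Governor):

```sql
USE master;
GO
-- Two workload groups with different priorities, both using the default pool.
CREATE WORKLOAD GROUP ReportingGroup WITH (IMPORTANCE = LOW);
CREATE WORKLOAD GROUP OltpGroup WITH (IMPORTANCE = HIGH);
GO
-- Classifier function: routes each new session to a workload group.
-- Real classifiers often use APP_NAME(), HOST_NAME(), etc. as well.
CREATE FUNCTION dbo.rg_classifier() RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE SUSER_SNAME()
               WHEN N'reporting_user' THEN N'ReportingGroup'
               ELSE N'OltpGroup'
           END;
END;
GO
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```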
The missing index DMVs mentioned by Kenneth are not intended to be implemented blindly, as they only consider the benefit to a specific query and make no attempt to take account of the cost of the potential index to other queries. Nor do they consolidate similar missing indexes: e.g. their output may report missing indexes on both A,B,C
and A,B INCLUDE(C)
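For concreteness, the raw data behind those suggestions can be inspected like this (a sketch; note that each row describes an index that would have helped one particular query shape, the benefit figures ignore maintenance costs, and overlapping suggestions are reported separately):

```sql
-- Missing index suggestions recorded by the optimizer since last restart.
SELECT d.statement AS table_name,
       d.equality_columns,
       d.inequality_columns,
       d.included_columns,
       s.user_seeks,
       s.avg_user_impact
FROM sys.dm_db_missing_index_details AS d
JOIN sys.dm_db_missing_index_groups AS g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats AS s
    ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;
```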
Some current issues with the idea are:
- The quality of any automated analysis that does not actually create the index will be highly dependent upon the accuracy of the costing model.
- Even within the field of automated analysis, an offline solution will be able to be more thorough than an online one, as it is imperative that an online solution not add large bookkeeping overhead to the live server or interfere with its primary purpose of executing queries.
- Indexes created automatically in response to workload will necessarily lag behind solutions that create the indexes in advance, as they can only be created after queries that would have found them useful have already run.
It is probably reasonable to expect the accuracy of costing models to improve over time, but the second point looks trickier to solve and the third is inherently insoluble.
Nevertheless, probably the vast majority of installs are not in that idealised situation, with skilled staff who continuously monitor, diagnose, and anticipate (or at least react to) changes in workload.
The AutoAdmin project at Microsoft Research has been running since 1996
The goal of this project is to make databases self-tuning and self-administering by exploiting knowledge of the workload
The project home page lists several intriguing projects. One is particularly relevant to the question here
Another interesting problem arises when there is no DBA available (e.g. an embedded database or a small business). In such scenarios, a low touch continuous index tuning approach may become important. We have explored solutions ...[in] “An Online Approach to Physical Design Tuning” in ICDE 2007.
The authors state
With increasingly common DBMS features like online indexes, it is appealing to explore more automatic solutions to the physical design problem that advance the state of the art.
The paper introduces an algorithm
Its main characteristics are:
- As queries are optimized, we identify a relevant set of candidate indexes that would improve performance. This feature allows query processing to continue in parallel with indexes that are built in the background.
- At execution time, we track the potential benefits that we lose by not having such candidate indexes and also the utility of existing indexes in the presence of queries, updates, and space constraints.
- After we gather enough “evidence” that a physical design change is beneficial, we automatically trigger index creations or deletions.
- The online nature of our problem implies that we will generally lag behind optimal solutions that know the future. However, by carefully measuring evidence, we ensure that we do not suffer from “late” decisions significantly, thus bounding the amount of incurred loss.
The implementation of the algorithm allows for throttling in response to changes in server load, and it can also abort index creation if, while the index is being built, the workload changes and the expected benefit falls below the point at which it is deemed worthwhile.
The authors' conclusion on the topic of online versus traditional physical tuning:
The online algorithms in this work are useful when DBAs are uncertain about the future behavior of the workload, or have no possibility of doing a comprehensive analysis or modelling. If a DBA has full information about the workload characteristics, then a static analysis and deployment by existing tools (e.g., [2, 3]) would be a better alternative.
The conclusions here are similar to those of another paper, Autonomous Query-driven Index Tuning:
Our approach cannot beat the index advisor if the whole workload is known in advance. However, in dynamic environments with evolving and changing workloads the query-driven approach produces better results.
The index design that you put in place is more of an art than a science. The RDBMS isn't smart enough to take common workloads and design a smart indexing strategy. It is up to human intervention (read: a DBA) to analyze the workload and determine the best approach.
If there were no penalty for having indexes, you could take the shotgun approach and just add an infinite number of them. But because data modifications (INSERTs, UPDATEs, and DELETEs) have an impact on the enabled indexes on a table, there is going to be variable overhead from those indexes.
It takes human design and strategy to smartly create indexes that will maximize read performance while incurring the least data-modification overhead.
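On SQL Server, for example, part of that analysis is weighing how often each index is actually read against how often it must be maintained. A sketch of the kind of query a DBA might use (counters in this DMV reset when the instance restarts, so judge them over a representative period):

```sql
-- Reads versus maintenance writes per index in the current database.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups AS reads,
       s.user_updates AS writes
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
    ON i.object_id = s.object_id
   AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
ORDER BY s.user_updates
         - (s.user_seeks + s.user_scans + s.user_lookups) DESC;
-- Indexes near the top are candidates whose upkeep may outweigh their use.
```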
In fact, there are some databases that do this. For example, Google's BigTable and Amazon's SimpleDB automatically create indices (though neither is an RDBMS). There is also at least one MySQL storage engine that does this. SQL Server also keeps track of the indices it thinks you should create, though it doesn't go so far as actually creating them.
The problem is surprisingly difficult to get right, so it's no wonder that most databases don't create indexes automatically (BigTable and SimpleDB get away with it because they don't allow arbitrary joins, which makes things significantly easier). Also, creating indices on the fly is a time-consuming process that traditionally requires exclusive access to the entire table, which is definitely not something you want happening while the table is online.
However, given the number of LAMP web applications out there that were written by amateurs who don't even know what an index is, I still think this feature would be beneficial for some people.