Difference between local and global indexes in DynamoDB
Here is the formal definition from the documentation:
Global secondary index — an index with a hash and range key that can be different from those on the table. A global secondary index is considered "global" because queries on the index can span all of the data in a table, across all partitions.
Local secondary index — an index that has the same hash key as the table, but a different range key. A local secondary index is "local" in the sense that every partition of a local secondary index is scoped to a table partition that has the same hash key.
However, the differences go well beyond what is possible in terms of key definitions. Below are some important factors that directly impact the cost and effort of maintaining the indexes:
- Throughput:
Local Secondary Indexes consume throughput from the table. When you query records via the local index, the operation consumes read capacity units from the table. When you perform a write operation (create, update, delete) on a table that has a local index, there are two write operations: one for the table and another for the index. Both consume write capacity units from the table.
Global Secondary Indexes have their own provisioned throughput. When you query the index, the operation consumes read capacity from the index; when you perform a write operation (create, update, delete) on a table that has a global index, there are two write operations: one for the table and another for the index* (a table-definition sketch showing both index types follows this list).
*When defining the provisioned throughput for the Global Secondary Index, make sure you pay special attention to the following requirement:
In order for a table write to succeed, the provisioned throughput settings for the table and all of its global secondary indexes must have enough write capacity to accommodate the write; otherwise, the write to the table will be throttled.
- Management:
Local Secondary Indexes can only be created when you create the table; there is no way to add a Local Secondary Index to an existing table, and once you create the index you cannot delete it.
Global Secondary Indexes can be created when you create the table or added to an existing table later; deleting an existing Global Secondary Index is also allowed.
- Read Consistency:
Local Secondary Indexes support eventual or strong consistency, whereas Global Secondary Indexes only support eventual consistency.
- Projection:
Local Secondary Indexes allow retrieving attributes that are not projected into the index (although at additional cost in performance and consumed capacity units). With Global Secondary Indexes you can only retrieve attributes that are projected into the index.
Special consideration about the uniqueness of the keys defined for secondary indexes:
In a Local Secondary Index, the range key value DOES NOT need to be unique for a given hash key value. The same applies to Global Secondary Indexes: the key values (hash and range) DO NOT need to be unique.
Source: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SecondaryIndexes.html
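To make the key-schema and throughput differences above concrete, here is a minimal sketch of a table definition that declares one index of each type, using the low-level AWS SDK for Java (v1). The "Orders" table and all attribute and index names are made up for illustration; the points to notice are that the LSI reuses the table's hash key, while the GSI declares its own hash key and its own provisioned throughput.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class CreateTableWithIndexes {
    public static void main(String[] args) {
        CreateTableRequest request = new CreateTableRequest()
                .withTableName("Orders")
                .withAttributeDefinitions(
                        new AttributeDefinition("CustomerId", ScalarAttributeType.S),
                        new AttributeDefinition("OrderId", ScalarAttributeType.S),
                        new AttributeDefinition("OrderDate", ScalarAttributeType.S),
                        new AttributeDefinition("ProductId", ScalarAttributeType.S))
                .withKeySchema(
                        new KeySchemaElement("CustomerId", KeyType.HASH),
                        new KeySchemaElement("OrderId", KeyType.RANGE))
                // the table's throughput is also what any LSI consumes
                .withProvisionedThroughput(new ProvisionedThroughput(10L, 5L))
                .withLocalSecondaryIndexes(new LocalSecondaryIndex()
                        .withIndexName("OrderDateIndex")
                        .withKeySchema(
                                new KeySchemaElement("CustomerId", KeyType.HASH), // same hash key as the table
                                new KeySchemaElement("OrderDate", KeyType.RANGE)) // different range key
                        .withProjection(new Projection().withProjectionType(ProjectionType.ALL)))
                .withGlobalSecondaryIndexes(new GlobalSecondaryIndex()
                        .withIndexName("ProductIdIndex")
                        .withKeySchema(
                                new KeySchemaElement("ProductId", KeyType.HASH))  // a completely different hash key
                        .withProjection(new Projection().withProjectionType(ProjectionType.KEYS_ONLY))
                        // the GSI carries its own read/write capacity
                        .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)));

        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
        client.createTable(request);
    }
}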
These are the possible searches by index:
- By Hash
- By Hash + Range
- By Hash + Local Index
- By Global index
- By Global index + Range Index
Hash and Range indexes of a table: These are the usual indexes of previous versions of the Amazon AWS SDK.
Global and Local indexes: These are 'additional' indexes created on a table, on top of the table's existing hash and range indexes. A global index is similar to a hash key, and its range index behaves like the range key used with the table's hash key. In your entity model in your code, the getter must be annotated in this way:
For global indexes:
@DynamoDBIndexHashKey(globalSecondaryIndexName = INDEX_GLOBAL_RANGE_US_TS)
@DynamoDBAttribute(attributeName = PROPERTY_USER)
public String getUser() { return user; }
For the range key associated with the global index:
@DynamoDBIndexRangeKey(globalSecondaryIndexName = INDEX_GLOBAL_RANGE_US_TS)
@DynamoDBAttribute(attributeName = PROPERTY_TIMESTAMP)
public String getTimestamp() { return timestamp; }
Also, if you read a table through a global index, it must be an eventually consistent read (not a consistent read):
queryExpression.setConsistentRead(false);
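Putting those pieces together, a query against the global index through DynamoDBMapper could look like the sketch below. It assumes a hypothetical @DynamoDBTable-annotated entity class (called Event here, with a setUser method) whose getters carry the annotations shown above, and it omits the com.amazonaws.services.dynamodbv2 imports for brevity.

// Sketch only: Event is a hypothetical @DynamoDBTable-annotated class whose
// getUser()/getTimestamp() getters are annotated as shown above.
DynamoDBMapper mapper = new DynamoDBMapper(AmazonDynamoDBClientBuilder.defaultClient());

Event hashKeyValues = new Event();
hashKeyValues.setUser("some-user");              // value of the global index's hash key

DynamoDBQueryExpression<Event> queryExpression = new DynamoDBQueryExpression<Event>()
        .withIndexName(INDEX_GLOBAL_RANGE_US_TS) // query the global secondary index
        .withHashKeyValues(hashKeyValues)
        .withConsistentRead(false);              // global indexes only allow eventual reads

List<Event> results = mapper.query(Event.class, queryExpression);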
One way to put it is this:
LSI - allows you to perform a query on a single Hash-Key while using multiple different attributes to "filter" or restrict the query.
GSI - allows you to perform queries on multiple Hash-Keys in a table, but costs extra in throughput, as a result.
A more extensive breakdown of the table types and how they work is below:
Hash Only
As you probably already know, a Hash-Key by itself must be unique, as writing to a Hash-Key that already exists will overwrite the existing data.
Hash+Range
A Hash-Key + Range-Key allows you to have multiple items with the same Hash-Key, as long as they have different Range-Keys. In this case, if you write to a Hash-Key that already exists but use a Range-Key that is not already used by that Hash-Key, a new item is created, whereas if an item with the same Hash+Range combination already exists, the matching item is overwritten.
Another way to think of this is like a file with a format. You can have a file with the same name (hash) as another, in the same folder (table), as long as their format (range) is different. Likewise, you can have multiple files of the same format as long as their name is different.
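As a rough illustration of that overwrite rule, here is a sketch using the low-level AWS SDK for Java against a hypothetical "Files" table keyed on fileName (hash) + fileFormat (range); table and attribute names are invented for the example, and imports from com.amazonaws.services.dynamodbv2 are omitted.

AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

Map<String, AttributeValue> item = new HashMap<>();
item.put("fileName", new AttributeValue("report"));
item.put("fileFormat", new AttributeValue("pdf"));
item.put("sizeKb", new AttributeValue().withN("120"));
client.putItem("Files", item);   // creates the item report + pdf

item.put("fileFormat", new AttributeValue("csv"));
client.putItem("Files", item);   // same hash, different range: a second item is created

item.put("fileFormat", new AttributeValue("pdf"));
item.put("sizeKb", new AttributeValue().withN("200"));
client.putItem("Files", item);   // same hash + range as the first put: overwrites report + pdf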
LSI
An LSI is basically the same as a Hash-Key + Range-Key and follows the same rules when creating items, except that you must also provide values for the LSIs; they cannot be left empty/null.
To say an LSI is "Range-Key 2" is not entirely correct, as you cannot have (using my file and format analogy from earlier) a file named file.format.lsi and file.format.lsi2. You can, however, have file.format.lsi and file.format2.lsi, or file.format.lsi and file2.format.lsi.
Basically, an LSI is just a "Filter-key", not an actual Range-Key; your base Hash and Range value combination must still be unique, while the LSI values do not have to be unique at all. An easier way to look at it may be to think of the LSI as data within the files. You could write code that finds all the files with the name "PROJECT101", regardless of their fileFormat, then reads the data inside to determine what should be included in the query and what is omitted. This is basically how LSI works (just without the extra overhead of opening the file to read its contents).
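A query against such an LSI could look like the sketch below (low-level API, continuing the hypothetical "Files" table from earlier; the dateCreated attribute and the index name dateCreatedIndex are invented). The hash key condition still uses the table's own hash key; only the range condition switches to the LSI's attribute.

Map<String, AttributeValue> values = new HashMap<>();
values.put(":name", new AttributeValue("PROJECT101"));
values.put(":since", new AttributeValue("2015-01-01"));

QueryRequest query = new QueryRequest()
        .withTableName("Files")
        .withIndexName("dateCreatedIndex")   // the LSI; reads consume the table's capacity
        .withKeyConditionExpression("fileName = :name AND dateCreated >= :since")
        .withExpressionAttributeValues(values);

QueryResult result = client.query(query);    // client as in the earlier sketch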
GSI
For GSI, you're essentially creating another table for each GSI, but without the hassle of maintaining multiple separate tables that mirror data between them; this is why they cost more throughput.
So for a GSI, you could specify fileName as your base Hash-Key, and fileFormat as your base Range-Key. You can then specify a GSI that has a Hash-Key of fileName2 and a Range-Key of fileFormat2. You can then query on either fileName or fileName2 if you like, unlike LSI where you can only query on fileName.
The main advantages are that you only have to maintain one table, instead of 2, and anytime you write to either the primary Hash/Range or the GSI Hash/Range(s), the other(s) will automatically be updated as well, so you can't "forget" to update the other table(s) like you can with a multi-table setup. Also, there's no chance of a lost connection after updating one and before updating the other, like there is with the multi-table setup.
Additionally, a GSI can "overlap" the base Hash/Range combination. So if you wanted to make a table with fileName and fileFormat as your base Hash/Range and filePriority and fileName as your GSI, you can.
Lastly, a GSI Hash+Range combination does not have to be unique, while the base Hash+Range combination does. This is something that is not possible with a dual/multi-table setup, but is with GSI. As a result, you MUST provide values for both the base and the GSI Hash+Range when updating; none of these values can be empty/null.
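For example (again only a sketch with invented names, assuming the GSI above was created as fileName2-fileFormat2-index), querying by the alternate hash key looks just like a normal query, except that the index name is supplied and the read is eventually consistent:

Map<String, AttributeValue> values = new HashMap<>();
values.put(":name2", new AttributeValue("PROJECT101"));

QueryRequest query = new QueryRequest()
        .withTableName("Files")
        .withIndexName("fileName2-fileFormat2-index") // the GSI
        .withKeyConditionExpression("fileName2 = :name2")
        .withConsistentRead(false)                    // GSIs only support eventual consistency
        .withExpressionAttributeValues(values);

QueryResult result = client.query(query);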
Local Secondary Indexes still rely on the original Hash Key. When you create a table with hash+range, think of the LSI as hash+range1, hash+range2, ... hash+range6: you get five more range attributes to query on. Also, there is only one provisioned throughput, shared with the table.
Global Secondary Indexes define a new paradigm - different hash/range keys per index.
This breaks the original usage of one hash key per table.
This is also why, when defining a GSI, you are required to add a provisioned throughput per index and pay for it.
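As a sketch of both points (low-level AWS SDK for Java, with invented table and attribute names), adding a GSI to an existing table, together with its own provisioned throughput, is a single UpdateTable call, something that is not possible for an LSI:

AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

UpdateTableRequest update = new UpdateTableRequest()
        .withTableName("Files")
        .withAttributeDefinitions(
                new AttributeDefinition("filePriority", ScalarAttributeType.N))
        .withGlobalSecondaryIndexUpdates(new GlobalSecondaryIndexUpdate()
                .withCreate(new CreateGlobalSecondaryIndexAction()
                        .withIndexName("filePriorityIndex")
                        .withKeySchema(new KeySchemaElement("filePriority", KeyType.HASH))
                        .withProjection(new Projection().withProjectionType(ProjectionType.ALL))
                        // each GSI gets, and is billed for, its own throughput
                        .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)))));

client.updateTable(update);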
More detailed information about the differences can be found in the GSI announcement.