When using UUIDs, should I also use AUTO_INCREMENT?
Once you have a big distributed data warehouse, if you use UUID or GUID as unique key and use it in join later on, it is not good.
Instead of using UUID or GUID, please create sequential surrogate key in your master database or in your data pipeline.
Share our project experience as a reference. We have 300 Billion records saved in parallel data warehouse, in our system, auto incremental key even not supported. We use 8 bytes bigint as primary key (actually unique key in our system is not supported either, but that's not affect logical uniqueness), when we processing file and load file, we use 3 bytes to generate file ID, which is 2^24 files, we have about 2,000 files need to load per day, so, 2^24 can support about 25 years, if it is not wrong.
We use the rest of 4 bytes as row id, which is 4 billion rows, we don't have 4 billion rows in any file. We reserve 1 byte. During the ETL processing, we only need to track the file ID in the master database, which support auto incremental ID, when we need to generate record ID when processing file, we combine FileID+reserve 1 byte+4 bytes rowID.
From https://uuid.fyi/uuidorint
UUID
Pros
- Globally unique.
- Stateless, it can be generated on the fly.
- Secure since malicious user can't guess the ID.
- Version 1 UUID stores timestamp info, could be useful sometimes.
Cons
- Not readable.
- For database like MySQL, Oracle, which uses clustered primary key, version 4 randomly generated UUID will hurt insertion performance if used as the primary key. This is because it requires reordering the rows in order to place the newly inserted row at the right position inside the clustered index. On the other hand, PostgreSQL uses heap instead of clustered primary key, thus using UUID as the PK won't impact PostgreSQL's insertion performance.
Auto Increment Integer/Sequence
Pros
- Readable. This is especially valuable if we would expose it externally. Thinking of issue id, obviously, issue-123 is much more readable than issue-b1e92c3b-a44a-4856-9fe3-925444ac4c23.
Cons
- It can't be used in the distirbuted system since it's quite likely that different hosts could produce exactly the same number.
- It can't be generated on the fly. Instead, we must consult the database to figure out the next available PK.
- Some business data can be exposed, since the latest ID could represent the total number of inventory. Attackers can also scan the integer range to explore leakage (though it shouldn't happen if ACL is implemented correctly).
Using autoincrements as primary plus an uuid column is a valid model, but you would still have to struggle with some problems the autoincrements brings, it all depends on how you do the synchros.
Anyway I've been working with uuid's as primary keys (my current database have half a million records) and it's still pretty fast, it only slow downs a bit on the inserts, but unless you have very high volumes of inserts daily it shouldn't scare you.
If you use Sql-Server another solution you could have a look at is the Sequential UUIDs, which have a slightly greater collision chances than normal UUID's, but the absolute collision chances are still pretty low, and as they are partially sequential that covers the problems with the fragmentation.