Generating v5 UUID. What is name and namespace?
Name and namespace can be used to create a hierarchy of (very probably) unique UUIDs.
Roughly speaking, a type 3 or type 5 UUID is generated by hashing together a namespace identifier with a name. Type 3 UUIDs use MD5 and type 5 UUIDs use SHA1. Only 128-bits are available and 5 bits are used to specify the type, so all of the hash bits don't make it into the UUID. (Also MD5 is considered cryptographically broken, and SHA1 is on its last legs, so don't use this to verify data that needs to be "very secure"). That said, it gives you a way of creating a repeatable/verifiable "hash" function mapping a possibly hierarchical name onto a probabilistically unique 128-bit value, potentially acting like a hierarchical hash or MAC.
Suppose you have a (key,value) store, but it only supports one namespace. You can generate a large number of distinct logical namespaces using type 3 or type 5 UUIDs. First, create a root UUID for each namespace. This could be a type 1 (host+timestamp) or type 4 (random) UUID so long as you stash it somewhere. Alternatively you could create one random UUID for your root (or use the null
UUID: 00000000-0000-0000-0000-000000000000
as root) and then create a reproducible UUID for each namespace using "uuid -v5 $ROOTUUID $NAMESPACENAME
". Now you can create unique UUIDs for keys within a namespace using "uuid -v5 $NAMESPACEUUID $KEY
". These UUIDs can be thrown into a single key-value store with high probability of avoiding collision. This process can be repeated recursively so that if for instance the "value" associated with a UUID key in turn represents some sort of logical "namespace" like a bucket, container or directory, then its UUID can be used in turn to generate more hierarchical UUIDs.
The generated type 3 or type 5 UUID holds a (partial) hash of the namespace id and name-within-namespace (key). It no more holds the namespace UUID than does a message MAC hold the contents of the message it is encoded from. The name is an "arbitrary" (octet) string from the perspective of the uuid algorithm. Its meaning however depends on your application. It could be a filename within a logical directory, object-id within an object-store, etcetera.
While this works well for a moderately large number of namespaces and keys, it eventually runs out of steam if you are aiming for a very large numbers of keys that are unique with very high probability. The Wikipedia entry for the Birthday Problem (aka Birthday Paradox) includes a table that gives the probabilities of at least one collision for various numbers of keys and table sizes. For 128-bits, hashing 26 billion keys this way has a probability of collision of p=10^-18
(negligible), but 26 trillion keys, increases the probability of at least one collision to p=10^-12
(one in a trillion), and hashing 26*10^15
keys, increases the probability of at least one collision to p=10^-6
(one in a million). Adjusting for 5 bits that encode the UUID type, it will run out somewhat faster, so a trillion keys have roughly a 1-in-a-trillion chance of having a single collision.
See http://en.wikipedia.org/wiki/Birthday_problem#Probability_table for the probability table.
See http://www.ietf.org/rfc/rfc4122.txt for more details on UUID encodings.
Type 3 and Type 5 UUIDs are just a technique of stuffing a hash into a UUID:
- Type 1: stuffs MAC address+datetime into 128 bits
- Type 3: stuffs an MD5 hash into 128 bits
- Type 4: stuffs random data into 128 bits
- Type 5: stuffs an SHA1 hash into 128 bits
- Type 6: unofficial idea for sequential UUIDs
Edit: Unofficial type 6 now has an official rfc
An SHA1 hash outputs 160 bits (20 bytes); the result of the hash is converted into a UUID.
With a 20-byte digest from SHA1:
SHA1 Digest: 74738ff5 5367 e958 1aee 98fffdcd1876 94028007
UUID (v5): 74738ff5-5367-5958-9aee-98fffdcd1876
⭡ ⬑first two bits set to 1 and 0, respectively
╰─low nibble is set to 5, to indicate type 5
What do I hash?
You're probably wondering what is it that I'm supposed to hash. Basically you hash the concatenation of:
sha1(NamespaceUUID+AnyString);
You prefix your string with a so-called namespace to prevent name conflicts.
The UUID RFC pre-defines four namespaces for you:
NameSpace_DNS
: {6ba7b810-9dad-11d1-80b4-00c04fd430c8}NameSpace_URL
: {6ba7b811-9dad-11d1-80b4-00c04fd430c8}NameSpace_OID
: {6ba7b812-9dad-11d1-80b4-00c04fd430c8}NameSpace_X500
:{6ba7b814-9dad-11d1-80b4-00c04fd430c8}
So, you could hash together:
StackOverflowDnsUUID = sha1(Namespace_DNS + "stackoverflow.com");
StackOverflowUrlUUID = sha1(Namespace_URL + "stackoverflow.com");
The RFC then defines how to:
- take the 160 bits from SHA1
- and convert it into 128 bits of a UUID
The basic gist is to only take the first 128 bits, stuff a 5
in the type record, and then set the first two bits of the clock_seq_hi_and_reserved
section to 1 and 0, respectively.
More examples
Now that you have a function that generates a so-called Name , you can have the function (in pseudo-code):
UUID NameToUUID(UUID NamespaceUUID, String Name)
{
//Note: All code on stackoverflow is public domain - no attribution required.
Byte[] hash = sha1(NamespaceUUID.ToBytes() + Name.ToBytes());
Uuid result;
//Copy first 16-bytes of the hash into our Uuid result
Copy(hash, result, 16);
//set high-nibble to 5 to indicate type 5
result[6] &= 0x0F;
result[6] |= 0x50;
//set upper two bits to "10"
result[8] &= 0x3F;
result[8] |= 0x80;
return result;
}
(Note: the endian-ness of your system can affect indices of the above bytes)
Now you can have calls:
uuid = NameToUUID(Namespace_DNS, 'www.stackoverflow.com');
uuid = NameToUUID(Namespace_DNS, 'www.google.com');
uuid = NameToUUID(Namespace_URL, 'http://www.stackoverflow.com');
uuid = NameToUUID(Namespace_URL, 'http://www.google.com/search&q=rfc+4112');
uuid = NameToUUID(Namespace_URL, 'http://stackoverflow.com/questions/5515880/test-vectors-for-uuid-version-5-converting-hash-into-guid-algorithm');
Now back to your question
For version 3 and version 5 UUIDs the additional command line arguments namespace and name have to be given. The namespace is either a UUID in string representation or an identifier for internally pre-defined namespace UUIDs (currently known are "ns:DNS", "ns:URL", "ns:OID", and "ns:X500"). The name is a string of arbitrary length.
The namespace is whatever UUID you like. It can be one of the pre-defined ones, or you can make up your own, e.g.1:
UUID Namespace_RectalForeignExtractedObject = '8e884ace-bee4-11e4-8dfc-aa07a5b093db'
The name is a string of arbitrary length.
The name is just the text you want to have appended to the namespace, then hashed, and stuffed into a UUID:
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'screwdriver');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'toothbrush');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'broomstick');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'orange');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'axe handle');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'impulse body spray');
uuid = NameToUUID('8e884ace-bee4-11e4-8dfc-aa07a5b093db', 'iPod Touch');
A name is nothing more than an identifier that is unique within some namespace. The problem is that namespaces are often quite small and the names in one often collide with names in others. For instance, my car's license plate number (name) is unique within my state DMV's namespace, but it's probably not unique in the world; other state DMVs may have used the same name in their own namespaces. Heck, someone else may have a phone number (name) that also matches because that's yet another namespace, etc.
UUIDs can be seen as inhabiting a single namespace so vast that it can provide a unique name for everything; that's what the "universal" means. But how do you map existing names in other namespaces to a UUID?
One obvious solution is to generate a UUID (V1 or V4) for every item to replace the old names in their disjoint namespaces. The downside is that they're a lot bigger, you have to communicate all the new names to everyone who has a copy of your dataset, update all your APIs, etc. Odds are you can't actually get rid of the old names entirely anyway, which means now every item has two names, so did you make things better or worse?
This is where V3/V5 come in. The UUIDs look just as random as V4 but are actually deterministic; anyone who has the right UUID for a namespace can then independently generate the same UUID for any given name within that namespace. You don't need to publish them at all nor even pre-generate them since anyone can create them on the fly as needed!
DNS names and URLs are very commonly used namespaces, so standard UUIDs were published for those; ASN.1 OIDs and X.500 names aren't as common, but standards bodies love them, so they published standard namespace UUIDs for them too.
For all other namespaces, you have to generate your own namespace UUID (V1 or V4) and communicate it to anyone who needs it. If you have several namespaces, having to publish the UUID for each one is clearly not ideal.
This is where the hierarchy comes in: you create one "base" UUID (of whatever type), and then use that as a namespace for naming your other namespaces! That way, you only have to publish the base UUID (or use an obvious one), and everyone can calculate the rest.
For instance, let's stay we wanted to create some UUIDs for StackOverflow; that has an obvious name within the DNS namespace, so the base is obvious:
uuid ns_dns = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';
uuid ns_base = uuidv5(ns_dns, 'stackoverflow.com');
StackOverflow itself has separate namespaces for users, questions, answers, comments, etc., but those are fairly obvious as well:
uuid ns_user = uuidv5( ns_base, 'user' );
uuid ns_question = uuidv5( ns_base, 'question' );
uuid ns_answer = uuidv5( ns_base, 'answer' );
uuid ns_comment = uuidv5( ns_base, 'comment' );
This particular question is #10867405, so its UUID would be:
uuid here = uuidv5(ns_question, '10867405');
Notice that there's nothing random in this process, so anyone who follows the same logic will get the same answer, yet the UUID namespace is so vast that it will (effectively, given the security of a 122-bit cryptographic hash) never collide with a UUID generated from any other namespace/name pair.