What is partition key in AWS Kinesis all about?
Partition keys only matter when you have multiple shards in a stream (but they're required always). Kinesis computes the MD5 hash of a partition key to decide what shard to store the record on (if you describe the stream you'll see the hash range as part of the shard decription).
So why does this matter?
Each shard can only accept 1,000 records and/or 1 MB per second (see PutRecord doc). If you write to a single shard faster than this rate you'll get a ProvisionedThroughputExceededException
.
With multiple shards, you scale this limit: 4 shards gives you 4,000 records and/or 4 MB per second. Of course, there are caveats.
The biggest is that you must use different partition keys. If all of your records use the same partition key then you're still writing to a single shard, because they'll all have the same hash value. How you solve this depends on your application: if you're writing from multiple processes then it might be sufficient to use the process ID, server's IP address, or hostname. If you're writing from a single process then you can either use information that's in the record (for example, a unique record ID) or generate a random string.
Second caveat is that the partition key counts against the total write size, and is stored in the stream. So while you could probably get good randomness by using some textual component in the record, you'd be wasting space. On the other hand, if you have some random textual component, you could calculate your own hash from it and then stringify that for the partition key.
Lastly, if you're using PutRecords (which you should, if you're writing a lot of data), individual records in the request may be rejected while others are accepted. This happens because those records went to a shard that was already at its write limits, and you have to re-send them (after a delay).
The other answer points out that records are ordered within a partition, and claims that this is the real reason for a partition key. However, this ordering reflects the order in which Kinesis accepted the records, which is not necessarily the order that the client intended.
- If the client is single-threaded and uses the PutRecord API, then yes, ordering should be consistent between client and partition.
- If the client is multi-threaded, then all of the standard distributed systems causes of disorder (internal thread scheduling, network routing, service scheduling) could cause inconsistent ordering.
- If the client uses the PutRecords API, individual records from a batch can be rejected and must be resent. The doc is very clear that this API call does not preserve ordering. And in a high-volume environment, this is the API that you will use.
In addition to inconsistent ordering when writing, a reshard operation introduces the potential for inconsistencies when reading. You must follow the chain from parent to child(ren), recognizing that there may be more or fewer children and that the split may not be even. A naive "one thread per shard" approach (such as used by Lambda) won't work.
So, bottom line: yes, shards provide ordering. However, relying on that order may introduce hard-to-diagnose bugs into your application.
In most cases, it won't matter. But if you need guaranteed order (such as when processing a transaction log), then you must add your own ordering information into your record when it's written, and ensure that records are properly ordered when read.
The accepted answer explains what are partition keys and and what they're used for in Kinesis (to decide to which shard to send the data to). Unfortunately, it does not explain why partition keys are needed in the first place.
In theory AWS could create a random partition key for each record which will result a near-perfect spread.
The real reason partitions are used is for "ordering/streaming". Kinesis maintains ordering (sequence number) for each shard.
In other words, by streaming X and afterwards Y to shard Z it is guaranteed, that X will be pulled from the stream before Y (when pulling records from all shards). On the other hand, by streaming X to shard Z1 and afterwards Y to shard Z2 there is no guarantee on the ordering (when pulling records from all shards). Y may definitely be pulled before X.
The shard "streaming" capability is useful in many cases.
(E.g. a video service streaming a movie to a user using the username and the movie name as the partition key).
(E.g. working on a stream of common events, and applying aggregation).
In cases where ordering (streaming) or grouping (e.g aggregation) is not required, generating a random partition key will suffice.