Difference between partition key, composite key and clustering key in Cassandra?
There is a lot of confusion around this, so I will try to make it as simple as possible.
The primary key is a general concept to indicate one or more columns used to retrieve data from a Table.
The primary key may be SIMPLE and even declared inline:
create table stackoverflow_simple (
key text PRIMARY KEY,
data text
);
That means that it is made up of a single column.
But the primary key can also be COMPOSITE (aka COMPOUND), made up of multiple columns:
create table stackoverflow_composite (
key_part_one text,
key_part_two int,
data text,
PRIMARY KEY(key_part_one, key_part_two)
);
With a COMPOSITE primary key, the "first part" of the key is called the PARTITION KEY (in this example key_part_one is the partition key) and the second part of the key is the CLUSTERING KEY (in this example key_part_two).
Please note that both the partition key and the clustering key can be made up of multiple columns. Here's how:
create table stackoverflow_multiple (
k_part_one text,
k_part_two int,
k_clust_one text,
k_clust_two int,
k_clust_three uuid,
data text,
PRIMARY KEY((k_part_one, k_part_two), k_clust_one, k_clust_two, k_clust_three)
);
Behind these names ...
- The Partition Key is responsible for data distribution across your nodes.
- The Clustering Key is responsible for data sorting within the partition.
- The Primary Key is equivalent to the Partition Key in a single-field-key table (i.e. Simple).
- The Composite/Compound Key is just any multiple-column key.
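To make the two roles concrete, here is a minimal Python sketch of the idea, not of how Cassandra is actually implemented. The node names, the `ToyTable` class and the `node_for` helper are all hypothetical, and an MD5 hash with a modulo stands in for Cassandra's real partitioner and token ring.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key: tuple) -> str:
    """Pick a node from a hash of the partition key.

    Simplification: real Cassandra maps tokens onto a ring; this toy
    version just takes a stable hash modulo the number of nodes.
    """
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


class ToyTable:
    """In-memory model: partition key -> {clustering key -> data}."""

    def __init__(self):
        self.partitions = defaultdict(dict)

    def insert(self, partition_key: tuple, clustering_key: tuple, data):
        # Writing the same full primary key again replaces the old row.
        self.partitions[partition_key][clustering_key] = data

    def select(self, partition_key: tuple):
        """Return every row of one partition, ordered by clustering key."""
        rows = self.partitions.get(partition_key, {})
        return [(partition_key, ck, rows[ck]) for ck in sorted(rows)]


table = ToyTable()
table.insert(("ronaldo",), (10,), "ex-football player")
table.insert(("ronaldo",), (9,), "football player")

# Both rows share a partition key, so they live together and come back
# sorted by the clustering key (9 before 10), whatever the insert order.
for row in table.select(("ronaldo",)):
    print(node_for(row[0]), row)
```

The point of the sketch is only that the partition key decides where a row lives, while the clustering key decides the order of rows inside that partition.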
Further usage information: DATASTAX DOCUMENTATION
Small usage and content examples
SIMPLE KEY:
insert into stackoverflow_simple (key, data) VALUES ('han', 'solo');
select * from stackoverflow_simple where key='han';
table content
key | data
----+------
han | solo
A COMPOSITE/COMPOUND KEY can retrieve "wide rows" (i.e. you can query by just the partition key, even if you have clustering keys defined):
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 9, 'football player');
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 10, 'ex-football player');
select * from stackoverflow_composite where key_part_one = 'ronaldo';
table content
key_part_one | key_part_two | data
--------------+--------------+--------------------
ronaldo | 9 | football player
ronaldo | 10 | ex-football player
But you can also query with all keys (both partition and clustering):
select * from stackoverflow_composite
where key_part_one = 'ronaldo' and key_part_two = 10;
query output
key_part_one | key_part_two | data
--------------+--------------+--------------------
ronaldo | 10 | ex-football player
Important note: the partition key is the minimum specifier needed to perform a query using a WHERE clause.
If you have a composite partition key, like the following
e.g. PRIMARY KEY((col1, col2), col10, col4)
you can query only by passing at least both col1 and col2, the two columns that define the partition key. The general rule is that a query must restrict all partition key columns, and may then optionally restrict the clustering key columns in the order they are declared.
So the valid queries are (excluding secondary indexes):
- col1 and col2
- col1 and col2 and col10
- col1 and col2 and col10 and col4
Invalid:
- col1 and col2 and col4
- anything that does not contain both col1 and col2
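The rule above can be sketched as a small check. This is an illustrative Python model, not anything Cassandra exposes: the function name `is_valid_where` is hypothetical, it only considers which columns are restricted (not the operators used), and like the lists above it ignores secondary indexes and ALLOW FILTERING.

```python
def is_valid_where(partition_key, clustering_columns, restricted):
    """Return True if restricting exactly `restricted` columns follows the rule.

    partition_key      -- partition key columns, in declaration order
    clustering_columns -- clustering columns, in declaration order
    restricted         -- columns that appear in the WHERE clause
    """
    restricted = set(restricted)

    # Only primary key columns are considered in this sketch.
    if not restricted <= set(partition_key) | set(clustering_columns):
        return False

    # Every partition key column must be present.
    if not set(partition_key) <= restricted:
        return False

    # Clustering columns must be used as a prefix: once one is skipped,
    # no later clustering column may be restricted.
    skipped = False
    for column in clustering_columns:
        if column in restricted:
            if skipped:
                return False
        else:
            skipped = True
    return True


# PRIMARY KEY((col1, col2), col10, col4)
pk = ("col1", "col2")
ck = ("col10", "col4")

print(is_valid_where(pk, ck, ["col1", "col2"]))                   # True
print(is_valid_where(pk, ck, ["col1", "col2", "col10"]))          # True
print(is_valid_where(pk, ck, ["col1", "col2", "col10", "col4"]))  # True
print(is_valid_where(pk, ck, ["col1", "col2", "col4"]))           # False
print(is_valid_where(pk, ck, ["col1"]))                           # False
```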
In Cassandra, the difference between primary key, partition key, composite key and clustering key always causes some confusion, so I am going to explain them below and relate them to each other. We use CQL (Cassandra Query Language) to access a Cassandra database. Note: this answer applies to recent versions of Cassandra.
Primary Key: in Cassandra there are two different ways to declare a primary key:
CREATE TABLE Cass (
id int PRIMARY KEY,
name text
);
Create Table Cass (
id int,
name text,
PRIMARY KEY(id)
);
In CQL, the order in which columns are defined for the PRIMARY KEY matters. The first column of the key is called the partition key. It has the property that all rows sharing the same partition key (even across tables, in fact) are stored on the same physical node. Also, insertions, updates and deletions on rows sharing the same partition key for a given table are performed atomically and in isolation. Note that it is possible to have a composite partition key, i.e. a partition key formed of multiple columns, using an extra set of parentheses to define which columns form the partition key.
Partitioning and Clustering
The PRIMARY KEY definition is made up of two parts: the Partition Key and the Clustering Columns. The first part maps to the storage engine row key, while the second is used to group columns in a row.
CREATE TABLE device_check (
device_id int,
checked_at timestamp,
is_power boolean,
is_locked boolean,
PRIMARY KEY (device_id, checked_at)
);
Here device_id is the partition key and checked_at is the clustering key.
A table can have multiple clustering key columns as well as multiple partition key columns, depending on how the PRIMARY KEY is declared.
Adding a summary answer as the accepted one is quite long. The terms "row" and "column" are used in the context of CQL, not how Cassandra is actually implemented.
- A primary key uniquely identifies a row.
- A composite key is a key formed from multiple columns.
- A partition key is the primary lookup to find a set of rows, i.e. a partition.
- A clustering key is the part of the primary key that isn't the partition key (and defines the ordering within a partition).
Examples:
- PRIMARY KEY (a): The partition key is a.
- PRIMARY KEY (a, b): The partition key is a, the clustering key is b.
- PRIMARY KEY ((a, b)): The composite partition key is (a, b).
- PRIMARY KEY (a, b, c): The partition key is a, the composite clustering key is (b, c).
- PRIMARY KEY ((a, b), c): The composite partition key is (a, b), the clustering key is c.
- PRIMARY KEY ((a, b), c, d): The composite partition key is (a, b), the composite clustering key is (c, d).
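The reading rule behind these examples can be written down as a short Python sketch. The helper name `split_primary_key` is hypothetical, and it only handles plain column names in the simple shapes listed above, not the full CQL grammar (no quoted identifiers, no clustering order options).

```python
import re


def split_primary_key(clause: str):
    """Split a 'PRIMARY KEY (...)' clause into (partition key, clustering key).

    If the first element is wrapped in its own parentheses, those columns
    form a composite partition key; otherwise the first column alone is the
    partition key. Everything after it is the clustering key.
    """
    match = re.fullmatch(
        r"\s*PRIMARY\s+KEY\s*\((.*)\)\s*", clause, flags=re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError(f"not a PRIMARY KEY clause: {clause!r}")

    inner = match.group(1).strip()
    if inner.startswith("("):
        # Composite partition key: take everything up to the first ')'.
        close = inner.index(")")
        partition = [c.strip() for c in inner[1:close].split(",")]
        rest = inner[close + 1:]
    else:
        # Single-column partition key: the first column only.
        first, _, rest = inner.partition(",")
        partition = [first.strip()]

    clustering = [c.strip() for c in rest.split(",") if c.strip()]
    return partition, clustering


for clause in [
    "PRIMARY KEY (a)",
    "PRIMARY KEY (a, b)",
    "PRIMARY KEY ((a, b))",
    "PRIMARY KEY (a, b, c)",
    "PRIMARY KEY ((a, b), c)",
    "PRIMARY KEY ((a, b), c, d)",
]:
    print(clause, "->", split_primary_key(clause))
```

For instance, the last clause yields `(['a', 'b'], ['c', 'd'])`: a composite partition key and a composite clustering key.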