Lookup performance of Numeric vs String
numeric(20)
(14 bytes) is bigger and slower than bigint
(8 bytes) in every respect. And varchar
or text
(same thing), occupy 24 bytes for 20 digits: slower, yet, than numeric
. Plus, character types are encumbered with COLLATION
rules unless explicitly defined without.
Test yourself:
SELECT pg_column_size((2^64 -1)::numeric) -- 14 bytes
, pg_column_size((2^64 -1)::numeric::text) -- 24 bytes
numeric
also disallows non-numbers from being stored out of the box (suits your needs).
Faced with your given options (values between 2^63 and 2^64, so too big to fit inside bigint
) I would choose numeric
and never look back.
Related:
- What is the overhead for varchar(n)?
Or you could install the extension pguint
by Peter Eisentraut, one of the core hackers of the Postgres project.
Be sure to read the readme first. The extension provides several additional (most of them unsigned) integer types. You might just extraxt uint8
(unsigned 64-bit integer) and ditch the rest to avoid over-crowding the type system.
Hate to be captain obvious on this one, but Instagram generously provides a function that you linked to that stores the keys as bigint
.
CREATE SCHEMA insta5;
CREATE SEQUENCE insta5.table_id_seq;
CREATE OR REPLACE FUNCTION insta5.next_id(OUT result bigint) AS $$
DECLARE
our_epoch bigint := 1314220021721;
seq_id bigint;
now_millis bigint;
shard_id int := 5;
BEGIN
-- The %1024, is just a way of saying they only want 10bit wraparound.
SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id << 10);
result := result | (seq_id);
END;
$$ LANGUAGE plpgsql;
They're actually using PostgreSQL. From that function you can see that they're returning a bigint
. So certainly you can store the result of that function in bigint
. As a special note, this is likely not the function that they're using. That function probably has a signature more like this,
insta5.next_id(smallint shard, OUT result bigint);
We know this because hardcoding a shard of 5
is not all that useful, and they seem to indicate that they're using this functionality. So in that blog id they brag that their ID compromises of
- 64bits total
- 64-23 = 41 bits for timestamp
- 64-41 = 23 bits for shard + sequence id
- 10 bits for the sequence id.
- 13 bits for the shard.
Quick test on their code,
test=# SELECT insta5.next_id();
next_id
---------------------
1671372309237077023
(1 row)
Decomposing the ID
Now let's play. For teh extra sexy, we can create helper functions that get the internal components from the ID. In the event you want to know the shard Instagram is using or their internal timestamp.
-- 13 bits for shard
CREATE FUNCTION insta5.get_shard(id bigint)
RETURNS smallint
AS $$
SELECT ((id<<41)>>51)::smallint;
$$ LANGUAGE sql;
-- 10 bits for sequence id
CREATE FUNCTION insta5.get_sequence(id bigint)
RETURNS smallint
AS $$
SELECT ((id<<54)>>54)::smallint;
$$ LANGUAGE sql;
-- 41 bits for timestamp
CREATE OR REPLACE FUNCTION insta5.get_ts(id bigint)
RETURNS timestamp without time zone
AS $$
SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$$ LANGUAGE sql;
Playing around, let's get a test id.
SELECT insta5.next_id();
next_id
---------------------
1671390786412876801
(1 row)
SELECT
insta5id,
insta5.get_ts(insta5id),
insta5.get_shard(insta5id),
insta5.get_sequence(insta5id)
FROM (VALUES
(1671390786412876801::bigint),
(insta5.next_id())
) AS t(insta5id);
Returns the following,
insta5id | get_ts | get_shard | get_sequence
---------------------+---------------------+-----------+--------------
1671390786412876801 | 2017-12-16 17:02:09 | 5 | 1
1671392537048257538 | 2017-12-16 17:05:38 | 5 | 2
(2 rows)
Rolling our own Instagram ID Domain
You can even create an explicit DOMAIN
over the type if you wish to really clean this up.. This is how I would personally store this, note I made a few further modifications.
- I added
COMMENTS
-- always good practice. - Made the functions
IMMUTABLE
- Added
insta5.next_id
require an explicit shard.
Let's drop what we had,
DROP SCHEMA insta5 CASCADE;
And start over,
CREATE SCHEMA insta5;
COMMENT ON SCHEMA insta5 IS 'Instagram';
CREATE DOMAIN insta5.id AS bigint;
COMMENT ON DOMAIN insta5.id IS $$Instagram's internal ID type, based on example from "Sharding & IDs at Instagram"$$;
CREATE SEQUENCE insta5.table_id_seq;
CREATE OR REPLACE FUNCTION insta5.next_id(shard_id smallint)
RETURNS insta5.id
AS $$
DECLARE
our_epoch bigint := 1314220021721;
seq_id bigint;
result insta5.id;
now_millis bigint;
BEGIN
SELECT nextval('insta5.table_id_seq') % 1024 INTO seq_id;
SELECT FLOOR(EXTRACT(EPOCH FROM clock_timestamp()) * 1000) INTO now_millis;
result := (now_millis - our_epoch) << 23;
result := result | (shard_id << 10);
result := result | (seq_id);
RETURN result;
END;
$$ LANGUAGE plpgsql;
COMMENT ON FUNCTION insta5.next_id(smallint)
IS 'Modifications made to require shard id';
CREATE OR REPLACE FUNCTION insta5.get_shard(id insta5.id)
RETURNS smallint
AS $$
SELECT ((id<<41)>>51)::smallint;
$$ LANGUAGE sql
IMMUTABLE;
COMMENT ON FUNCTION insta5.get_shard(insta5.id)
IS '13 bits from insta5.id representing shard';
CREATE OR REPLACE FUNCTION insta5.get_sequence(id insta5.id)
RETURNS smallint
AS $$
SELECT ((id<<54)>>54)::smallint;
$$ LANGUAGE sql
IMMUTABLE;
COMMENT ON FUNCTION insta5.get_sequence(insta5.id)
IS '10 bits from insta5.id representing sequence';
CREATE OR REPLACE FUNCTION insta5.get_ts(id insta5.id)
RETURNS timestamp without time zone
AS $$
SELECT to_timestamp(((id >> 23) + 1314220021721 ) / 1000 )::timestamp without time zone;
$$ LANGUAGE sql
IMMUTABLE;
COMMENT ON FUNCTION insta5.get_ts(insta5.id)
IS '41 bits from insta5.id representing timestamp';
Everything works as before, but now you can
CREATE SCHEMA mySchema;
CREATE TABLE mySchema.mydata ( insta5id insta5.id ) ;
This is likely the best solution you can obtain shy of a C implementation, and you probably don't want to generate an insta5id
ever. That's their job. ;)
As another important aside, you likely never want to do this. Don't follow by example. This is what the uuid
type is for, and you should be using it rather than hand rolling your own. Specifically, this is eerily similar to uuid_generate_v1()
in uuid-ossp
, which stores a MAC (shard), and timestamp
This function generates a version 1 UUID. This involves the MAC address of the computer and a time stamp. Note that UUIDs of this kind reveal the identity of the computer that created the identifier and the time at which it did so, which might make it unsuitable for certain security-sensitive applications.
The values are between 2^63 and 2^64, so (just) too big to fit inside a BIGINT.
The range of a bigint
is -9223372036854775808 to +9223372036854775807, which is -2^63 to 2^63-1 — or 2^64 distinct integers. The range of your identifiers is 2^63 distinct integers so they'll fit nicely in a bigint
as long as you don't mind an offset:
select round(power(2::numeric,63::numeric)) "2^63" ,round(power(2::numeric,64::numeric)) "2^64" ,(9223372036854775808::numeric-9223372036854775809::numeric)::bigint "offset low val" ,(18446744073709551616::numeric-9223372036854775809::numeric)::bigint "offset high val";
2^63 | 2^64 | offset low val | offset high val ------------------: | -------------------: | -------------: | ------------------: 9223372036854775808 | 18446744073709551616 | -1 | 9223372036854775807
dbfiddle here
My example uses an offset of -9223372036854775809 (-2^63+1), but you are free to chose any offset that wont overflow the bigint
.
So you'll need to use numeric
when presenting the keys, and when applying the offset, but not for actually storing the keys or operations like sorting.