Why is varint an efficient data representation?
It's a tradeoff that depends on the typical numbers you want to send.
- If they're typically small, use a "varint" encoding (int32, int64, uint32, uint64, sint32, sint64, bool, enum).
- If they're typically large, use a non-"varint" encoding (fixed32, sfixed32, float, fixed64, sfixed64, double).
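To see where the size difference comes from, here is a minimal sketch of base-128 varint encoding as used on the protobuf wire (illustrative only, not the library's actual code; `EncodeVarint` is a hypothetical helper name): each byte carries 7 bits of the value, and the high bit flags whether another byte follows, so values below 128 fit in a single byte.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of base-128 varint encoding (illustrative, not the real
// protobuf implementation). Each output byte holds 7 payload bits; the top
// bit is set on every byte except the last ("more bytes follow" flag).
std::vector<uint8_t> EncodeVarint(uint64_t value) {
  std::vector<uint8_t> out;
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>(value & 0x7F) | 0x80);
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));  // final byte, high bit clear
  return out;
}

// EncodeVarint(2).size()           == 1   (values < 128 need one byte)
// EncodeVarint(300).size()         == 2
// EncodeVarint(UINT64_MAX).size()  == 10  (worst case: 64 bits / 7 -> 10 bytes)
```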
Here is a comparison of the bytes used on the wire by sint64, int64 and fixed64 for various numbers:
[ RUN ] ProtobufOutputStream.printNumberOfBytesOnWireForInts
              number  sint64  int64  fixed64
                   0       2      2        9
                   1       2      2        9
                  -1       2     11        9
                  63       2      2        9
                 -63       2     11        9
                  64       3      2        9
                 -64       2     11        9
               10000       4      3        9
              -10000       4     11        9
 9223372036854775807      11     10        9
-9223372036854775808      11     11        9
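The gap between sint64 and int64 for negative values comes from ZigZag encoding: sint* fields map signed values to unsigned ones so that numbers close to zero (positive or negative) stay small, whereas plain int64 encodes a negative number as its full two's-complement bit pattern, which always takes 10 varint bytes. (The counts above presumably include the field's one-byte tag as well, which is why fixed64 shows 9 rather than 8.) A sketch of the ZigZag mapping:

```cpp
#include <cstdint>

// ZigZag mapping used by sint32/sint64 fields: interleaves positive and
// negative values so small magnitudes encode to small unsigned numbers.
//   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
uint64_t ZigZagEncode64(int64_t n) {
  // Arithmetic right shift propagates the sign bit; XOR folds negatives in.
  return (static_cast<uint64_t>(n) << 1) ^ static_cast<uint64_t>(n >> 63);
}

int64_t ZigZagDecode64(uint64_t n) {
  return static_cast<int64_t>(n >> 1) ^ -static_cast<int64_t>(n & 1);
}

// ZigZagEncode64(-1) == 1, so sint64 -1 is a single varint byte (plus tag),
// while int64 -1 is the varint of 0xFFFFFFFFFFFFFFFF: 10 bytes (plus tag).
```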
It's to save space/bandwidth. Many programming languages and protocols have fixed-size integer types, e.g. uint8_t, uint16_t, uint32_t, and these take up a fixed number of bytes regardless of how big the value is: if you store the value 2 in a uint32_t, it takes up 4 bytes.
With an encoding such as the varint used in protobuf, small values take up less space; the value 2 needs only 1 byte on the wire, while the encoding is still flexible enough not to restrict the range of values that can be represented.
This is a net win whenever small values are more common than big ones, which is usually the case.
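Decoding is the mirror image of the encoder sketched above: keep reading bytes until one has its high bit clear, accumulating 7 bits per byte (again just an illustrative sketch, not the library's actual code).

```cpp
#include <cstdint>
#include <vector>

// Illustrative varint decoder: read bytes while the high bit is set,
// accumulating 7 payload bits per byte, least-significant group first.
uint64_t DecodeVarint(const std::vector<uint8_t>& bytes) {
  uint64_t value = 0;
  int shift = 0;
  for (uint8_t b : bytes) {
    value |= static_cast<uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) break;  // high bit clear: last byte of this varint
    shift += 7;
  }
  return value;
}

// DecodeVarint({0x02})       == 2    -- one byte, vs. 4 bytes for a uint32_t
// DecodeVarint({0xAC, 0x02}) == 300
```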
In practice, the vast majority of integer values are small. Even in cases where you expect that a value will sometimes be very large, and thus you make it 32-bit or even 64-bit, chances are it will usually be small, because statistically speaking most physical quantities follow power-law distributions. So if the small values can be stored in fewer bytes, it's OK if the large values take an extra byte.
About the only kinds of integers that don't benefit are things like hashes or randomly-generated ID numbers, which don't actually represent a quantity but just a bit string. For these, you should use Protobuf's fixed32 or fixed64 types.
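To make that concrete: a uniformly random 64-bit hash almost always has its high bits set, so as a varint it takes 10 bytes, whereas fixed64 always takes exactly 8. A small self-contained sketch (the `VarintSize` helper is hypothetical, not a protobuf API):

```cpp
#include <cstdint>

// Number of bytes a value occupies as a base-128 varint (illustrative helper).
int VarintSize(uint64_t value) {
  int size = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++size;
  }
  return size;
}

// A random 64-bit hash almost always uses the high bits:
//   VarintSize(0x9E3779B97F4A7C15ULL) == 10   // worse than fixed64's 8 bytes
// while a typical small counter stays tiny:
//   VarintSize(42) == 1
```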
Note that varint encoding saves space on the wire but is actually relatively slow, because it requires a lot of branches to encode/decode. It's not as slow as text encoding, of course, but as binary formats go it's not so great. This is one reason why Cap'n Proto decided to reverse this decision and just put fixed-width ints on the wire. Cap'n Proto also includes an optional "packing" algorithm which compresses away zero-valued bytes, which produces similar message sizes to Protobuf but is generally faster because the algorithm uses less branching.
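As a rough illustration of the zero-byte compression idea (this is a simplified sketch of the core scheme only, ignoring Cap'n Proto's run-length special cases for all-zero and all-nonzero words): each 8-byte word is preceded by one tag byte whose bits mark which of the word's bytes are nonzero, and only those nonzero bytes are emitted.

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch of byte-level "packing": for each 8-byte word, emit a
// tag byte whose bit j says whether byte j of the word is nonzero, then
// emit only the nonzero bytes. Mostly-zero words shrink dramatically.
std::vector<uint8_t> PackWords(const std::vector<uint8_t>& data) {
  std::vector<uint8_t> out;
  for (size_t i = 0; i + 8 <= data.size(); i += 8) {
    uint8_t tag = 0;
    std::vector<uint8_t> nonzero;
    for (int j = 0; j < 8; ++j) {
      if (data[i + j] != 0) {
        tag |= static_cast<uint8_t>(1 << j);  // byte j is present
        nonzero.push_back(data[i + j]);
      }
    }
    out.push_back(tag);
    out.insert(out.end(), nonzero.begin(), nonzero.end());
  }
  return out;
}

// A mostly-zero word, e.g. the little-endian uint64 value 2
// (02 00 00 00 00 00 00 00), packs to just two bytes: tag 0x01, then 0x02.
```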
(Disclosure: I am the author of Cap'n Proto, and also of most of the Protobuf code released by Google.)