Why is varint an efficient data representation?
It's a tradeoff that depends on the typical numbers you want to send.
- If they're typically small, use a "varint" encoding (int32, int64, uint32, uint64, sint32, sint64, bool, enum).
- If they're typically large, use a non-"varint" encoding (fixed32, sfixed32, float, fixed64, sfixed64, double).
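To see where the size difference comes from, here is a minimal sketch of base-128 varint encoding as used on the protobuf wire (illustrative only, not the library's actual code; `EncodeVarint` is a hypothetical helper name): each byte carries 7 bits of the value, and the high bit flags whether another byte follows, so values below 128 fit in a single byte.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of base-128 varint encoding (illustrative, not the real
// protobuf implementation). Each output byte holds 7 payload bits; the top
// bit is set on every byte except the last ("more bytes follow" flag).
std::vector<uint8_t> EncodeVarint(uint64_t value) {
  std::vector<uint8_t> out;
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>(value & 0x7F) | 0x80);
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));  // final byte, high bit clear
  return out;
}

// EncodeVarint(2).size()           == 1   (values < 128 need one byte)
// EncodeVarint(300).size()         == 2
// EncodeVarint(UINT64_MAX).size()  == 10  (worst case: 64 bits / 7 -> 10 bytes)
```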
Here is a comparison of the bytes used on the wire by sint64, int64 and fixed64 for various numbers:
[ RUN ] ProtobufOutputStream.printNumberOfBytesOnWireForInts
              number  sint64  int64  fixed64
                   0       2      2        9
                   1       2      2        9
                  -1       2     11        9
                  63       2      2        9
                 -63       2     11        9
                  64       3      2        9
                 -64       2     11        9
               10000       4      3        9
              -10000       4     11        9
 9223372036854775807      11     10        9
-9223372036854775808      11     11        9
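The gap between sint64 and int64 for negative values comes from ZigZag encoding: sint* fields map signed values to unsigned ones so that numbers close to zero (positive or negative) stay small, whereas plain int64 encodes a negative number as its full two's-complement bit pattern, which always takes 10 varint bytes. (The counts above presumably include the field's one-byte tag as well, which is why fixed64 shows 9 rather than 8.) A sketch of the ZigZag mapping:

```cpp
#include <cstdint>

// ZigZag mapping used by sint32/sint64 fields: interleaves positive and
// negative values so small magnitudes encode to small unsigned numbers.
//   0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
uint64_t ZigZagEncode64(int64_t n) {
  // Arithmetic right shift propagates the sign bit; XOR folds negatives in.
  return (static_cast<uint64_t>(n) << 1) ^ static_cast<uint64_t>(n >> 63);
}

int64_t ZigZagDecode64(uint64_t n) {
  return static_cast<int64_t>(n >> 1) ^ -static_cast<int64_t>(n & 1);
}

// ZigZagEncode64(-1) == 1, so sint64 -1 is a single varint byte (plus tag),
// while int64 -1 is the varint of 0xFFFFFFFFFFFFFFFF: 10 bytes (plus tag).
```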
It's to save space/bandwidth. Many programming languages and protocols have fixed-size integer types, e.g. uint8_t, uint16_t, uint32_t, and these take up a fixed number of bytes regardless of how big the value is: if you store the value 2 in a uint32_t, it takes up 4 bytes.
With an encoding such as the varint used in protobuf, small values take up less space; the value 2 needs only 1 byte on the wire, while the encoding is still flexible enough not to restrict the range of values that can be represented.
This is a net win whenever small values are more common than big ones, which is usually the case.
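Decoding is the mirror image of the encoder sketched above: keep reading bytes until one has its high bit clear, accumulating 7 bits per byte (again just an illustrative sketch, not the library's actual code).

```cpp
#include <cstdint>
#include <vector>

// Illustrative varint decoder: read bytes while the high bit is set,
// accumulating 7 payload bits per byte, least-significant group first.
uint64_t DecodeVarint(const std::vector<uint8_t>& bytes) {
  uint64_t value = 0;
  int shift = 0;
  for (uint8_t b : bytes) {
    value |= static_cast<uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) break;  // high bit clear: last byte of this varint
    shift += 7;
  }
  return value;
}

// DecodeVarint({0x02})       == 2    -- one byte, vs. 4 bytes for a uint32_t
// DecodeVarint({0xAC, 0x02}) == 300
```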
In practice, the vast majority of integer values are small. Even in cases where you expect that a value will sometimes be very large, and thus you make it 32-bit or even 64-bit, chances are it will usually be small, because statistically speaking most physical quantities follow power-law distributions. So if the small values can be stored in fewer bytes, it's OK if the large values take an extra byte.
About the only kinds of integers that don't benefit are things like hashes or randomly-generated ID numbers, which don't actually represent a quantity but just a bit string. For these, you should use Protobuf's fixed32 or fixed64 types.
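To make that concrete: a uniformly random 64-bit hash almost always has its high bits set, so as a varint it takes 10 bytes, whereas fixed64 always takes exactly 8. A small self-contained sketch (the `VarintSize` helper is hypothetical, not a protobuf API):

```cpp
#include <cstdint>

// Number of bytes a value occupies as a base-128 varint (illustrative helper).
int VarintSize(uint64_t value) {
  int size = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++size;
  }
  return size;
}

// A random 64-bit hash almost always uses the high bits:
//   VarintSize(0x9E3779B97F4A7C15ULL) == 10   // worse than fixed64's 8 bytes
// while a typical small counter stays tiny:
//   VarintSize(42) == 1
```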
Note that varint encoding saves space on the wire but is actually relatively slow, because it requires a lot of branches to encode/decode. It's not as slow as text encoding, of course, but as binary formats go it's not so great. This is one reason why Cap'n Proto decided to reverse this decision and just put fixed-width ints on the wire. Cap'n Proto also includes an optional "packing" algorithm which compresses away zero-valued bytes, which produces similar message sizes to Protobuf but is generally faster because the algorithm uses less branching.
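As a rough illustration of the zero-byte compression idea (this is a simplified sketch of the core scheme only, ignoring Cap'n Proto's run-length special cases for all-zero and all-nonzero words): each 8-byte word is preceded by one tag byte whose bits mark which of the word's bytes are nonzero, and only those nonzero bytes are emitted.

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch of byte-level "packing": for each 8-byte word, emit a
// tag byte whose bit j says whether byte j of the word is nonzero, then
// emit only the nonzero bytes. Mostly-zero words shrink dramatically.
std::vector<uint8_t> PackWords(const std::vector<uint8_t>& data) {
  std::vector<uint8_t> out;
  for (size_t i = 0; i + 8 <= data.size(); i += 8) {
    uint8_t tag = 0;
    std::vector<uint8_t> nonzero;
    for (int j = 0; j < 8; ++j) {
      if (data[i + j] != 0) {
        tag |= static_cast<uint8_t>(1 << j);  // byte j is present
        nonzero.push_back(data[i + j]);
      }
    }
    out.push_back(tag);
    out.insert(out.end(), nonzero.begin(), nonzero.end());
  }
  return out;
}

// A mostly-zero word, e.g. the little-endian uint64 value 2
// (02 00 00 00 00 00 00 00), packs to just two bytes: tag 0x01, then 0x02.
```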
(Disclosure: I am the author of Cap'n Proto, and also of most of the Protobuf code released by Google.)