What are the key differences between Apache Thrift, Google Protocol Buffers, MessagePack, ASN.1 and Apache Avro?
ASN.1 is an ISO/ISE standard. It has a very readable source language and a variety of back-ends, both binary and human-readable. Being an international standard (and an old one at that!) the source language is a bit kitchen-sinkish (in about the same way that the Atlantic Ocean is a bit wet) but it is extremely well-specified and has decent amount of support. (You can probably find an ASN.1 library for any language you name if you dig hard enough, and if not there are good C language libraries available that you can use in FFIs.) It is, being a standardized language, obsessively documented and has a few good tutorials available as well.
Thrift is not a standard. It is originally from Facebook and was later open-sourced and is currently a top level Apache project. It is not well-documented -- especially tutorial levels -- and to my (admittedly brief) glance doesn't appear to add anything that other, previous efforts don't already do (and in some cases better). To be fair to it, it has a rather impressive number of languages it supports out of the box including a few of the higher-profile non-mainstream ones. The IDL is also vaguely C-like.
Protocol Buffers is not a standard. It is a Google product that is being released to the wider community. It is a bit limited in terms of languages supported out of the box (it only supports C++, Python and Java) but it does have a lot of third-party support for other languages (of highly variable quality). Google does pretty much all of their work using Protocol Buffers, so it is a battle-tested, battle-hardened protocol (albeit not as battle-hardened as ASN.1 is. It has much better documentation than does Thrift, but, being a Google product, it is highly likely to be unstable (in the sense of ever-changing, not in the sense of unreliable). The IDL is also C-like.
All of the above systems use a schema defined in some kind of IDL to generate code for a target language that is then used in encoding and decoding. Avro does not. Avro's typing is dynamic and its schema data is used at runtime directly both to encode and decode (which has some obvious costs in processing, but also some obvious benefits vis a vis dynamic languages and a lack of a need for tagging types, etc.). Its schema uses JSON which makes supporting Avro in a new language a bit easier to manage if there's already a JSON library. Again, as with most wheel-reinventing protocol description systems, Avro is also not standardized.
Personally, despite my love/hate relationship with it, I'd probably use ASN.1 for most RPC and message transmission purposes, although it doesn't really have an RPC stack (you'd have to make one, but IOCs make that simple enough).
We just did an internal study on serializers, here are some results (for my future reference too!)
Thrift = serialization + RPC stack
The biggest difference is that Thrift is not just a serialization protocol, it's a full blown RPC stack that's like a modern day SOAP stack. So after the serialization, the objects could (but not mandated) be sent between machines over TCP/IP. In SOAP, you started with a WSDL document that fully describes the available services (remote methods) and the expected arguments/objects. Those objects were sent via XML. In Thrift, the .thrift file fully describes the available methods, expected parameter objects and the objects are serialized via one of the available serializers (with Compact Protocol
, an efficient binary protocol, being most popular in production).
ASN.1 = Grand daddy
ASN.1 was designed by telecom folks in the 80s and is awkward to use due to limited library support as compared to recent serializers which emerged from CompSci folks. There are two variants, DER (binary) encoding and PEM (ascii) encoding. Both are fast, but DER is faster and more size efficient of the two. In fact ASN.1 DER can easily keep up (and sometimes beat) serializers that were designed 30 years after itself, a testament to it's well engineered design. It's very compact, smaller than Protocol Buffers and Thrift, only beaten by Avro. The issue is having great libraries to support and right now Bouncy Castle seems to be the best one for C#/Java. ASN.1 is king in security and crypto systems and isn't going to go away, so don't be worried about 'future proofing'. Just get a good library...
MessagePack = middle of the pack
It's not bad but it's neither the fastest, nor the smallest nor the best supported. No production reason to choose it.
Common
Beyond that, they are fairly similar. Most are variants of the basic TLV: Type-Length-Value
principle.
Protocol Buffers (Google originated), Avro (Apache based, used in Hadoop), Thrift (Facebook originated, now Apache project) and ASN.1 (Telecom originated) all involve some level of code generation where you first express your data in a serializer-specific format, then the serializer "compiler" will generate source code for your language via the code-gen
phase. Your app source then uses these code-gen
classes for IO. Note that certain implementations (eg: Microsoft's Avro library or Marc Gavel's ProtoBuf.NET) let you directly decorate your app level POCO/POJO objects and then the library directly uses those decorated classes instead of any code-gen's classes. We've seen this offer a boost performance since it eliminates a object copy stage (from application level POCO/POJO fields to code-gen fields).
Some results and a live project to play with
This project (https://github.com/sidshetye/SerializersCompare) compares important serializers in the C# world. The Java folks already have something similar.
1000 iterations per serializer, average times listed
Sorting result by size
Name Bytes Time (ms)
------------------------------------
Avro (cheating) 133 0.0142
Avro 133 0.0568
Avro MSFT 141 0.0051
Thrift (cheating) 148 0.0069
Thrift 148 0.1470
ProtoBuf 155 0.0077
MessagePack 230 0.0296
ServiceStackJSV 258 0.0159
Json.NET BSON 286 0.0381
ServiceStackJson 290 0.0164
Json.NET 290 0.0333
XmlSerializer 571 0.1025
Binary Formatter 748 0.0344
Options: (T)est, (R)esults, s(O)rt order, (S)erializer output, (D)eserializer output (in JSON form), (E)xit
Serialized via ASN.1 DER encoding to 148 bytes in 0.0674ms (hacked experiment!)
Adding to the performance perspective, Uber recently evaluated several of these libraries on their engineering blog:
https://eng.uber.com/trip-data-squeeze/
The winner for them? MessagePack + zlib for compression
Our goal was to find the combination of encoding protocol and compression algorithm with the most compact result at the highest speed. We tested encoding protocol and compression algorithm combinations on 2,219 pseudorandom anonymized trips from Uber New York City (put in a text file as JSON).
The lesson here is that your requirements drive which library is right for you. For Uber they couldn't use an IDL based protocol due to the schemaless nature of message passing they have. This eliminated a bunch of options. Also for them it's not only raw encoding/decoding time that comes into play, but the size of data at rest.
Size Results
Speed Results