Is Snappy splittable or not splittable?

Both answers are correct, but at different levels.

According to the Cloudera blog post http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/:

One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can't be processed in parallel using MapReduce. This is different to LZO, where it is possible to index LZO compressed files to determine split points so that LZO files can be processed efficiently in subsequent processing.

This means that if a whole text file is compressed with Snappy, then the file is NOT splittable. But if each record inside the file is compressed with Snappy, then the file can be splittable, for example in Sequence Files with block compression.

To make this clearer, this:

<START-FILE>
  <START-SNAPPY-BLOCK>
     FULL CONTENT
  <END-SNAPPY-BLOCK>
<END-FILE>

is not the same as this:

<START-FILE>
  <START-SNAPPY-BLOCK1>
     RECORD1
  <END-SNAPPY-BLOCK1>
  <START-SNAPPY-BLOCK2>
     RECORD2
  <END-SNAPPY-BLOCK2>
  <START-SNAPPY-BLOCK3>
     RECORD3
  <END-SNAPPY-BLOCK3>
<END-FILE>

Snappy blocks are NOT splittable, but files made of Snappy blocks are splittable.
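The difference can be demonstrated with a small stand-in experiment. The sketch below uses DEFLATE from `java.util.zip` instead of Snappy, just to stay dependency-free; the framing argument is identical for any block codec. A file compressed as one stream must be read from byte 0, while independently compressed record blocks can each be decompressed on their own, so a reader that knows the block boundaries can start at any block:

```scala
import java.util.zip.{Deflater, Inflater}

// Compress a small byte array as one independent compressed block.
def compress(bytes: Array[Byte]): Array[Byte] = {
  val d = new Deflater()
  d.setInput(bytes); d.finish()
  val buf = new Array[Byte](1024)
  val n = d.deflate(buf)
  d.end()
  buf.take(n)
}

// Decompress one independent compressed block.
def decompress(bytes: Array[Byte]): Array[Byte] = {
  val inf = new Inflater()
  inf.setInput(bytes)
  val buf = new Array[Byte](1024)
  val n = inf.inflate(buf)
  inf.end()
  buf.take(n)
}

val records = Seq("RECORD1", "RECORD2", "RECORD3").map(_.getBytes("UTF-8"))

// Case 1: whole file = one compressed stream. A reader must start at
// byte 0, so there is only one possible split: not splittable.
val wholeFile = compress(records.flatten.toArray)

// Case 2: each record is its own compressed block. Any block can be
// decompressed without touching the others: splittable at block boundaries.
val blocks = records.map(compress)
val secondRecord = new String(decompress(blocks(1)), "UTF-8")
println(secondRecord) // RECORD2, recovered without reading block 1
```

This is exactly what a container format like Sequence Files does: it stores boundary/sync markers so readers can locate the independently compressed blocks.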


All splittable codecs in Hadoop must implement org.apache.hadoop.io.compress.SplittableCompressionCodec. Looking at the Hadoop source code as of 2.7, we see that org.apache.hadoop.io.compress.SnappyCodec does not implement this interface, so we know it is not splittable.


I have just tested this with Spark 1.6.2 on HDFS, with the same number of workers/processors, comparing a plain JSON file against the same data compressed with Snappy:

  • JSON: 4 files of 12 GB each, Spark creates 388 tasks (1 task per HDFS block) (4 × 12 GB / 128 MB => 384)
  • Snappy: 4 files of 3 GB each, Spark creates 4 tasks
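The JSON task count lines up with the HDFS block arithmetic (the 4 extra tasks beyond 384 presumably just mean the files are slightly larger than an exact 12 GB):

```scala
// Back-of-the-envelope check of the split count for the JSON case.
val files = 4
val fileSizeMB = 12 * 1024L                 // 12 GB per file, in MB
val hdfsBlockMB = 128L                      // default HDFS block size
val tasksPerFile = fileSizeMB / hdfsBlockMB // 96 blocks -> 96 tasks per file
val totalTasks = files * tasksPerFile
println(totalTasks) // 384
```

The Snappy files, by contrast, yield exactly one task per file, because each file is one unsplittable compressed stream.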

The Snappy files were created like this: .saveAsTextFile("/user/qwant/benchmark_file_format/json_snappy", classOf[org.apache.hadoop.io.compress.SnappyCodec])

So Snappy is not splittable with Spark for plain JSON text.

But if you use the Parquet (or ORC) file format instead of JSON, the files will be splittable (even with gzip), because compression is applied to internal blocks of the file rather than to the file as a whole.

Tags:

Hadoop

Snappy