Spark Standalone Mode: How to compress spark output written to HDFS
Another way to save gzipped files to HDFS or an Amazon S3 file system is to use the saveAsHadoopFile method.
This works when someMap is an RDD[(K, V)]; if you have someMap as an RDD[V], you can call someMap.map(line => (line, "")) first so the saveAsHadoopFile method applies.
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
someMap.saveAsHadoopFile(output_folder_path, classOf[String], classOf[String], classOf[MultipleTextOutputFormat[String, String]], classOf[GzipCodec])
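For context, here is a minimal end-to-end sketch of that approach; it assumes a SparkContext named sc and hypothetical HDFS paths, and shows the RDD[V]-to-pair conversion described above:

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// hypothetical input: an RDD[String] of lines
val lines = sc.textFile("hdfs://HOST:PORT/in")
// wrap each line in a (key, value) pair so the pair-RDD save method applies
val pairs = lines.map(line => (line, ""))
pairs.saveAsHadoopFile("hdfs://HOST:PORT/out",
  classOf[String], classOf[String],
  classOf[MultipleTextOutputFormat[String, String]],
  classOf[GzipCodec])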
The saveAsTextFile method takes an additional optional parameter specifying the codec class to use. For your example, using gzip it would look like this:
import org.apache.hadoop.io.compress.GzipCodec

someMap.saveAsTextFile("hdfs://HOST:PORT/out", classOf[GzipCodec])
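As a side note, the resulting part files get a .gz extension and can be read back transparently, since textFile chooses a decompression codec from the file extension:

// gzip part files are decompressed automatically when read back
val restored = sc.textFile("hdfs://HOST:PORT/out")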
UPDATE
Since you're using Spark 0.7.2, you might be able to port the compression settings via configuration options that you set at startup. I'm not sure this will work exactly, but you need to go from this:
conf.setCompressMapOutput(true)
conf.set("mapred.output.compress", "true")
conf.setMapOutputCompressorClass(c)
conf.set("mapred.output.compression.codec", c.getCanonicalName)
conf.set("mapred.output.compression.type", CompressionType.BLOCK.toString)
to something like this:
System.setProperty("spark.hadoop.mapred.output.compress", "true")
System.setProperty("spark.hadoop.mapred.output.compression.codec", "true")
System.setProperty("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.GzipCodec")
System.setProperty("spark.hadoop.mapred.output.compression.type", "BLOCK")
If you get it to work, posting your config would probably be helpful to others as well.