Understanding LongWritable
Hadoop needs to be able to serialize data in and out of Java types via DataInput and DataOutput objects (usually IO streams). The Writable classes do this by implementing two methods: `write(DataOutput)` and `readFields(DataInput)`. Specifically, LongWritable is a Writable class that wraps a Java long.
Most of the time (especially when just starting out) you can mentally replace LongWritable with Long, i.e. it's just a number. If you get to defining your own datatypes, you will become very familiar with implementing the Writable interface, which looks something like:
public interface Writable {
    public void write(DataOutput out) throws IOException;
    public void readFields(DataInput in) throws IOException;
}
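To make the interface concrete, here is a minimal sketch of a custom datatype following that write/readFields pattern. It uses only java.io streams so it runs standalone; the class name LongPairWritable and its fields are invented for illustration, and the interface is redeclared locally (in a real job you would implement org.apache.hadoop.io.Writable instead):

```java
import java.io.*;

// Local copy of the interface shape shown above, so this sketch is
// self-contained; real code implements org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical custom datatype: a pair of longs.
class LongPairWritable implements Writable {
    long first;
    long second;

    public void write(DataOutput out) throws IOException {
        out.writeLong(first);
        out.writeLong(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readLong();
        second = in.readLong();
    }
}

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        LongPairWritable p = new LongPairWritable();
        p.first = 42L;
        p.second = 7L;

        // Serialize to a byte array through a DataOutput...
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));

        // ...and rebuild the object from a DataInput.
        LongPairWritable q = new LongPairWritable();
        q.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(bytes.size());                // 16: two raw longs
        System.out.println(q.first + "," + q.second);    // 42,7
    }
}
```

Note that the serialized form is exactly the raw field bytes, with no class metadata, which is the compactness the Writable scheme is after.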
The Apache documentation describes Writable as:

    A serializable object which implements a simple, efficient, serialization protocol, based on DataInput and DataOutput.

and LongWritable as "A WritableComparable for longs."
Need for Writables:
In Hadoop, interprocess communication is built on remote procedure calls (RPC). The RPC protocol uses serialization to render a message into a binary stream at the sender, and deserialization to reconstruct the original message from that binary stream at the receiver.
Java serialization has many disadvantages with respect to performance and efficiency: it is much slower than working with in-memory representations, it tends to significantly expand the size of the object, and it creates a lot of garbage.
Refer to these two posts:
dzone article
https://softwareengineering.stackexchange.com/questions/191269/java-serialization-advantages-and-disadvantages-use-or-avoid
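The size overhead is easy to see with standard-library streams alone (a quick sketch, not a benchmark): serializing one boxed Long through ObjectOutputStream drags along a stream header and a class descriptor, while a Writable-style write of the raw long costs exactly 8 bytes.

```java
import java.io.*;

public class SerializationOverhead {
    public static void main(String[] args) throws IOException {
        // Java serialization: stream header + class descriptor + value.
        ByteArrayOutputStream javaSer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaSer)) {
            oos.writeObject(Long.valueOf(123456789L));
        }

        // Writable-style serialization: just the 8 raw bytes of the long.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        new DataOutputStream(raw).writeLong(123456789L);

        System.out.println("Java serialization: " + javaSer.size() + " bytes");
        System.out.println("DataOutput:         " + raw.size() + " bytes");
    }
}
```

Multiply that per-object overhead by the billions of key/value pairs a job shuffles and the motivation for a leaner format is clear.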
For Hadoop to be effective, the serialization/deserialization process must be optimized, because a huge number of remote calls happen between the nodes in the cluster. So the serialization format should be fast, compact, extensible, and interoperable. For this reason, the Hadoop framework provides its own IO classes to replace the Java primitive data types: e.g. IntWritable for int, LongWritable for long, Text for String, etc.
You can find more details in "Hadoop: The Definitive Guide", fourth edition.
The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // parse the input line and emit (year, temperature) pairs
  }
}

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // emit the maximum temperature seen for this key
  }
}
For this code example, the input key is a long integer offset, the input value is a line of text, the output key is a text string (here, a year), and the output value is an integer (a temperature). Rather than using built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization; these are found in the org.apache.hadoop.io package.
Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).