Understanding LongWritable
Hadoop needs to be able to serialize data in and out of Java types via DataInput and DataOutput objects (usually IO streams). The Writable classes do this by implementing two methods: `write(DataOutput)` and `readFields(DataInput)`. Specifically, LongWritable is a Writable class that wraps a Java long.
Most of the time (especially when just starting out) you can mentally replace LongWritable with Long, i.e. it's just a number. If you get to defining your own datatypes, you will become very familiar with implementing the Writable interface, which looks something like:
public interface Writable {
    public void write(DataOutput out) throws IOException;
    public void readFields(DataInput in) throws IOException;
}
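To make the interface concrete, here is a minimal sketch of a custom datatype following that write/readFields pattern. It uses only java.io streams so it runs standalone; the class name LongPairWritable and its fields are invented for illustration, and the interface is redeclared locally (in a real job you would implement org.apache.hadoop.io.Writable instead):

```java
import java.io.*;

// Local copy of the interface shape shown above, so this sketch is
// self-contained; real code implements org.apache.hadoop.io.Writable.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical custom datatype: a pair of longs.
class LongPairWritable implements Writable {
    long first;
    long second;

    public void write(DataOutput out) throws IOException {
        out.writeLong(first);
        out.writeLong(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readLong();
        second = in.readLong();
    }
}

public class WritableDemo {
    public static void main(String[] args) throws IOException {
        LongPairWritable p = new LongPairWritable();
        p.first = 42L;
        p.second = 7L;

        // Serialize to a byte array through a DataOutput...
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));

        // ...and rebuild the object from a DataInput.
        LongPairWritable q = new LongPairWritable();
        q.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(bytes.size());                // 16: two raw longs
        System.out.println(q.first + "," + q.second);    // 42,7
    }
}
```

Note that the serialized form is exactly the raw field bytes, with no class metadata, which is the compactness the Writable scheme is after.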
The Apache documentation describes Writable as:

    A serializable object which implements a simple, efficient, serialization protocol, based on DataInput and DataOutput.

and LongWritable as "A WritableComparable for longs."
Need for Writables:
In Hadoop, interprocess communication is built on remote procedure calls (RPC). The RPC protocol uses serialization to render a message into a binary stream at the sender, and deserialization to reconstruct the original message from that binary stream at the receiver.
Java serialization has many disadvantages with respect to performance and efficiency: it is much slower than working with in-memory representations, it tends to significantly expand the size of the object, and it creates a lot of garbage.
Refer to these two posts:
dzone article
https://softwareengineering.stackexchange.com/questions/191269/java-serialization-advantages-and-disadvantages-use-or-avoid
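The size overhead is easy to see with standard-library streams alone (a quick sketch, not a benchmark): serializing one boxed Long through ObjectOutputStream drags along a stream header and a class descriptor, while a Writable-style write of the raw long costs exactly 8 bytes.

```java
import java.io.*;

public class SerializationOverhead {
    public static void main(String[] args) throws IOException {
        // Java serialization: stream header + class descriptor + value.
        ByteArrayOutputStream javaSer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaSer)) {
            oos.writeObject(Long.valueOf(123456789L));
        }

        // Writable-style serialization: just the 8 raw bytes of the long.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        new DataOutputStream(raw).writeLong(123456789L);

        System.out.println("Java serialization: " + javaSer.size() + " bytes");
        System.out.println("DataOutput:         " + raw.size() + " bytes");
    }
}
```

Multiply that per-object overhead by the billions of key/value pairs a job shuffles and the motivation for a leaner format is clear.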
For Hadoop to be effective, the serialization/deserialization process must be optimized, because a huge number of remote calls happen between the nodes in the cluster. So the serialization format should be fast, compact, extensible, and interoperable. For this reason, the Hadoop framework provides its own IO classes to replace the Java primitive data types: e.g. IntWritable for int, LongWritable for long, Text for String, etc.
You can find more details in "Hadoop: The Definitive Guide", fourth edition.
The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function.
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // parse the input line and emit (year, temperature) pairs
  }
}

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // emit the maximum temperature seen for this key
  }
}
For this code example, the input key is a long integer offset, the input value is a line of text, the output key is a text string (here, a year), and the output value is an integer (a temperature). Rather than using built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization; these are found in the org.apache.hadoop.io package.
Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).