Hadoop and HBase

There's little to add to what has already been said. Hadoop is a distributed filesystem (HDFS) plus MapReduce (a framework for distributed computing). HBase is a key-value data store built on top of Hadoop (meaning on top of HDFS).

The main reason to use HBase instead of plain Hadoop is to get random reads and writes. With plain Hadoop you have to read the whole dataset whenever you want to run a MapReduce job.
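To make the contrast concrete: HBase behaves roughly like a huge sorted map keyed by row, so a single row can be fetched directly instead of scanning everything. A minimal in-memory sketch of that idea (a plain Java `TreeMap`, not the actual HBase client API):

```java
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: HBase acts like a distributed, sorted map
// from row key to value. A random read is a direct keyed lookup,
// while a MapReduce job over raw HDFS files must scan every record.
public class SortedMapSketch {
    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("row-001", "alice");
        table.put("row-002", "bob");
        table.put("row-003", "carol");

        // Random read: direct lookup by row key, no full scan needed.
        String value = table.get("row-002");
        System.out.println(value); // prints "bob"

        // Full scan: what plain MapReduce over HDFS amounts to.
        for (Map.Entry<String, String> e : table.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The table name and row keys here are made up for illustration; the point is only that a sorted, keyed store supports cheap point lookups that a flat file on HDFS does not.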

I also find it useful to import data into HBase when I'm working with thousands of small files.

I recommend this talk by Todd Lipcon (Cloudera): "Apache HBase: an introduction" http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction


Hadoop is a platform that lets us store and process large volumes of data across clusters of machines in parallel. It is a batch processing system, so we don't have to worry about the internals of data storage or processing.

It provides not only HDFS, a distributed file system for reliable data storage, but also a processing framework, MapReduce, that allows huge data sets to be processed in parallel across clusters of machines.

One of the biggest advantages of Hadoop is data locality. Moving data that huge is costly, so Hadoop instead moves the computation to the data. Both HDFS and MapReduce are highly optimized to work with really large data sets.

HDFS assures high availability and failover through data replication, so that if any one machine in your cluster goes down because of some catastrophe, your data is still safe and available.
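The replication behind this guarantee is configurable. As a sketch, the `dfs.replication` property in `hdfs-site.xml` controls how many copies of each block HDFS keeps (3 is the default):

```xml
<!-- hdfs-site.xml: keep three copies of every block (the default) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

With three replicas, HDFS can lose two of the machines holding a block and still serve the data.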

HBase, on the other hand, is a NoSQL database. We can think of it as a distributed, scalable, big data store. It is used to overcome a shortcoming of HDFS: the inability to do random reads and writes.

HBase is a suitable choice if we need random, realtime read/write access to our data. It was modeled after Google's BigTable, while HDFS was modeled after GFS (the Google File System).

It is not necessary to run HBase on top of HDFS only; we can use HBase with another persistent store like S3 or EBS. If you want to know about Hadoop and HBase in detail, you can visit the respective home pages: hadoop.apache.org and hbase.apache.org.
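For example, pointing HBase at S3 instead of HDFS comes down to the `hbase.rootdir` property in `hbase-site.xml`; the bucket name below is a made-up placeholder:

```xml
<!-- hbase-site.xml: store HBase data in an S3 bucket via the s3a connector -->
<property>
  <name>hbase.rootdir</name>
  <value>s3a://my-example-bucket/hbase</value>
</property>
```

This is only a sketch of the relevant setting; a real S3-backed deployment also needs the s3a connector and credentials configured.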

You can also go through the following books if you want to learn in depth: "Hadoop: The Definitive Guide" and "HBase: The Definitive Guide".


The Hadoop distributed file system, HDFS, does several jobs for us. We can't really say Hadoop is only a file system: it also provides the resources for distributed processing, through a master-slave architecture with which we can easily manage our data.

As for HBase: in standalone mode it only uses the local file system on a single machine, so if you want a distributed, clustered HBase deployment you need to run it on top of HDFS (or another distributed store).

I think you should see this link for a good intro to Hadoop!
