What is "Hadoop" - the definition of Hadoop?

I agree with your impression that the term "Hadoop" does not have a single, precise definition. "We have a Hadoop cluster" may mean various things.

There is an official answer though at http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F:

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

So "Hadoop" is the name of a project and a software library. Any other usage is ill-defined.


The most generally accepted understanding of Hadoop: HDFS and MapReduce, together with their related processes and tooling.
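To make the programming-model half of that concrete, here is the classic WordCount job, essentially as given in the official Apache Hadoop MapReduce tutorial. The input and output paths come from the command line; everything else is boilerplate the framework expects:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts the shuffle has grouped by word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The "simple programming model" is exactly this: you write a map and a reduce function, and the framework handles input splitting, shuffling, scheduling, and retries across the cluster.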

A related term is the Hadoop ecosystem: Hive, Pig, HBase, ZooKeeper, Oozie, and also vendor-specific tools such as Impala and Ambari.


In addition to the Apache Hadoop definition from the official website, I would like to highlight that Hadoop is a framework and that there are many subsystems in the Hadoop ecosystem.

I am quoting this content from the official website so that broken links in the future do not affect this answer.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop YARN: A framework for job scheduling and cluster resource management.

Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

More or less,

Hadoop = Distributed Storage (HDFS) + Distributed Processing (YARN + MapReduce)
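To make the storage half of that equation concrete, here is a minimal sketch against the HDFS Java client API (`org.apache.hadoop.fs.FileSystem`). The namenode host/port and the file path are placeholders; in a real deployment `fs.defaultFS` would come from `core-site.xml`:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder namenode address; normally configured in core-site.xml.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");
    FileSystem fs = FileSystem.get(conf);

    // Write a file; HDFS splits it into blocks and replicates them
    // across datanodes for fault tolerance.
    Path file = new Path("/tmp/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
    fs.delete(file, false);
  }
}
```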

But these four modules do not cover the complete Hadoop ecosystem. There are many Hadoop-related projects and 40+ subsystems in the Hadoop ecosystem.

Other Hadoop-related projects at Apache include:

Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop.

Avro™: A data serialization system.

Cassandra™: A scalable multi-master database with no single points of failure.

Chukwa™: A data collection system for managing large distributed systems.

HBase™: A scalable, distributed database that supports structured data storage for large tables.

Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout™: A scalable machine learning and data mining library.

Pig™: A high-level data-flow language and execution framework for parallel computation.

Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

ZooKeeper™: A high-performance coordination service for distributed applications.

Coming back to your question:

Just have a look at the 40+ subsystems in the Hadoop ecosystem. Not everything you have quoted is necessarily Hadoop itself, but most of it is related to Hadoop.

Spark is part of the Hadoop ecosystem, but it requires neither HDFS nor YARN. Data can be loaded into RDDs (resilient distributed datasets) from sources other than HDFS, and Spark can run in standalone mode without YARN.
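A minimal sketch of that point using Spark's Java API: with the `local[*]` master, everything runs inside a single JVM, with no YARN resource manager and no HDFS anywhere in the picture (the class name and data here are made up for illustration):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkWithoutHadoop {
  public static void main(String[] args) {
    // "local[*]" runs Spark in-process on all cores: no YARN, no HDFS.
    SparkConf conf = new SparkConf()
        .setAppName("spark-without-hadoop")
        .setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // An RDD built from an in-memory collection, not from HDFS.
      JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
      long evens = numbers.filter(n -> n % 2 == 0).count();
      System.out.println("even numbers: " + evens); // prints 2
    }
  }
}
```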

Have a look at this article and this article for a Hadoop vs. Spark comparison.

Use cases where Spark is preferred over Hadoop MapReduce:

  1. Iterative Algorithms in Machine Learning (see the sketch after this list)
  2. Interactive Data Mining and Data Processing
  3. Stream processing
  4. Sensor data processing
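As a rough sketch of the first use case: `cache()` keeps an RDD in memory across iterations, whereas a chain of MapReduce jobs would write to and re-read from HDFS between passes. The update rule below is a dummy stand-in, not a real learning algorithm:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("iterative-sketch")
        .setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<Double> points = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0));
      // cache() keeps the data in memory across iterations, instead of
      // re-reading it from disk on every pass as chained MapReduce jobs would.
      points.cache();

      double estimate = 0.0;
      for (int i = 0; i < 10; i++) {
        // Each pass reuses the cached RDD; this dummy averaging step
        // stands in for a real gradient update.
        double mean = points.reduce(Double::sum) / points.count();
        estimate += (mean - estimate) / 2;
      }
      System.out.println("estimate: " + estimate);
    }
  }
}
```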

Since Spark doesn't have its own storage system, it has to depend on a distributed storage system, and HDFS is one of the common choices.
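A sketch of that dependence: the processing code is storage-agnostic, and only the input URI changes between backends. The `namenode` host, port, and paths below are placeholders; the `hdfs://` read naturally requires a reachable cluster plus the Hadoop client libraries on the classpath:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class StorageAgnosticSpark {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("storage-agnostic")
        .setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Only the URI scheme changes; the processing code stays the same.
      long localLines = sc.textFile("file:///tmp/input.txt").count();
      long hdfsLines = sc.textFile("hdfs://namenode:8020/data/input.txt").count();
      System.out.println(localLines + " local lines, " + hdfsLines + " HDFS lines");
    }
  }
}
```

Other backends (for example S3 via the `s3a://` scheme) plug in the same way once their connector libraries and credentials are available.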

Have a look at this related SE question:

Can apache spark run without hadoop?