Hadoop on Kubernetes vs Standard Hadoop

Hadoop: Hadoop provides HDFS, a distributed file system in which a cluster of storage resources is presented to the application stack as a single file system. The HDFS APIs are used to access data sets too large to fit on a single hard disk. Hadoop also manages data reliability through replication, so applications don't have to worry about storage-stack semantics.
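To make that API concrete, here is a minimal sketch in Java against Hadoop's standard org.apache.hadoop.fs.FileSystem interface. The NameNode address (hdfs://namenode:8020) and the file path are placeholders, not values from this thread:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; "namenode:8020" is a placeholder address.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.txt");

            // Write: HDFS splits the file into blocks and replicates each block
            // across DataNodes, so the application sees one logical file system.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("hello hdfs");
            }

            // Replication is a per-file property the client can adjust.
            fs.setReplication(file, (short) 3);

            // Read the file back through the same single-filesystem view.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}
```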

In summary - Hadoop turns many hard disks into a single logical volume, which enables very large data storage. Hadoop uses YARN (Yet Another Resource Negotiator) as a cluster-wide compute scheduler. The Hadoop ecosystem is very large and includes Spark, ZooKeeper, HBase, Hive, and many other solutions for big data, analytics, and machine learning.

Kubernetes: Kubernetes is a container orchestration platform. Containers are replacing VMs because VMs rely on hypervisors and a full guest OS to isolate workloads within a single host OS environment. That means VMs create unnecessary duplication and lose efficiency by running multiple operating systems on top of each other, a loss that is obvious in cloud environments running hundreds of thousands of VMs. Containers solve this duplication problem by emulating OS-level isolation without requiring an entire OS image per container; this is achieved through Linux kernel features such as namespaces and cgroups (etcd, by contrast, is the key-value store Kubernetes uses for cluster state, not an isolation mechanism). Kubernetes manages containers through two levels of grouping; the second level is the Pod, which bundles multiple containers that share resources. Kubernetes also provides load balancing and fail-safe deployment through container replication.

In summary - Containers provide OS-level isolation that makes a single OS look like many, enabling efficient resource use and parallel application deployment, and Kubernetes enables management of many containers. Kubernetes and containers generally interface with cloud storage platforms like S3, but S3 is an object store, not a distributed file system like HDFS.
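To make the Pod abstraction concrete, here is a minimal sketch using the Fabric8 Kubernetes client for Java (6.x-style API). The pod name, images, and namespace are arbitrary placeholders, and a cluster reachable via your local kubeconfig is assumed:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class PodExample {
    public static void main(String[] args) {
        // The client reads the local kubeconfig to locate the cluster.
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // A Pod groups containers that share network and storage --
            // the "second level" of grouping mentioned above.
            Pod pod = new PodBuilder()
                .withNewMetadata().withName("demo-pod").endMetadata()
                .withNewSpec()
                    .addNewContainer()
                        .withName("web")
                        .withImage("nginx:1.25")
                    .endContainer()
                    .addNewContainer()
                        .withName("sidecar")
                        .withImage("busybox:1.36")
                        .withCommand("sh", "-c", "sleep 3600")
                    .endContainer()
                .endSpec()
                .build();

            client.pods().inNamespace("default").resource(pod).create();
        }
    }
}
```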

Hadoop with or without Kubernetes: While Hadoop and Kubernetes fundamentally solve different problems, Kubernetes has gained popularity because containerization solves application dependency and deployment challenges, and Kubernetes with containers provides massive parallelism and scalability.

Kubernetes is a fairly recent advancement in open source, even though Google has been running containers internally for many years; Hadoop, on the other hand, is a decade-old solution and lacks some of that modernization.

So do we need to use Hadoop as a distributed file system with containers and Kubernetes? It really depends on application requirements and the value proposition. Technically it is feasible to run Hadoop with Docker and Kubernetes, but the ecosystem lacks smooth integration. A couple of recent open source projects try to solve this problem; whether Hadoop remains the going-forward solution or a new distributed file system platform is needed, only time will tell. Today, cloud storage platforms, Kafka, and Elasticsearch/Logstash each solve the storage scalability problem with their own strengths in specific areas, while Hadoop and its ecosystem continue to be a dominant big data platform.


  • Standard Hadoop is just Hadoop with MapReduce, Spark, etc., backed by HDFS

  • Hadoop on Kubernetes is the same standard Hadoop as above, but running on Kubernetes

In the case of Hadoop on K8s, you get all the benefits that Kubernetes usually offers over traditional infrastructure.

There is a Helm chart as well:

https://github.com/helm/charts/tree/master/stable/hadoop


You might want to consider looking at this set of charts. In short, it is a collection of Helm charts to spin up Hadoop services on a K8s cluster.

To mention a few highlights:

  • HA NameNode support
  • Kerberos support
  • K8s persistent volume support
  • data volume support
  • etc.

Hope this helps. Cheers


As people have said, "the only difference is that you are in Kubernetes/containers." In reality, that means a couple of huge things for actual operation:

  • The Helm chart linked above is a toy:
    • It builds vanilla Hadoop (i.e. not HDP or CDH)
    • It doesn't do HA NameNodes
    • It doesn't do Kerberos
  • You have to manage your own volumes
    • If you are running on a public cloud this isn't a super big deal, as you can provision storage dynamically (see the sketch after this list)
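As a sketch of that dynamic provisioning: on a public cloud, a PersistentVolumeClaim is usually all it takes to get a disk attached for, say, a DataNode. This uses the same Fabric8 Java client as above, and the claim name, storage class, and size are my assumptions, not values from any particular chart:

```java
import io.fabric8.kubernetes.api.model.PersistentVolumeClaim;
import io.fabric8.kubernetes.api.model.PersistentVolumeClaimBuilder;
import io.fabric8.kubernetes.api.model.Quantity;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class PvcExample {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Claim 100Gi from a storage class; on a public cloud the
            // provisioner creates and attaches the disk automatically.
            PersistentVolumeClaim pvc = new PersistentVolumeClaimBuilder()
                .withNewMetadata().withName("datanode-data").endMetadata()
                .withNewSpec()
                    .withAccessModes("ReadWriteOnce")
                    .withStorageClassName("standard") // placeholder class name
                    .withNewResources()
                        .addToRequests("storage", new Quantity("100Gi"))
                    .endResources()
                .endSpec()
                .build();

            client.persistentVolumeClaims()
                  .inNamespace("default")
                  .resource(pvc)
                  .create();
        }
    }
}
```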

So unless you just want a super lightweight HDFS deployment, are comfortable and willing to build out your own more sophisticated Hadoop-on-K8s deployment, or are willing to pay for a third-party Kubernetes stack with Hadoop support (e.g. robin.io), I would say that in general it is not worth running on K8s at this point.

Note that if/when the Hadoop vendors ship their own operator, this might change.