How to access HDFS by a URI with HA namenodes from Spark running outside the Hadoop cluster?
Suppose your nameservice is 'hadooptest'; then set the Hadoop configuration properties as below. You can get this information from the hdfs-site.xml file of the remote HA-enabled HDFS cluster (see the hdfs-site.xml sketch after these settings).
sc.hadoopConfiguration.set("dfs.nameservices", "hadooptest")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.hadooptest", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
sc.hadoopConfiguration.set("dfs.ha.namenodes.hadooptest", "nn1,nn2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020")
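For reference, the values above come from entries like the following in the remote cluster's hdfs-site.xml. This is only a minimal sketch using the same illustrative nameservice and addresses; your actual file will contain more properties.
<configuration>
  <!-- logical name of the HA nameservice -->
  <property>
    <name>dfs.nameservices</name>
    <value>hadooptest</value>
  </property>
  <!-- the namenode IDs behind that nameservice -->
  <property>
    <name>dfs.ha.namenodes.hadooptest</name>
    <value>nn1,nn2</value>
  </property>
  <!-- RPC address of each namenode -->
  <property>
    <name>dfs.namenode.rpc-address.hadooptest.nn1</name>
    <value>10.10.14.81:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.hadooptest.nn2</name>
    <value>10.10.14.82:8020</value>
  </property>
  <!-- client-side provider that picks the active namenode -->
  <property>
    <name>dfs.client.failover.proxy.provider.hadooptest</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>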
After this, you can use URLs with the 'hadooptest' nameservice like below.
test.write.orc("hdfs://hadooptest/tmp/test/r1")
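To verify the setup, you can read back through the same HA URI. This is a minimal sketch assuming a SparkSession named spark is available (as in spark-shell):
// Read the ORC data back through the HA nameservice URI; the HDFS client
// resolves 'hadooptest' to the active namenode via the failover proxy provider.
val df = spark.read.orc("hdfs://hadooptest/tmp/test/r1")
df.show()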
If you want to make an HA HDFS cluster your default configuration (mostly the case), so that it applies to every application started through spark-submit or spark-shell, you can write the cluster information into spark-defaults.conf.
sudo vim $SPARK_HOME/conf/spark-defaults.conf
Then add the following lines, assuming your HDFS nameservice is hdfs-k8s:
spark.hadoop.dfs.nameservices hdfs-k8s
spark.hadoop.dfs.ha.namenodes.hdfs-k8s nn0,nn1
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn0 192.168.23.55:8020
spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn1 192.168.23.56:8020
spark.hadoop.dfs.client.failover.proxy.provider.hdfs-k8s org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
It should work when your next application is launched, for example:
sc.addPyFile('hdfs://hdfs-k8s/user/root/env.zip')
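Alternatively, if you prefer not to edit spark-defaults.conf, the same spark.hadoop.* properties can be passed per application on the spark-submit command line. This is a sketch reusing the same illustrative nameservice and addresses; my-app.py is a placeholder for your application:
spark-submit \
  --conf spark.hadoop.dfs.nameservices=hdfs-k8s \
  --conf spark.hadoop.dfs.ha.namenodes.hdfs-k8s=nn0,nn1 \
  --conf spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn0=192.168.23.55:8020 \
  --conf spark.hadoop.dfs.namenode.rpc-address.hdfs-k8s.nn1=192.168.23.56:8020 \
  --conf spark.hadoop.dfs.client.failover.proxy.provider.hdfs-k8s=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider \
  my-app.py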