What's the best module for interacting with HDFS with Python3?
As far as I know, there are not as many options as one might think, but I'd suggest the Python package hdfs 2.0.12,
which can be downloaded from PyPI or installed from the terminal by running:
pip install hdfs
Some of the features:
- Python (2 and 3) bindings for the WebHDFS (and HttpFS) API, supporting both secure and insecure clusters.
- Command line interface to transfer files and start an interactive client shell, with aliases for convenient namenode URL caching.
- Additional functionality through optional extensions:
  - avro, to read and write Avro files directly from HDFS.
  - dataframe, to load and save pandas DataFrames.
  - kerberos, to support Kerberos-authenticated clusters.
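To give a feel for the API, here's a minimal usage sketch; the namenode URL, user, and paths below are hypothetical placeholders you'd adjust for your own cluster:

```python
from hdfs import InsecureClient

# Hypothetical namenode URL and user; adjust for your cluster.
client = InsecureClient("http://namenode.example.com:50070", user="alice")

# List a directory, then transfer files in both directions.
print(client.list("/user/alice"))
client.upload("/user/alice/data.csv", "data.csv")    # local -> HDFS
client.download("/user/alice/data.csv", "copy.csv")  # HDFS -> local

# Read a file's contents directly.
with client.read("/user/alice/data.csv", encoding="utf-8") as reader:
    print(reader.read())
```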
I have tried snakebite, hdfs3 and hdfs.
Snakebite supports only downloads (no uploads), so it was a no-go for me.
Of the three, only hdfs3 supports an HA setup, so it was my first choice; however, I couldn't make it work with multihomed networks using datanode hostnames (the problem is described here: https://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresses/).
So I ended up using hdfs (2.0.16), as it supports uploads. I had to add a workaround in bash to support HA.
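My workaround was in bash, but as a rough equivalent in Python, here's a hedged sketch that probes a list of candidate namenodes and returns a client bound to whichever one responds; the URLs and user are placeholders:

```python
import requests
from hdfs import InsecureClient
from hdfs.util import HdfsError

# Hypothetical namenode URLs; in an HA pair, only one is active at a time.
NAMENODES = ["http://nn1.example.com:50070", "http://nn2.example.com:50070"]

def active_client(user="alice"):
    """Return a client for whichever namenode currently answers."""
    for url in NAMENODES:
        client = InsecureClient(url, user=user)
        try:
            client.status("/")  # standby/unreachable namenodes fail here
            return client
        except (HdfsError, requests.exceptions.ConnectionError):
            continue
    raise RuntimeError("no active namenode found")

client = active_client()
```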
P.S. There's an interesting article comparing the Python libraries developed for interacting with the Hadoop File System at http://wesmckinney.com/blog/python-hdfs-interfaces/
pyarrow, the Python implementation of Apache Arrow, has a well-maintained and documented HDFS client: https://arrow.apache.org/docs/python/filesystems.html
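Here's a short sketch of pyarrow's filesystem API, assuming a local Hadoop install so libhdfs can be loaded; the host, port, and paths are hypothetical:

```python
import pyarrow.fs as pafs

# Requires libhdfs from a Hadoop install (e.g. CLASSPATH and
# ARROW_LIBHDFS_DIR set appropriately); host and port are placeholders.
hdfs = pafs.HadoopFileSystem(host="namenode.example.com", port=8020)

# List a directory.
for info in hdfs.get_file_info(pafs.FileSelector("/user/alice")):
    print(info.path, info.size)

# Read a file as bytes.
with hdfs.open_input_stream("/user/alice/data.csv") as stream:
    data = stream.read()
```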