Pyspark AWS credentials

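Setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in spark-defaults.conf before creating a Spark session is a clean way to do it. Spark copies any property prefixed with spark.hadoop. into the Hadoop configuration, so the S3A connector sees them as fs.s3a.access.key and fs.s3a.secret.key.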
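As a rough sketch of that approach, the snippet below builds the two properties and renders them in the whitespace-separated form that spark-defaults.conf uses. The helper names (s3a_spark_conf, render_spark_defaults) and the placeholder key values are made up for illustration and are not part of Spark.

```python
def s3a_spark_conf(access_key, secret_key):
    """Return the Spark properties that carry S3A credentials.

    The "spark.hadoop." prefix tells Spark to copy the rest of the
    property name into the Hadoop configuration.
    """
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def render_spark_defaults(conf):
    """Render properties as spark-defaults.conf lines: name, space, value."""
    return "\n".join(f"{name} {value}" for name, value in conf.items())


if __name__ == "__main__":
    # Placeholder values only; real keys should come from a secure source.
    print(render_spark_defaults(s3a_spark_conf("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY")))
```

The same dictionary could probably also be passed, one pair at a time, to SparkSession.builder.config(name, value) before getOrCreate(), which avoids editing the file at all.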

I have also had success with Spark 2.3.2 and a pyspark shell by setting these dynamically from within an existing Spark session:

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_ACCESS_KEY_ID)
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET_ACCESS_KEY)

After that, I was able to read from and write to S3 using the s3a scheme:

documents = spark.sparkContext.textFile('s3a://bucket_name/key')
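The dynamic approach above can be wrapped in a small helper so the keys are not hard-coded in the shell. This is only a sketch: the function name set_s3a_credentials is hypothetical, and it assumes the credentials live in the conventional AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. It accepts any object with a set(name, value) method, such as the one returned by spark.sparkContext._jsc.hadoopConfiguration().

```python
import os


def set_s3a_credentials(hadoop_conf, env=None):
    """Copy AWS credentials from environment variables into a Hadoop conf.

    hadoop_conf: any object exposing set(name, value), for example
                 spark.sparkContext._jsc.hadoopConfiguration().
    env:         mapping to read from; defaults to os.environ.
    """
    env = os.environ if env is None else env
    # A missing variable raises KeyError here instead of failing later on S3.
    access_key = env["AWS_ACCESS_KEY_ID"]
    secret_key = env["AWS_SECRET_ACCESS_KEY"]
    hadoop_conf.set("fs.s3a.access.key", access_key)
    hadoop_conf.set("fs.s3a.secret.key", secret_key)


# Usage inside a pyspark shell (assumes `spark` already exists):
#   set_s3a_credentials(spark.sparkContext._jsc.hadoopConfiguration())
#   documents = spark.sparkContext.textFile('s3a://bucket_name/key')
```

Note that _jsc is a private attribute of SparkContext, so this relies on PySpark internals in the same way the original lines do.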

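For PySpark, the credentials can also be set for the older s3n connector as shown below, where sc is the SparkContext. Note that s3n was removed in Hadoop 3, so prefer the s3a properties on newer builds.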

  sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY)
  sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_KEY)
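Because the two connectors use different property names, a small lookup can make it harder to mix them up. The sketch below is illustrative only: CREDENTIAL_PROPERTIES and credential_settings are hypothetical names, and the table covers just the two schemes mentioned in this thread.

```python
# Hadoop property names for each connector, as used in the answers above.
CREDENTIAL_PROPERTIES = {
    "s3a": ("fs.s3a.access.key", "fs.s3a.secret.key"),
    "s3n": ("fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey"),
}


def credential_settings(scheme, access_key, secret_key):
    """Return {property_name: value} for the given URL scheme."""
    if scheme not in CREDENTIAL_PROPERTIES:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    access_prop, secret_prop = CREDENTIAL_PROPERTIES[scheme]
    return {access_prop: access_key, secret_prop: secret_key}


# Usage inside a pyspark shell (assumes `sc` already exists):
#   for name, value in credential_settings("s3n", AWS_ACCESS_KEY, AWS_SECRET_KEY).items():
#       sc._jsc.hadoopConfiguration().set(name, value)
```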
