How to register S3 Parquet files in a Hive Metastore using Spark on EMR

The solution was to register the Parquet files on S3 as an external table.

# Registers the existing Parquet files at the given S3 prefix as the
# external table "foo". With no source specified, Spark uses the default
# data source (parquet, unless spark.sql.sources.default says otherwise).
sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")

I haven't figured out how to write the files to S3 and register them as an external table all in one shot, but createExternalTable doesn't add much overhead.
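For reference, a minimal sketch of the two-step flow described above (write, then register). It assumes df is an existing DataFrame; the bucket path and table name are placeholders:

# Step 1: write the DataFrame out to S3 as Parquet files.
df.write.mode("overwrite").parquet("s3://bucket/key/prefix/foo/parquet")

# Step 2: register those files in the Hive Metastore as an external table.
sqlContext.createExternalTable("foo", "s3://bucket/key/prefix/foo/parquet")

# The table is now queryable from Spark SQL (and from Hive via the same metastore).
sqlContext.sql("SELECT * FROM foo LIMIT 10").show()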


The way I solved this problem: first, create the Hive table from Spark:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([StructField("key", IntegerType(), True),
                     StructField("value", StringType(), True)])
# Registers an external Parquet table backed by the given S3 location.
df = spark.catalog.createTable("data1", "s3n://XXXX-Bucket/data1", schema=schema)
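To actually put rows in the table, you can append from Spark. A minimal sketch; the sample rows and the insertInto call are illustrative, not part of the original answer:

# Build a small DataFrame matching the table's schema and append it.
rows = spark.createDataFrame([(1, "a"), (2, "b")], schema=schema)
rows.write.insertInto("data1")

# The data lands under s3n://XXXX-Bucket/data1 as Parquet files.
spark.table("data1").show()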

Next, the table created from Spark as above (in this case, data1) will appear in Hive.
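You can confirm this from the Hive CLI or beeline, for example:

SHOW TABLES;
DESCRIBE FORMATTED data1;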

In addition, from another Hive engine you can link to the same data in S3 by creating an external table with the same schema as the one created in Spark:

CREATE EXTERNAL TABLE data1 (key INT, value STRING) STORED AS PARQUET LOCATION 's3n://XXXX-Bucket/data1';
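Because the table is external, dropping it in either engine removes only the metadata and leaves the Parquet files in S3 intact. A quick sanity check from Hive:

SELECT * FROM data1 LIMIT 10;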