overwrite hive partitions using spark

If you are on Spark 2.3.0, try setting spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode overwrite.

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")

I would suggest to run sql using sparksession. you can run " insert overwrite partition query" by selecting the columns from existing dataset. this solution will surely overwrites partition only.

So, if you are using Spark version < 2.3 and want to write into partitions dynamically without deleting the others, you can implement the below solution.

The idea is to register the dataset as a table and then use spark.sql() to run the INSERT query.

// Create SparkSession with Hive dynamic partitioning enabled
val spark: SparkSession =
    SparkSession
        .builder()
        .appName("StatsAnalyzer")
        .enableHiveSupport()
        .config("hive.exec.dynamic.partition", "true")
        .config("hive.exec.dynamic.partition.mode", "nonstrict")
        .getOrCreate()
// Register the dataframe as a Hive table
impressionsDF.createOrReplaceTempView("impressions_dataframe")
// Create the output Hive table
spark.sql(
    s"""
      |CREATE EXTERNAL TABLE stats (
      |   ad            STRING,
      |   impressions   INT,
      |   clicks        INT
      |) PARTITIONED BY (country STRING, year INT, month INT, day INT)
      |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    """.stripMargin
)
// Write the data into disk as Hive partitions
spark.sql(
    s"""
      |INSERT OVERWRITE TABLE stats 
      |PARTITION(country = 'US', year = 2017, month = 3, day)
      |SELECT ad, SUM(impressions), SUM(clicks), day
      |FROM impressions_dataframe
      |GROUP BY ad
    """.stripMargin
)

overwrite hive partitions using spark

Tags:

Hadoop

Amazon Web Services

Scala

Hive

Apache Spark

Related

Recent Posts