BigQuery: Deleting Duplicates in Partitioned Table

Kind of a hack, but you can use the MERGE statement to delete all of the contents of the table and reinsert only distinct rows atomically. Here's an example:

Click to copy

-- Create a table with some duplicate rows
CREATE TABLE dataset.PartitionedTable
PARTITION BY date AS
SELECT x, CONCAT('foo', CAST(x AS STRING)) AS y, DATE_SUB(CURRENT_DATE(), INTERVAL x DAY) AS date
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS x, UNNEST(GENERATE_ARRAY(1, 10));

Now for the MERGE part:

Click to copy

-- Execute a MERGE statement where all original rows are deleted,
-- then replaced with new, deduplicated rows:
MERGE dataset.PartitionedTable AS t1
USING (SELECT DISTINCT * FROM dataset.PartitionedTable) AS t2
ON FALSE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW
WHEN NOT MATCHED BY SOURCE THEN DELETE

BigQuery: Deleting Duplicates in Partitioned Table

Tags:

Google Bigquery

Bigquery Standard Sql

Related

Recent Posts