BigQuery - DELETE statement to remove duplicates

This has to be the easiest way:

create or replace table `myproject.mydataset.duplicates` as (
select distinct *
from `myproject.mydataset.duplicates`)

If you have an array data type, try this:

-- build a test table with a duplicate and an array datatype column --
create or replace table DW.pmoTest as (
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 2, 3]  as int_array, current_date as createdate
union all
select 1 as ID, 'peter' as firstname,ARRAY<INT64>[1, 7, 3] as int_array, current_date as createdate
union all
select 2 as ID, 'chamri' as firstname,ARRAY<INT64>[1, 2, 39, 4] as int_array, current_date as createdate
);

-- recreate table without duplicate row
create or replace table DW.pmoTest as (
SELECT col.* FROM (
  SELECT ARRAY_AGG(tbl ORDER BY createdate LIMIT 1)[OFFSET(0)]  col
  FROM DW.pmoTest tbl
  GROUP BY ID
  )
);

Below actually :o) works

#standardSQL
DELETE FROM `yourproject.yourdataset.duplicates`
WHERE STRUCT(id, loadTime) NOT IN (
        SELECT AS STRUCT id, MAX(loadTime) loadTime 
        FROM `yourproject.yourdataset.duplicates` 
        GROUP BY id)

Note: it assumes that loadTime is also unique - meaning if for given id there are more than one record with latest loadTime - they all will be preserved

From the syntax documentation, the argument to DELETE needs to be a table, and there is no provision for using a WITH clause. This makes sense given that you can't delete from what is essentially a logical view (a CTE). You can express what you want by putting the logic inside the filter, e.g.

DELETE
FROM duplicates AS d
WHERE (SELECT ROW_NUMBER() OVER (PARTITION BY id ORDER BY loadTime DESC)
       FROM `duplicates` AS d2
       WHERE d.id = d2.id AND d.loadTime = d2.loadTime) > 1;

BigQuery - DELETE statement to remove duplicates

Tags:

Google Bigquery

Related

Recent Posts