Delete duplicate rows from a BigQuery table

Not sure why nobody mentioned DISTINCT query.

Here is the way to clean duplicate rows:

CREATE OR REPLACE TABLE project.dataset.table
AS
SELECT DISTINCT * FROM project.dataset.table

UPDATE 2019: To de-duplicate rows on a single partition with a MERGE, see:

https://stackoverflow.com/a/57900778/132438

An alternative to Jordan's answer - this one scales better when having too many duplicates:

#standardSQL
SELECT event.* FROM (
  SELECT ARRAY_AGG(
    t ORDER BY t.created_at DESC LIMIT 1
  )[OFFSET(0)]  event
  FROM `githubarchive.month.201706` t 
  # GROUP BY the id you are de-duplicating by
  GROUP BY actor.id
)

Or a shorter version (takes any row, instead of the newest one):

SELECT k.*
FROM (
  SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k 
  FROM `fh-bigquery.reddit_comments.2017_01` x 
  GROUP BY id
)

To de-duplicate rows on an existing table:

CREATE OR REPLACE TABLE `deleting.deduplicating_table`
AS
# SELECT id FROM UNNEST([1,1,1,2,2]) id
SELECT k.*
FROM (
  SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k 
  FROM `deleting.deduplicating_table` row
  GROUP BY id
)

You can remove duplicates by running a query that rewrites your table (you can use the same table as the destination, or you can create a new table, verify that it has what you want, and then copy it over the old table).

A query that should work is here:

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER()
          OVER (PARTITION BY Fixed_Accident_Index)
          row_number
  FROM Accidents.CleanedFilledCombined
)
WHERE row_number = 1

Delete duplicate rows from a BigQuery table

Tags:

Distinct

Google Bigquery

Related

Recent Posts