Google BigQuery: there are no primary key or unique constraints, so how do you prevent duplicate records from being inserted?

So let's clear up some facts first.

BigQuery is a managed data warehouse suited to large datasets, and it is a complement to a traditional database rather than a replacement for one.

You can only run a maximum of 96 DML (UPDATE, DELETE) operations on a table per day. This is by design: the limit is deliberately low to push you toward treating BigQuery as a data lake. So on BigQuery you let all the data in, and everything is append-only by design. That means the table ends up holding a new row for every update of a record, so if you want to work with the latest data you need to pick the most recent row per record and use that.
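
To pick the latest row per record at query time, a window function does the job. Here is a minimal sketch in BigQuery Standard SQL, assuming a hypothetical `my_dataset.users` table with an `id` key and an `updated_at` timestamp written on every append:

```sql
-- Keep only the most recent row per id; every older row is just that
-- record's history and gets filtered out at read time.
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM `my_dataset.users`
)
WHERE rn = 1;
```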

We actually get useful insights out of every new update appended for the same record. For example, we could measure how long it took end users to choose their country in the signup flow. The country dropdown takes a while to scroll through, and because we ended up with two rows in BigQuery, one before the country was selected and one after, the time between them showed exactly how long the selection took. Based on that we optimized the flow: the dropdown now lists the five most recently/frequently chosen countries first, so those users no longer need to scroll to pick a country, and signup is faster.
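
The measurement itself falls straight out of the append-only history. A minimal sketch, again with hypothetical names: `signup_events` rows carry a `user_id`, a `country` column that is NULL before the selection, and an `event_time` timestamp:

```sql
-- Seconds between the row written before the country was chosen and the
-- row written after it, per signup.
SELECT
  user_id,
  TIMESTAMP_DIFF(
    MAX(IF(country IS NOT NULL, event_time, NULL)),
    MIN(IF(country IS NULL, event_time, NULL)),
    SECOND) AS seconds_to_pick_country
FROM `my_dataset.signup_events`
GROUP BY user_id;
```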


"Bulk Delete and Insert" is the approach I am using to avoid the duplicated records. And Google's own "Youtube BigQuery Transfer Services" is using "Bulk Delete and Insert" too.

"Youtube BigQuery Transfer Services" push daily reports to the same set of report tables every day. Each record has a column "date".

When we run a YouTube BigQuery Transfer backfill (i.e., ask the transfer service to push the reports for certain dates again), the service first deletes the full set of rows for that date from the report tables, and then inserts the full dataset for that date back into the report tables.
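
The same pattern is easy to reproduce for your own date-keyed report tables. A minimal sketch, assuming a hypothetical `reports.daily_metrics` table with a `date` column and a staging table that holds the freshly delivered rows for the date being backfilled:

```sql
-- Bulk delete: wipe everything already stored for the backfilled date...
DELETE FROM `reports.daily_metrics`
WHERE date = '2024-01-15';

-- ...then bulk insert: load the full dataset for that date back in.
INSERT INTO `reports.daily_metrics`
SELECT *
FROM `reports.daily_metrics_staging`
WHERE date = '2024-01-15';
```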

Another approach is to drop the results table first (if it already exists), then re-create it and re-insert the results. I use this approach a lot. Every day, my processing results are saved in results tables in the daily dataset. If I rerun the process for a given day, my script checks whether that day's results tables already exist; if a table exists, it deletes it, re-creates a fresh table, and writes the re-run results into the new table.
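
In BigQuery Standard SQL that delete-and-recreate step can be written as two statements. A minimal sketch, with hypothetical per-day table names for the results and the source data:

```sql
-- Remove the day's results table if a previous run left one behind...
DROP TABLE IF EXISTS `daily_dataset.results_20240115`;

-- ...then re-create it and re-insert that day's results, so a rerun can
-- never leave duplicated rows behind.
CREATE TABLE `daily_dataset.results_20240115` AS
SELECT
  user_id,
  COUNT(*) AS event_count
FROM `daily_dataset.events_20240115`
GROUP BY user_id;
```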