efficient way to compare two tables in bigquery
In Standard SQL, we found using a UNION ALL
of two EXCEPT DISTINCT
's works for our use cases:
(
SELECT * FROM table1
EXCEPT DISTINCT
SELECT * from table2
)
UNION ALL
(
SELECT * FROM table2
EXCEPT DISTINCT
SELECT * from table1
)
This will produce differences in both directions:
- rows in
table1
that are not intable2
- rows in
table2
that are not intable1
Notes and caveats:
table1
andtable2
must be of the same width and have columns in the same order and type.- this does not work directly with
STRUCT
orARRAY
data types. You should eitherUNNEST
, or useTO_JSON_STRING
to convert the these data types first. - this does not directly work with
GEOGRAPHY
either, you must cast to text first usingST_AsText
You will need to specify which are the numerical columns, but looking at a representation of all of them will do the fast compare:
#standardSQL
WITH table_a AS (
SELECT 1 id, 2 n1, 3 n2
), table_b AS (
SELECT 1 id, 2 n1, 4 n2
)
SELECT id
FROM table_a a
JOIN table_b b
USING(id)
WHERE TO_JSON_STRING([a.n1, a.n2]) != TO_JSON_STRING([b.n1, b.n2])
First, I want to bring up issues with your original query
The main issues are 1) using LEFT JOIN ; 2) using col != 0
Below is how it should be modified to really capture ALL differences from both tables
Run your original query and below one - and hopefully you will see the difference
#standardSQL
SELECT key1, key2
FROM
(
SELECT
IFNULL(table1.key1, table2.key1) key1,
IFNULL(table1.key2, table2.key2) key2,
table1.column1 - table2.column1 AS col1,
table1.col2 - table2.col2 AS col2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR IFNULL(col2, 1) != 0
or you can just try to run your original and above version against dummy data to see the difference
#standardSQL
WITH `table1` AS (
SELECT 1 key1, 1 key2, 1 column1, 2 col2 UNION ALL
SELECT 2, 2, 3, 4 UNION ALL
SELECT 3, 3, 5, 6
), `table2` AS (
SELECT 1 key1, 1 key2, 1 column1, 29 col2 UNION ALL
SELECT 2, 2, 3, 4 UNION ALL
SELECT 4, 4, 7, 8
)
SELECT key1, key2
FROM
(
SELECT
IFNULL(table1.key1, table2.key1) key1,
IFNULL(table1.key2, table2.key2) key2,
table1.column1 - table2.column1 AS col1,
table1.col2 - table2.col2 AS col2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
)
WHERE IFNULL(col1, 1) != 0
OR IFNULL(col2, 1) != 0
Secondly, below will highly simplify your overall query
#standardSQL
SELECT
IFNULL(table1.key1, table2.key1) key1,
IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING(table1) != TO_JSON_STRING(table2)
You can test it with the same dummy data example as above
Note: in this solution you don't need to pick specific columns - it just compare all columns! but if you need to compare only specific columns - you still will need to cherry-pick them like in below example
#standardSQL
SELECT
IFNULL(table1.key1, table2.key1) key1,
IFNULL(table1.key2, table2.key2) key2
FROM `table1` AS table1
FULL OUTER JOIN `table2` AS table2
ON table1.key1 = table2.key1
AND table1.key2 = table2.key2
WHERE TO_JSON_STRING((table1.column1, table1.col2)) != TO_JSON_STRING((table2.column1, table2.col2))