Fastest technique to deleting duplicate data

What about EXISTS:

DELETE FROM sourceTable
 WHERE EXISTS(SELECT NULL
                FROM #dupTemp dt
               WHERE sourceTable.a = dt.a 
                 AND sourceTable.b = dt.b 
                 AND sourceTable.c = dt.c 
                 AND sourceTable.d = dt.d 
                 AND sourceTable.e = dt.e 
                 AND sourceTable.f = dt.f 
                 AND sourceTable.g = dt.g 
                 AND sourceTable.h = dt.h 
                 AND sourceTable.i = dt.i 
                 AND sourceTable.PriKey < dt.MaxPriKey)

Can you afford to have the original table unavailable for a short time?

I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.

Then drop the original table and rename the intermediate table to have the same name as the old table.

The bottleneck in bulk row deletion is usually the transaction that SQL Server has to build up. You might be able to speed it up considerably by splitting the removal into smaller transactions. For example, to delete 100 rows at a time:

while 1=1
    begin

    delete top 100
    from sourceTable 
    ...

    if @@rowcount = 0
        break
    end

Fastest technique to deleting duplicate data

Tags:

Sql

Sql Server

Sql Server 2008

Etl

Related

Recent Posts