How to update 10 million+ rows in MySQL single table as Fast as possible?
Based on my experience, I would use LOAD DATA INFILE to import your CSV File.
The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed.
Here is an example I found on the internet: Load Data example. I tested it on my box and it worked fine.
Example Table
CREATE TABLE example (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Column2` varchar(14) NOT NULL,
`Column3` varchar(14) NOT NULL,
`Column4` varchar(14) NOT NULL,
`Column5` DATE NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB
Example CSV File
# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013
Import statement to be run from the MySQL console:
LOAD DATA LOCAL INFILE '/tmp/example.csv'
INTO TABLE example
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(id, Column3, Column4, @Column5)
SET
Column5 = str_to_date(@Column5, '%m/%d/%Y');
Result
MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5    |
+----+---------+---------+---------+------------+
|  1 |         | Column2 | Column3 | 0000-00-00 |
|  2 |         | B       | Bar     | 0000-00-00 |
|  3 |         | C       | Foo     | 0000-00-00 |
|  4 |         | D       | Bar     | 0000-00-00 |
|  5 |         | E       | FOObar  | 0000-00-00 |
+----+---------+---------+---------+------------+
IGNORE 1 LINES simply skips the first line of the file, which contains the column headers.
After IGNORE, we list the columns to import (leaving Column2 out), which matches one of the criteria in your question.
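If you need to discard a field that is present in the CSV but that you do not want to load, MySQL also lets you read it into a user variable and simply never assign it. A small sketch based on the same example file (the @dummy name is just an illustration, not from the question):
LOAD DATA LOCAL INFILE '/tmp/example.csv'
INTO TABLE example
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(id, @dummy, Column3, Column4, @Column5)  -- @dummy swallows the file's 2nd field
SET Column5 = str_to_date(@Column5, '%m/%d/%Y');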
Here is another example directly from Oracle: LOAD DATA INFILE example.
This should be enough to get you started.
In light of all the things mentioned, it looks like the bottleneck is the join itself.
ASPECT #1 : Join Buffer Size
In all likelihood, your join_buffer_size is too low.
According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache:
We only store the used columns in the join buffer, not the whole rows.
This being the case, size the buffer so the join keys stay in RAM.
You have 10 million rows times 4 bytes for each key. That's about 40M.
Try bumping it up in your session to 42M (a little bigger than 40M):
SET join_buffer_size = 1024 * 1024 * 42;
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4;
If this does the trick, proceed to add this to my.cnf:
[mysqld]
join_buffer_size = 42M
Restarting mysqld is not required for new connections to pick this up. Just run:
mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;
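A quick way to verify what a given connection will actually use (the session value is what the join reads):
mysql> SELECT @@GLOBAL.join_buffer_size, @@SESSION.join_buffer_size;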
ASPECT #2 : Join Operation
You could change how the join is executed by tweaking the optimizer.
According to the MySQL Documentation on Block Nested-Loop and Batched Key Access Joins:
When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.
For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.
This same page recommends doing this:
mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
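If you want to confirm that BKA is actually chosen, MySQL 5.6 and later can EXPLAIN an UPDATE; when BKA applies, the Extra column shows "Using join buffer (Batched Key Access)". Using the same placeholder table and column names as the UPDATE above:
SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
EXPLAIN
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4;
-- look for "Using join buffer (Batched Key Access)" in the Extra column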
ASPECT #3 : Writing Updates to Disk (OPTIONAL)
Most people forget to increase innodb_write_io_threads so that dirty pages are written out of the buffer pool faster.
[mysqld]
innodb_write_io_threads = 16
You will have to restart MySQL for this change to take effect.
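Before touching my.cnf, you can check what the server is currently configured with:
mysql> SHOW GLOBAL VARIABLES LIKE 'innodb_%io_threads';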
GIVE IT A TRY !!!
1. CREATE TABLE that matches the CSV
2. LOAD DATA into that table
3. UPDATE real_table JOIN csv_table ON ... SET ..., ..., ...;
4. DROP TABLE csv_table;
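As a rough sketch of those four steps (real_table, csv_table, the field names, and the file path are placeholders, not your actual schema):
-- 1. Staging table shaped like the CSV; only the join key is indexed
CREATE TABLE csv_table (
  field1 INT NOT NULL,
  field2 VARCHAR(64) NOT NULL,
  field3 VARCHAR(64) NOT NULL,
  field4 VARCHAR(64) NOT NULL,
  PRIMARY KEY (field1)
) ENGINE=InnoDB;

-- 2. Bulk-load the CSV into the staging table
LOAD DATA LOCAL INFILE '/tmp/data.csv'
INTO TABLE csv_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(field1, field2, field3, field4);

-- 3. One set-based UPDATE through the join on the key
UPDATE real_table r
JOIN csv_table c ON c.field1 = r.field1
SET r.field2 = c.field2,
    r.field3 = c.field3,
    r.field4 = c.field4;

-- 4. Throw the staging table away
DROP TABLE csv_table;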
Step 3 will be a lot faster than row-by-row, but it will still lock all the rows in the table for a non-trivial amount of time. If this lock time is more important than how long the entire process takes, then ...
If nothing else is writing to the table, then...
1. CREATE TABLE that matches the CSV; no indexes except what is needed in the JOIN in the UPDATE. If unique, make it PRIMARY KEY.
2. LOAD DATA into that table
3. Copy the real_table to new_table (CREATE ... SELECT)
4. UPDATE new_table JOIN csv_table ON ... SET ..., ..., ...;
5. RENAME TABLE real_table TO old, new_table TO real_table;
6. DROP TABLE csv_table, old;
Step 3 (the copy) is faster than the UPDATE, especially if unnecessary indexes are left off.
Step 5 is "instantaneous".
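A rough sketch of this second recipe, with the same made-up names as before (old_table standing in for "old"); the heavy work happens on new_table while real_table stays available, and the RENAME swap is what makes step 5 near-instant:
-- 1. Staging table for the CSV; the unique join key becomes the PRIMARY KEY
CREATE TABLE csv_table (
  field1 INT NOT NULL,
  field2 VARCHAR(64) NOT NULL,
  PRIMARY KEY (field1)
) ENGINE=InnoDB;

-- 2. Bulk-load the CSV
LOAD DATA LOCAL INFILE '/tmp/data.csv'
INTO TABLE csv_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(field1, field2);

-- 3. Copy real_table to new_table (CREATE ... SELECT copies data, not indexes,
--    so add back any indexes real_table needs before or after the swap)
CREATE TABLE new_table ENGINE=InnoDB SELECT * FROM real_table;

-- 4. Update the copy; real_table is never locked by this
UPDATE new_table n
JOIN csv_table c ON c.field1 = n.field1
SET n.field2 = c.field2;

-- 5. Atomic swap
RENAME TABLE real_table TO old_table, new_table TO real_table;

-- 6. Clean up
DROP TABLE csv_table, old_table;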