Postgres: speed up index creation for a large table
Before I posted this question, I had two tests running, but I wasn't sure they would ever finish; I was hoping for some insight into how to speed things up if they never did. Thankfully, one of the tests did finish, and I also got good results from trying the suggestions that jjanes posted.
Using Postgres 9, I ran two tests (sketched below):
- Inserting the 2 billion entries into a table with the indexes already in place
- Building the indexes on a table that already contained the 2 billion entries
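Roughly, the two orderings look like this (the table, column, and index names here are placeholders, not my real schema):

```sql
-- Test 1: index first, then load; the index is maintained on every insert
CREATE INDEX idx_bigtable_col_a ON bigtable (col_a);
-- ... bulk-insert the 2 billion rows ...

-- Test 2: load first, then build the index in a single pass over the data
-- ... bulk-insert the 2 billion rows ...
CREATE INDEX idx_bigtable_col_a ON bigtable (col_a);
```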
The first test had been running for a week and had only inserted 20% of the 2 billion entries. Inserting with the indexes in place is definitely not the way to go.
The second test took 45 hours. Without knowing how long index creation would take, I had prematurely cut off some of my earlier tests and never saw them complete. It takes a long time, but for now I'm happy that it doesn't run forever.
Taking jjanes's suggestions, I upgraded to Postgres 11, set the parallel workers to 8, and set maintenance_work_mem to 7GB. I followed this guide to pick the parameters: https://www.cybertec-postgresql.com/en/postgresql-parallel-create-index-for-better-performance/. This sped up the index creation, and it took only 35 hours. Shaving off 10 hours was a nice surprise. Thanks!
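For anyone following along, those settings amount to something like this session-level setup before the build (the index and table names are placeholders; max_parallel_maintenance_workers is the PostgreSQL 11+ parameter the guide describes):

```sql
SET max_parallel_maintenance_workers = 8;  -- workers for parallel CREATE INDEX (PG 11+)
SET maintenance_work_mem = '7GB';          -- sort memory available to the build
CREATE INDEX idx_bigtable_col_a ON bigtable (col_a);
```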
The best tuning method for creating indexes is a very high value for maintenance_work_mem.
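Since maintenance_work_mem can be changed per session, you can raise it just for the connection doing the build, with no server restart; a minimal sketch (the index name is a placeholder):

```sql
SET maintenance_work_mem = '15GB';  -- applies only to this session
CREATE INDEX idx_bigtable_col_a ON bigtable (col_a);
RESET maintenance_work_mem;         -- back to the server default
```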
If you just add a few million rows to a 1.5 TB table, dropping and recreating the indexes will probably be slower. You'd have to run tests to determine the point where it starts to be faster.
PostgreSQL added the ability to parallelize a single index build in version 11, so you could consider an upgrade to make use of this feature.
Even on the version you are currently using, you can create multiple indexes on the same table at the same time (using a separate database connection to invoke each one), so you can still get some degree of informal parallelization. You can only do this with regular index builds, which lock out changes to the table while the index is being built. A CONCURRENTLY index build allows UPDATE/INSERT/DELETE to run while the index is being built, but does not allow other index builds on the same table to run at the same time.
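A minimal sketch of that informal parallelization, assuming placeholder index and column names — each CREATE INDEX runs in its own connection (e.g. its own psql session):

```sql
-- Session 1:
CREATE INDEX idx_bigtable_col_a ON bigtable (col_a);

-- Session 2, started at the same time from a second connection:
CREATE INDEX idx_bigtable_col_b ON bigtable (col_b);
```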
In my experience, AWS's virtual CPUs are virtually useless: while db.m4.4xlarge reports 16 vCPUs, there is no point in parallelizing beyond 8. If you want to make use of parallelism (either the v11 single-index kind, or just building multiple indexes at the same time), you probably need a smaller value for maintenance_work_mem than 15GB.
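A note on sizing, with illustrative numbers: each plain session can use up to its own maintenance_work_mem, so concurrent builds in separate connections multiply the memory footprint:

```sql
-- Two simultaneous builds, each allowed up to 7GB of sort memory,
-- can use roughly 2 x 7GB = 14GB of RAM between them.
SET maintenance_work_mem = '7GB';
```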
> Would it be faster to just insert the data without dropping the indexes?
At some point, definitely. It depends on how many records are in your bulk inserts, plus a bunch of other things, and can't be answered just by thinking about it. It is an experimental question.
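One way to run that experiment is to time both approaches on a realistic batch in psql (staging_batch and the index name here are hypothetical stand-ins for your bulk-load source):

```sql
\timing on

-- Approach A: insert with the indexes left in place
INSERT INTO bigtable SELECT * FROM staging_batch;

-- Approach B: drop, insert, rebuild
DROP INDEX idx_bigtable_col_a;
INSERT INTO bigtable SELECT * FROM staging_batch;
CREATE INDEX idx_bigtable_col_a ON bigtable (col_a);
```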