Is there a smarter way to reindex Elasticsearch?
Wrote up a blog post about how I handled reindexing with no downtime recently. Takes some time to figure out all the little things that need to be in place to do so. Hope this helps!
https://summera.github.io/infrastructure/2016/07/04/reindexing-elasticsearch.html
To summarize:
Step 1: Prepare New Index
Create your new index with your new mapping. This can be on the same instance of Elasticsearch or on a brand new instance.
Step 2: Keep Indexes Up To Date
While you're reindexing, you want to keep both your new and old indexes up to date. For a write operation, this can be done by having a background worker send the write to both the new and the old index.
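As a rough illustration of the dual-write idea, here is a minimal in-process sketch. The DualWriter class and the two index functions are hypothetical names, not part of any Elasticsearch client; in a real application each function would enqueue a background job that indexes the document.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature for "send this document to one index".
IndexFn = Callable[[str, dict], None]


class DualWriter:
    """Fan every write out to both the old and the new index."""

    def __init__(self, targets: Dict[str, IndexFn]) -> None:
        self.targets = targets
        self.failures: List[Tuple[str, str, Exception]] = []

    def write(self, doc_id: str, doc: dict) -> None:
        for name, index_fn in self.targets.items():
            try:
                index_fn(doc_id, doc)
            except Exception as exc:
                # Record the failure for a later retry; a problem with one
                # index should not stop the write to the other.
                self.failures.append((name, doc_id, exc))


# Plain dicts stand in for the two indexes so the sketch runs without a server.
old_index: Dict[str, dict] = {}
new_index: Dict[str, dict] = {}
writer = DualWriter({
    "old": old_index.__setitem__,
    "new": new_index.__setitem__,
})
writer.write("1", {"title": "hello"})
```

After the call, both stand-in indexes hold the same document, which is the invariant you want to maintain for the whole duration of the reindex.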
Deletes are a bit trickier because there is a race condition between deleting and reindexing the record into the new index. So, you'll want to keep track of the records that need to be deleted during your reindex and process these when you are finished. If you aren't performing many deletes, another way would be to eliminate the possibility of a delete during your reindex.
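One way to picture the deferred-delete approach is the sketch below. DeferredDeletes is a hypothetical helper, and plain dicts stand in for the indexes. It assumes deletes hit the old index immediately and are queued for the new index until the reindex has finished.

```python
from typing import Callable, Dict, List

DeleteFn = Callable[[str], None]


class DeferredDeletes:
    """Queue deletes for the new index while a reindex is running."""

    def __init__(self) -> None:
        self.reindexing = False
        self.pending: List[str] = []

    def start(self) -> None:
        self.reindexing = True

    def delete(self, doc_id: str, delete_old: DeleteFn, delete_new: DeleteFn) -> None:
        delete_old(doc_id)
        if self.reindexing:
            # The reindex may still copy this document from its snapshot,
            # so remember the id and delete it once the copy is done.
            self.pending.append(doc_id)
        else:
            delete_new(doc_id)

    def finish(self, delete_new: DeleteFn) -> None:
        self.reindexing = False
        while self.pending:
            delete_new(self.pending.pop(0))


old: Dict[str, dict] = {"1": {"t": "a"}, "2": {"t": "b"}}
new: Dict[str, dict] = {}
snapshot = dict(old)  # what a scrolled search would still see

deletes = DeferredDeletes()
deletes.start()
deletes.delete("1", lambda i: old.pop(i, None), lambda i: new.pop(i, None))
new.update(snapshot)  # the reindex copies the stale snapshot, including "1"
deletes.finish(lambda i: new.pop(i, None))
```

Without the queue, document "1" would reappear in the new index after the copy; replaying the pending deletes at the end removes it.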
Step 3: Perform Reindexing
You’ll want to use a scrolled search for reading the data and bulk API for inserting. Since after Step 2 you'll be writing new and updated documents to the new index in the background, you want to make sure you do NOT update existing documents in the new index with your bulk API requests.
This means that the operation you want for your bulk API requests is create, not index. From the documentation: “create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary”. The main point here is you do not want old data from the scrolled search snapshot to overwrite new data in the new index.
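To make the create versus index distinction concrete, here is a small sketch that builds a bulk request body using only create actions and counts how many documents were skipped. The function names are made up for illustration, the action metadata assumes a recent Elasticsearch version (older versions also expect a _type field), and the response shown is a hand-written example rather than real server output.

```python
import json
from typing import Iterable, Tuple


def bulk_create_body(index: str, docs: Iterable[Tuple[str, dict]]) -> str:
    """Build a newline-delimited bulk body that uses `create` for every doc."""
    lines = []
    for doc_id, source in docs:
        # `create` fails if the id already exists, so newer documents written
        # by the application are never overwritten by the stale snapshot.
        lines.append(json.dumps({"create": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


def count_skipped(response: dict) -> int:
    """Count items rejected with HTTP 409 because the document already exists."""
    return sum(
        1
        for item in response.get("items", [])
        if item.get("create", {}).get("status") == 409
    )


body = bulk_create_body("articles-B", [("1", {"title": "a"}), ("2", {"title": "b"})])

# Illustrative response shape only: one document created, one already present.
example_response = {
    "errors": True,
    "items": [
        {"create": {"_id": "1", "status": 201}},
        {"create": {"_id": "2", "status": 409}},
    ],
}
skipped = count_skipped(example_response)
```

A 409 on a create action is expected here and can be ignored: it means the application already wrote a fresher copy of that document to the new index.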
There's a great script on GitHub to help you with this process: es-reindex.
Step 4: Switch Over
Once you’re finished reindexing, it’s time to switch your search over to the new index. You’ll want to turn deletes back on or process the enqueued delete jobs for the new index. You may notice that searching the new index is a bit slow at first. This is because Elasticsearch and the JVM need time to warm up.
Perform any code changes you need so your application starts searching the new index. You can continue writing to the old index in case you run into problems and need to roll back. If you feel this is unnecessary, you can stop writing to it.
Step 5: Clean Up
At this point you should be completely transitioned to the new index. If everything is going well, perform any necessary cleanup such as:
- Delete the old index host if it’s different from the new
- Remove serialization code related to your old index
Yes, there are smarter ways to re-index your data without downtime.
First, never, ever use the "final" index name as your real index name. So, if you'd like to name your index "articles", don't use that name as a physical index, but create an index such as "articles-2012-12-12" or "articles-A", "articles-1", etc.
Second, create an alias "articles" pointing to that index. Your application will then use this alias, so you'll never need to manually change the index name, restart the application, etc.
Third, when you want or need to re-index the data, re-index it into a different index, let's say "articles-B" -- all the tools in Tire's indexing toolchain support you here.
When you're done, point the alias to the new index. In this way, you not only minimize downtime (there isn't any), you also have a safe snapshot: if you somehow mess up the indexing into the new index, you can just switch back to the old one, until you resolve the issue.
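The alias switch can be expressed as a single request body for the _aliases endpoint. The sketch below only builds that body; the helper name alias_swap_actions is made up for illustration, and the index names follow the "articles-A" / "articles-B" example.

```python
import json


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build a body for POST /_aliases that moves `alias` in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Point "articles" at the freshly built index...
switch = alias_swap_actions("articles", "articles-A", "articles-B")
# ...and, if something went wrong, the same helper builds the rollback.
rollback = alias_swap_actions("articles", "articles-B", "articles-A")

payload = json.dumps(switch)  # what would be sent as the request body
```

Because both actions travel in one request, there is no moment at which the alias points to nothing, and rolling back is just the same call with the two index names exchanged.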
I think @karmi gets it right. However, let me explain it a bit more simply. I occasionally need to upgrade the production schema with new properties or analysis settings. I recently started using the scenario described below to do live, constant-load, zero-downtime index migrations. You can do that remotely.
Here are the steps:
Assumptions:
- You have index real1 and aliases real_write, real_read pointing to it,
- the client writes only to real_write and reads only from real_read,
- _source property of document is available.
1. New index
Create real2 index with new mapping and settings of your choice.
2. Writer alias switch
Switch the write alias using the following request.
curl -XPOST 'http://esserver:9200/_aliases' -H 'Content-Type: application/json' -d '
{
"actions" : [
{ "remove" : { "index" : "real1", "alias" : "real_write" } },
{ "add" : { "index" : "real2", "alias" : "real_write" } }
]
}'
This is an atomic operation. From this point on, real2 is populated with new client data on all nodes. Readers still use the old real1 via real_read. This is eventual consistency.
3. Old data migration
Data must be migrated from real1 to real2; however, new documents in real2 must not be overwritten with old entries. The migration script should use the bulk API with the create operation (not index or update). I use a simple Ruby script, es-reindex, which has a nice E.T.A. status:
$ ruby es-reindex.rb http://esserver:9200/real1 http://esserver:9200/real2
UPDATE 2017: You may want to consider the newer Reindex API instead of the script. It has a lot of interesting features, such as conflict reporting.
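For reference, a Reindex API call that mirrors the create-only behaviour described in this step might look like the sketch below. It only builds the request and does not send it; the helper names are made up, and the host is the same placeholder used in the curl examples.

```python
import json
import urllib.request


def reindex_body(source_index: str, dest_index: str) -> dict:
    """Body for POST /_reindex that never overwrites existing documents."""
    return {
        # Keep going when a document already exists instead of aborting.
        "conflicts": "proceed",
        "source": {"index": source_index},
        # op_type=create only adds documents missing from the destination.
        "dest": {"index": dest_index, "op_type": "create"},
    }


def reindex_request(base_url: str, body: dict) -> urllib.request.Request:
    """Wrap the body in an HTTP request object without sending it."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


body = reindex_body("real1", "real2")
request = reindex_request("http://esserver:9200", body)
# To actually start the copy: urllib.request.urlopen(request)
```

With "conflicts" set to "proceed", documents that clients have already written to real2 are counted as version conflicts in the response rather than stopping the whole job.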
4. Reader alias switch
Now real2 is up to date and clients are writing to it; however, they are still reading from real1. Let's update the reader alias:
curl -XPOST 'http://esserver:9200/_aliases' -H 'Content-Type: application/json' -d '
{
"actions" : [
{ "remove" : { "index" : "real1", "alias" : "real_read" } },
{ "add" : { "index" : "real2", "alias" : "real_read" } }
]
}'
5. Backup and delete old index
Writes and reads now go to real2. You can back up and delete the real1 index from the ES cluster.
Done!