How to really reindex data in Elasticsearch
Elasticsearch Reindex from Remote Host to Local Host example (Jan 2020 update)
# show indices on this host
curl 'localhost:9200/_cat/indices?v'
# edit the Elasticsearch configuration file to allow remote reindexing
sudo vi /etc/elasticsearch/elasticsearch.yml
## copy the lines below somewhere in the file
>>>
# --- whitelist for remote reindexing ---
reindex.remote.whitelist: my-remote-machine.my-domain.com:9200
<<<
# restart the elasticsearch service
sudo systemctl restart elasticsearch
# run reindex from remote machine to copy the index named filebeat-2016.12.01
curl -H 'Content-Type: application/json' -X POST '127.0.0.1:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://my-remote-machine.my-domain.com:9200"
    },
    "index": "filebeat-2016.12.01"
  },
  "dest": {
    "index": "filebeat-2016.12.01"
  }
}'
# verify index has been copied
curl 'localhost:9200/_cat/indices?v'
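For a large index this call can run for a long time and the HTTP connection may time out before it finishes. A hedged variant of the same command, using the documented wait_for_completion parameter and the Tasks API (same host and index names as above):
# run the reindex as a background task instead of blocking the call
curl -H 'Content-Type: application/json' -X POST '127.0.0.1:9200/_reindex?wait_for_completion=false&pretty' -d'{
  "source": {
    "remote": {
      "host": "http://my-remote-machine.my-domain.com:9200"
    },
    "index": "filebeat-2016.12.01"
  },
  "dest": {
    "index": "filebeat-2016.12.01"
  }
}'
# this returns a task id; watch progress with the Tasks API
curl 'localhost:9200/_tasks?detailed=true&actions=*reindex&pretty'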
Re-indexing means reading the data, deleting the data in Elasticsearch, and ingesting the data again. There is no such thing as "change the mapping of existing data in place." All the re-indexing tools you mentioned are just wrappers around read->delete->ingest.
You can always adjust the mapping for new indices and add fields later. All the new fields will be indexed with respect to this mapping. Or use dynamic mapping if you are not in control of the new fields.
Have a look at Change default mapping of string to "not analyzed" in Elasticsearch to see how to use dynamic mapping to get not_analyzed string fields.
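As a minimal sketch of that idea (syntax for Elasticsearch 7+, where keyword is the equivalent of a not_analyzed string; the index name my_index is just a placeholder), a dynamic template that maps every new string field to keyword looks like this:
PUT /my_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}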
Re-indexing is very expensive. A better way is to create a new index and drop the old one. To achieve this with zero downtime, use an index alias for all your customers. Think of an index called "data-version1". In steps:
- create your index "data-version1" and give it an alias named "data"
- only use the alias "data" in all your client applications
- to update your mapping: create a new index (with the new mapping) called "data-version2" and put all your data in it (you can use the _reindex API for that)
- to switch from version1 to version2: drop the alias "data" on version1 and create an alias "data" on version2 (or first create, then drop; better yet, see the sketch below). In the time between those two steps your clients will have no (or double) data, but the window between dropping and creating the alias should be so short that your clients shouldn't notice it.
It's good practice to always use aliases.
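In fact the _aliases endpoint accepts several actions in one call and applies them atomically, so there is no window at all. A minimal sketch, assuming the index names from the steps above:
POST /_aliases
{
  "actions": [
    { "remove": { "index": "data-version1", "alias": "data" } },
    { "add": { "index": "data-version2", "alias": "data" } }
  ]
}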
If, like me, you want a straight answer to this common and basic problem, which is poorly addressed by Elastic and the community in general, here is the code that works for me.
Assuming you are just debugging, not in a production environment, and that it is absolutely legitimate to add or remove fields because you don't care about downtime or latency:
# First of all: block writes on the index, which the clone API requires
PUT /my_index/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}
# clone the index into a temporary index
POST /my_index/_clone/my_index-000001
# re-enable writes on the original index, otherwise the _reindex below cannot write into it
PUT /my_index/_settings
{
  "settings": {
    "index.blocks.write": false
  }
}
# copy all documents from the temporary index back into the original index to force their reindexing
POST /_reindex
{
  "source": {
    "index": "my_index-000001"
  },
  "dest": {
    "index": "my_index"
  }
}
# finally delete the temporary index
DELETE my_index-000001
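The flow above never shows the step where the mapping actually changes. For the "add a field" case, a hedged sketch (the title field and its raw sub-field are only illustrative) is to update the mapping right before the _reindex, so that the copied-back documents are indexed against it:
# add a keyword sub-field to an existing text field; documents already in the
# index only become searchable on it once they are indexed again, hence the _reindex
PUT /my_index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "fields": {
        "raw": { "type": "keyword" }
      }
    }
  }
}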
With version 2.3.4 a new API, _reindex, is available which does exactly what it says. Basic usage is:
POST /_reindex
{
  "source": {
    "index": "current_index"
  },
  "dest": {
    "index": "new_index"
  }
}
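One caveat: _reindex copies documents only, not the settings or mappings of the source index, and if the destination does not exist it is auto-created with dynamic mapping. So create the destination with the mapping you want first, along these lines (Elasticsearch 7+ syntax; the message field is only illustrative):
PUT /new_index
{
  "mappings": {
    "properties": {
      "message": { "type": "text" }
    }
  }
}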