How to Monitor an Amazon Elasticsearch Service Cluster Update Process
- Author: Ruan Bekker (@ruanbekker)
When you make a configuration change on Amazon Elasticsearch Service, it performs a blue/green deployment. New nodes are allocated to the cluster (which you will notice in CloudWatch when looking at the node metrics). Once these nodes are deployed, data is copied across to the new nodes and traffic is directed to them. Once that is done, the old nodes are terminated.
Note: While there will be more nodes in the cluster, you will not get billed for the extra nodes.
While this process is running, you can monitor your cluster to see the progress:
The Shards API:
Using the /_cat/shards API, you will find that the shards are in a RELOCATING state (keep in mind that this is only the case while the change is still in progress):
$ curl -s -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/_cat/shards?v' | grep -v 'STARTED'
index shard prirep state docs store ip node
example-app1-2018.02.23 4 r RELOCATING 323498 1018.3mb x.x.x.x x2mKoe_ -> x.x.x.x GyNiRJyeSTifN_9JZisGuQ GyNiRJy
example-app1-2018.02.28 2 p RELOCATING 477609 1.5gb x.x.x.x x2mKoe_ -> x.x.x.x sOihejw1SrKtag_LO1RGIA sOihejw
example-app1-2018.03.01 3 r RELOCATING 463143 1.5gb x.x.x.x ZZfv-Ha -> x.x.x.x jOchdCZWQq-TAPZNTadNoA jOchdCZ
fortinet-syslog-2018.02 0 p RELOCATING 1218556 462.2mb x.x.x.x moQA57Y -> x.x.x.x sOihejw1SrKtag_LO1RGIA sOihejw
example-app1-2018.03.23 3 r RELOCATING 821254 2.4gb x.x.x.x moQA57Y -> x.x.x.x GyNiRJyeSTifN_9JZisGuQ GyNiRJy
example-app1-2018.04.02 2 p RELOCATING 1085279 3.4gb x.x.x.x x2mKoe_ -> x.x.x.x jOchdCZWQq-TAPZNTadNoA jOchdCZ
example-app1-2018.02.08 3 p RELOCATING 136321 125mb x.x.x.x ZUZSFWu -> x.x.x.x tyU_V_KLS5mZXEwnF-YEAQ tyU_V_K
fortinet-syslog-2018.04 4 r RELOCATING 7513842 2.8gb x.x.x.x ZZfv-Ha -> x.x.x.x il1WsroNSgGmXJugds_aMQ il1Wsro
example-app1-2018.04.09 1 r RELOCATING 1074581 3.5gb x.x.x.x ZRzKGe5 -> x.x.x.x il1WsroNSgGmXJugds_aMQ il1Wsro
example-app1-2018.04.09 0 p RELOCATING 1074565 3.5gb x.x.x.x moQA57Y -> x.x.x.x tyU_V_KLS5mZXEwnF-YEAQ tyU_V_K
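To get a quick feel for how much each old node still has to hand over, the output above can be summarised per source node. The following is only a minimal sketch: the function name relocating_by_source is made up for this example, and it assumes the plain-text column layout shown above (index, shard, prirep, state, docs, store, ip, node).

```python
from collections import Counter


def relocating_by_source(cat_shards_text):
    """Count RELOCATING shards per source node from `_cat/shards?v` text.

    Hypothetical helper, not part of any Elasticsearch client library.
    Assumed columns: index shard prirep state docs store ip node [-> target...]
    """
    counts = Counter()
    for line in cat_shards_text.splitlines():
        fields = line.split()
        # fields[3] is the state column, fields[7] is the source node name
        if len(fields) >= 8 and fields[3] == "RELOCATING":
            counts[fields[7]] += 1
    return counts


# A few lines copied from the output shown above
sample_shards = """\
index shard prirep state docs store ip node
example-app1-2018.02.23 4 r RELOCATING 323498 1018.3mb x.x.x.x x2mKoe_ -> x.x.x.x GyNiRJyeSTifN_9JZisGuQ GyNiRJy
example-app1-2018.02.28 2 p RELOCATING 477609 1.5gb x.x.x.x x2mKoe_ -> x.x.x.x sOihejw1SrKtag_LO1RGIA sOihejw
example-app1-2018.03.01 3 r RELOCATING 463143 1.5gb x.x.x.x ZZfv-Ha -> x.x.x.x jOchdCZWQq-TAPZNTadNoA jOchdCZ
"""

for node, count in relocating_by_source(sample_shards).most_common():
    print(node, count)
```

In practice you would feed it the body returned by the curl command instead of the sample text.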
The Recovery API:
We can then use the /_cat/recovery API, which shows the progress of the shards transferring to the other nodes. It reports the following columns:
index, shard, time, type, stage, source_host, target_host, files, files_percent, bytes, bytes_percent
As Amazon masks the node IP addresses, the IPs are not available. To make the output more readable, we request only the columns that we are interested in and filter out the shards whose stage is done:
$ curl -s -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/_cat/recovery?v&h=i,s,t,ty,st,shost,thost,f,fp,b,bp' | grep -v 'done'
i s t ty st shost thost f fp b bp
example-app1-2018.04.11 1 2m peer index x.x.x.x x.x.x.x 139 97.1% 3435483673 65.9%
web-syslog-2018.04 4 7.6m peer finalize x.x.x.x x.x.x.x 109 100.0% 2854310892 100.0%
example-app1-2018.04.16 3 2.9m peer translog x.x.x.x x.x.x.x 130 100.0% 446180036 100.0%
example-app1-2018.03.30 3 2.1m peer index x.x.x.x x.x.x.x 127 97.6% 3862498583 62.5%
example-app1-2018.04.01 0 4.4m peer index x.x.x.x x.x.x.x 140 99.3% 3410543270 87.9%
example-app1-2018.04.06 0 5.1m peer index x.x.x.x x.x.x.x 128 97.7% 4291421948 66.3%
example-app1-2018.04.07 0 52.2s peer index x.x.x.x x.x.x.x 149 91.9% 3969581277 27.4%
network-capture-2018.04.01 2 11.4s peer index x.x.x.x x.x.x.x 107 95.3% 359987163 55.0%
example-app1-2018.03.17 1 1.7m peer index x.x.x.x x.x.x.x 117 98.3% 2104196548 74.5%
example-app1-2018.02.25 3 58.4s peer index x.x.x.x x.x.x.x 102 98.0% 945437614 74.7%
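If you would rather see the shards that are furthest from finishing first, the same output can be sorted by the bp (bytes_percent) column. This is a rough sketch under a few assumptions: the helper name active_recoveries is invented here, the text was fetched with the same h=... column list as above, and no column value contains a space.

```python
def active_recoveries(cat_recovery_text):
    """Return unfinished recoveries, least complete first.

    Hypothetical helper. Expects `_cat/recovery?v` text that includes the
    short column names `st` (stage) and `bp` (bytes_percent) in its header.
    """
    rows = [line.split() for line in cat_recovery_text.splitlines() if line.strip()]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    records = [dict(zip(header, row)) for row in body]
    # Same filter as `grep -v 'done'`, but applied to the stage column only
    active = [r for r in records if r["st"] != "done"]
    return sorted(active, key=lambda r: float(r["bp"].rstrip("%")))


# A few lines copied from the output shown above
sample_recovery = """\
i s t ty st shost thost f fp b bp
example-app1-2018.04.11 1 2m peer index x.x.x.x x.x.x.x 139 97.1% 3435483673 65.9%
web-syslog-2018.04 4 7.6m peer finalize x.x.x.x x.x.x.x 109 100.0% 2854310892 100.0%
example-app1-2018.04.07 0 52.2s peer index x.x.x.x x.x.x.x 149 91.9% 3969581277 27.4%
"""

for rec in active_recoveries(sample_recovery):
    print(rec["i"], rec["s"], rec["st"], rec["bp"])
```

Filtering on the stage column rather than on the whole line avoids accidentally hiding an index whose name happens to contain the word done.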
We can also query the _recovery API for a single index. With the human parameter, the JSON response includes human readable values and much more detail:
$ curl -s -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/example-app1-2018.04.03/_recovery?human' | python -m json.tool
{
"example-app1-2018.04.03": {
"shards": [
{
"id": 0,
"index": {
"files": {
"percent": "100.0%",
"recovered": 103,
"reused": 0,
"total": 103
},
"size": {
"percent": "100.0%",
"recovered": "3.6gb",
"recovered_in_bytes": 3926167091,
"reused": "0b",
"reused_in_bytes": 0,
"total": "3.6gb",
"total_in_bytes": 3926167091
},
"source_throttle_time": "2m",
"source_throttle_time_in_millis": 121713,
"target_throttle_time": "2.1m",
"target_throttle_time_in_millis": 126170,
"total_time": "7.2m",
"total_time_in_millis": 434142
},
"primary": true,
"source": {
"host": "x.x.x.x",
"id": "ZRzKGe5WSg2SzilZGb3RbA",
"ip": "x.x.x.x",
"name": "ZRzKGe5",
"transport_address": "x.x.x.x:9300"
},
"stage": "DONE",
"start_time": "2018-04-10T19:26:48.668Z",
"start_time_in_millis": 1523388408668,
"stop_time": "2018-04-10T19:34:04.980Z",
"stop_time_in_millis": 1523388844980,
"target": {
"host": "x.x.x.x",
"id": "x2mKoe_GTpe3b1CnXOKisA",
"ip": "x.x.x.x",
"name": "x2mKoe_",
"transport_address": "x.x.x.x:9300"
},
"total_time": "7.2m",
"total_time_in_millis": 436311,
"translog": {
"percent": "100.0%",
"recovered": 0,
"total": 0,
"total_on_start": 0,
"total_time": "1.1s",
"total_time_in_millis": 1154
},
"type": "PEER",
"verify_index": {
"check_index_time": "0s",
"check_index_time_in_millis": 0,
"total_time": "0s",
"total_time_in_millis": 0
}
},
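The JSON response is quite verbose, so it can help to reduce it to one line per shard. Below is a small sketch of how that could look. The function name summarize_recovery is an assumption for this example, and the sample dictionary is a trimmed copy of the response above that keeps only the fields being read.

```python
def summarize_recovery(recovery):
    """Flatten an index `_recovery` response into one tuple per shard.

    Hypothetical helper. Each tuple holds: index name, shard id, stage,
    size percent, source node name, target node name.
    """
    summary = []
    for index_name, body in recovery.items():
        for shard in body.get("shards", []):
            summary.append((
                index_name,
                shard["id"],
                shard["stage"],
                shard["index"]["size"]["percent"],
                shard["source"]["name"],
                shard["target"]["name"],
            ))
    return summary


# Trimmed copy of the response shown above (only the fields used here)
sample_response = {
    "example-app1-2018.04.03": {
        "shards": [
            {
                "id": 0,
                "stage": "DONE",
                "index": {"size": {"percent": "100.0%"}},
                "source": {"name": "ZRzKGe5"},
                "target": {"name": "x2mKoe_"},
            }
        ]
    }
}

for row in summarize_recovery(sample_response):
    print(*row)
```

With real data you would pass in the parsed body of the curl call, for example via json.loads.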
The Cluster Health API:
Amazon restricts most of the /_cluster API actions, but we can still query the health endpoint, which shows number_of_nodes, active_shards, relocating_shards, number_of_pending_tasks, etc.:
$ curl -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/_cluster/health?pretty'
{
"cluster_name" : "0123456789012:example-elasticsearch-cluster-6",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 16,
"number_of_data_nodes" : 10,
"active_primary_shards" : 803,
"active_shards" : 1606,
"relocating_shards" : 10,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
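Instead of re-running the command by hand, the health endpoint can be polled until the cluster settles. The sketch below is one possible way to do that with only the Python standard library. The names fetch_health and wait_for_update, the expected_nodes parameter and the polling defaults are all assumptions for this example. It also assumes the domain allows unauthenticated requests from your IP, like the curl calls in this post. Treating "no relocating shards and the node count back to normal" as finished is a heuristic, not something Amazon documents as the completion signal.

```python
import json
import time
import urllib.request


def fetch_health(endpoint):
    """GET <endpoint>/_cluster/health and return the parsed JSON body."""
    url = endpoint.rstrip("/") + "/_cluster/health"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def wait_for_update(get_health, expected_nodes, interval=30, max_polls=120,
                    sleep=time.sleep):
    """Poll until no shards relocate and the node count is back to normal.

    get_health: zero-argument callable returning the health document.
    expected_nodes: node count you expect once the old nodes are gone
                    (an assumption you supply; it is not read from the API).
    """
    for _ in range(max_polls):
        health = get_health()
        print(health["status"], health["number_of_nodes"],
              health["relocating_shards"])
        if (health["relocating_shards"] == 0
                and health["number_of_nodes"] == expected_nodes):
            return health
        sleep(interval)
    raise TimeoutError("cluster did not settle within the polling window")


# Example usage (placeholder endpoint, replace with your own domain):
# endpoint = "https://search-your-domain.eu-west-1.es.amazonaws.com"
# wait_for_update(lambda: fetch_health(endpoint), expected_nodes=8)
```

Passing the fetch function in as an argument keeps the loop easy to try out with canned responses before pointing it at a real cluster.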
The Pending Tasks API:
We can also get some insight from the /_cat/pending_tasks API:
$ curl -s -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/_cat/pending_tasks?v'
insertOrder timeInQueue priority source
1757 53ms URGENT shard-started shard id [[network-metrics-2018.04.13][0]], allocation id [Qh91o_OGRX-lFnY8KxYgQw], primary term [0], message [after peer recovery]
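When the queue gets longer, a count per priority is easier to read than the raw list. Here is a minimal sketch, assuming the four-column layout shown above. The helper name pending_by_priority is made up for this example.

```python
from collections import Counter


def pending_by_priority(cat_pending_text):
    """Count pending tasks per priority from `_cat/pending_tasks?v` text.

    Hypothetical helper. Assumed columns: insertOrder timeInQueue priority
    source, where only the last column (source) may contain spaces.
    """
    counts = Counter()
    lines = [line for line in cat_pending_text.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        parts = line.split(None, 3)  # keep the free-text source in one piece
        if len(parts) >= 3:
            counts[parts[2]] += 1
    return counts


# Copied from the output shown above
sample_pending = """\
insertOrder timeInQueue priority source
1757 53ms URGENT shard-started shard id [[network-metrics-2018.04.13][0]], allocation id [Qh91o_OGRX-lFnY8KxYgQw], primary term [0], message [after peer recovery]
"""

print(dict(pending_by_priority(sample_pending)))
```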
Resources:
- https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-recovery.html#cat-recovery
- https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-recovery.html