How to Monitor an Amazon Elasticsearch Service Cluster Update Process

When you make a configuration change on Amazon's Elasticsearch Service, it does a blue/green deployment. New nodes are allocated to the cluster (which you will notice in CloudWatch when looking at the node metrics). Once these nodes are deployed, data gets copied across to them and traffic gets directed to them. Once that is done, the old nodes are terminated.

Note: While there will be more nodes in the cluster, you will not get billed for the extra nodes.

While this process is running, you can monitor your cluster to follow its progress:

The Shards API:

Using the /_cat/shards API, you will find shards in a RELOCATING state while the change is still in progress:

$ curl -s -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/_cat/shards?v' | grep -v 'STARTED'
index                                   shard prirep state         docs    store ip            node
example-app1-2018.02.23                 4     r      RELOCATING  323498 1018.3mb x.x.x.x x2mKoe_ -> x.x.x.x GyNiRJyeSTifN_9JZisGuQ GyNiRJy
example-app1-2018.02.28                 2     p      RELOCATING  477609    1.5gb x.x.x.x x2mKoe_ -> x.x.x.x sOihejw1SrKtag_LO1RGIA sOihejw
example-app1-2018.03.01                 3     r      RELOCATING  463143    1.5gb x.x.x.x  ZZfv-Ha -> x.x.x.x jOchdCZWQq-TAPZNTadNoA jOchdCZ
fortinet-syslog-2018.02                 0     p      RELOCATING 1218556  462.2mb x.x.x.x  moQA57Y -> x.x.x.x sOihejw1SrKtag_LO1RGIA sOihejw
example-app1-2018.03.23                 3     r      RELOCATING  821254    2.4gb x.x.x.x  moQA57Y -> x.x.x.x GyNiRJyeSTifN_9JZisGuQ GyNiRJy
example-app1-2018.04.02                 2     p      RELOCATING 1085279    3.4gb x.x.x.x x2mKoe_ -> x.x.x.x jOchdCZWQq-TAPZNTadNoA jOchdCZ
example-app1-2018.02.08                 3     p      RELOCATING  136321    125mb x.x.x.x ZUZSFWu -> x.x.x.x tyU_V_KLS5mZXEwnF-YEAQ tyU_V_K
fortinet-syslog-2018.04                 4     r      RELOCATING 7513842    2.8gb x.x.x.x  ZZfv-Ha -> x.x.x.x il1WsroNSgGmXJugds_aMQ il1Wsro
example-app1-2018.04.09                 1     r      RELOCATING 1074581    3.5gb x.x.x.x  ZRzKGe5 -> x.x.x.x il1WsroNSgGmXJugds_aMQ il1Wsro
example-app1-2018.04.09                 0     p      RELOCATING 1074565    3.5gb x.x.x.x  moQA57Y -> x.x.x.x tyU_V_KLS5mZXEwnF-YEAQ tyU_V_K
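On a cluster with hundreds of shards, a per-state count can be easier to follow than the full listing. The following is a minimal sketch, not part of the original workflow: the function name count_shard_states and the ES_ENDPOINT variable are made up for illustration, and it assumes the output was requested with h=index,shard,prirep,state so that the state is always the fourth column.

```shell
# Hypothetical helper: reads "index shard prirep state" rows on stdin
# and prints one "STATE count" line per shard state, sorted by state name.
count_shard_states() {
  awk 'NF >= 4 { count[$4]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage (ES_ENDPOINT is an assumed variable holding your domain endpoint):
#   curl -s "$ES_ENDPOINT/_cat/shards?h=index,shard,prirep,state" | count_shard_states
```

Running it every so often should show the RELOCATING count shrinking as the deployment progresses.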

The Recovery API:

We can then use the /_cat/recovery API, which shows the progress of the shards transferring to the other nodes. It returns the following columns:

index, shard, time, type, stage, source_host, target_host, files, files_percent, bytes, bytes_percent

As Amazon masks the node IP addresses, the IPs are not available. To make the output more readable, we only request the columns that we are interested in and hide the shards whose stage is done:

$ curl -s -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/_cat/recovery?v&h=i,s,t,ty,st,shost,thost,f,fp,b,bp' | grep -v 'done'
i                                       s t     ty          st       shost         thost         f   fp     b          bp
example-app1-2018.04.11                 1 2m    peer        index    x.x.x.x x.x.x.x  139 97.1%  3435483673 65.9%
web-syslog-2018.04                 4 7.6m  peer        finalize x.x.x.x x.x.x.x  109 100.0% 2854310892 100.0%
example-app1-2018.04.16                 3 2.9m  peer        translog x.x.x.x x.x.x.x  130 100.0% 446180036  100.0%
example-app1-2018.03.30                 3 2.1m  peer        index    x.x.x.x  x.x.x.x  127 97.6%  3862498583 62.5%
example-app1-2018.04.01                 0 4.4m  peer        index    x.x.x.x  x.x.x.x  140 99.3%  3410543270 87.9%
example-app1-2018.04.06                 0 5.1m  peer        index    x.x.x.x x.x.x.x  128 97.7%  4291421948 66.3%
example-app1-2018.04.07                 0 52.2s peer        index    x.x.x.x x.x.x.x 149 91.9%  3969581277 27.4%
network-capture-2018.04.01               2 11.4s peer        index    x.x.x.x  x.x.x.x 107 95.3%  359987163  55.0%
example-app1-2018.03.17                 1 1.7m  peer        index    x.x.x.x  x.x.x.x 117 98.3%  2104196548 74.5%
example-app1-2018.02.25                 3 58.4s peer        index    x.x.x.x  x.x.x.x 102 98.0%  945437614  74.7%
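One caveat with grep -v 'done' is that it drops any line containing the text "done" anywhere, including in an index name. A slightly stricter variant, sketched below under the assumption that the same h=i,s,t,ty,st,shost,thost,f,fp,b,bp column list is used with v (so the stage is the fifth column and the first row is the header), filters on the stage column only. The function name active_recoveries is made up for illustration.

```shell
# Hypothetical helper: reads /_cat/recovery output (with header) on stdin.
# Keeps the header row and every row whose stage column (5th field) is not "done".
active_recoveries() {
  awk 'NR == 1 || $5 != "done"'
}

# Usage (ES_ENDPOINT is an assumed variable holding your domain endpoint):
#   curl -s "$ES_ENDPOINT/_cat/recovery?v&h=i,s,t,ty,st,shost,thost,f,fp,b,bp" | active_recoveries
```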

We can also query the /_recovery API for a specific index, which returns JSON with much more detail. The human parameter adds human-readable values (such as 3.6gb) alongside the raw numbers:

$ curl -s -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/example-app1-2018.04.03/_recovery?human' | python -m json.tool
{
    "example-app1-2018.04.03": {
        "shards": [
            {
                "id": 0,
                "index": {
                    "files": {
                        "percent": "100.0%",
                        "recovered": 103,
                        "reused": 0,
                        "total": 103
                    },
                    "size": {
                        "percent": "100.0%",
                        "recovered": "3.6gb",
                        "recovered_in_bytes": 3926167091,
                        "reused": "0b",
                        "reused_in_bytes": 0,
                        "total": "3.6gb",
                        "total_in_bytes": 3926167091
                    },
                    "source_throttle_time": "2m",
                    "source_throttle_time_in_millis": 121713,
                    "target_throttle_time": "2.1m",
                    "target_throttle_time_in_millis": 126170,
                    "total_time": "7.2m",
                    "total_time_in_millis": 434142
                },
                "primary": true,
                "source": {
                    "host": "x.x.x.x",
                    "id": "ZRzKGe5WSg2SzilZGb3RbA",
                    "ip": "x.x.x.x",
                    "name": "ZRzKGe5",
                    "transport_address": "x.x.x.x:9300"
                },
                "stage": "DONE",
                "start_time": "2018-04-10T19:26:48.668Z",
                "start_time_in_millis": 1523388408668,
                "stop_time": "2018-04-10T19:34:04.980Z",
                "stop_time_in_millis": 1523388844980,
                "target": {
                    "host": "x.x.x.x",
                    "id": "x2mKoe_GTpe3b1CnXOKisA",
                    "ip": "x.x.x.x",
                    "name": "x2mKoe_",
                    "transport_address": "x.x.x.x:9300"
                },
                "total_time": "7.2m",
                "total_time_in_millis": 436311,
                "translog": {
                    "percent": "100.0%",
                    "recovered": 0,
                    "total": 0,
                    "total_on_start": 0,
                    "total_time": "1.1s",
                    "total_time_in_millis": 1154
                },
                "type": "PEER",
                "verify_index": {
                    "check_index_time": "0s",
                    "check_index_time_in_millis": 0,
                    "total_time": "0s",
                    "total_time_in_millis": 0
                }
            },

The Cluster Health API:

Amazon restricts most of the /_cluster API actions, but the health endpoint is available. It shows the number of nodes, active_shards, relocating_shards, number_of_pending_tasks, etc.:

$ curl -s -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/_cluster/health?pretty'
{
  "cluster_name" : "0123456789012:example-elasticsearch-cluster-6",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 16,
  "number_of_data_nodes" : 10,
  "active_primary_shards" : 803,
  "active_shards" : 1606,
  "relocating_shards" : 10,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
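If you would rather not re-run the health check by hand, the relocating_shards value can be polled in a loop. This is only a rough sketch: the function names, the 30 second default interval, and the sed-based extraction are my own choices, not something the service provides. Note that relocating_shards can also be 0 before relocation has started or between batches, so treat a zero as a hint to go and check the domain status, not as proof that the deployment has finished.

```shell
# Hypothetical helper: reads the /_cluster/health JSON on stdin
# and prints the numeric value of "relocating_shards".
relocating_shards() {
  sed -n 's/.*"relocating_shards"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: polls the health endpoint until relocating_shards is 0.
# $1 = domain endpoint (e.g. https://search-...es.amazonaws.com), $2 = seconds between polls.
wait_for_relocation() {
  endpoint=$1
  interval=${2:-30}
  while :; do
    n=$(curl -s "$endpoint/_cluster/health" | relocating_shards)
    echo "relocating_shards: ${n:-unknown}"
    [ "$n" = "0" ] && break
    sleep "$interval"
  done
}

# Usage (needs network access to the domain):
#   wait_for_relocation 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com' 60
```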

The Pending Tasks API:

The /_cat/pending_tasks API gives some insight into the cluster-level changes that are queued but not yet executed:

$ curl -s -XGET 'https://search-example-elasticsearch-cluster-6-abc123defghijkl5airxticzvjaqy.eu-west-1.es.amazonaws.com/_cat/pending_tasks?v'
insertOrder timeInQueue priority source
1757        53ms URGENT   shard-started shard id [[network-metrics-2018.04.13][0]], allocation id [Qh91o_OGRX-lFnY8KxYgQw], primary term [0], message [after peer recovery]

Thank You

Thanks for reading. Feel free to check out my website, subscribe to my newsletter, or follow me at @ruanbekker on Twitter.

Linktree: https://go.ruan.dev/links
Patreon: https://go.ruan.dev/patreon

Please feel free to show support by sharing this post, making a donation, or subscribing, and reach out to me if you want me to demo and write up any specific tech topic.