Escalator exposes metrics of scaling, requests, capacity, utilisation, nodes and pods at the `/metrics` endpoint.
These metrics can be scraped by a monitoring system such as Prometheus.
It is highly recommended to collect the metrics exposed by Escalator, as they provide helpful insight into how it is operating and are valuable when you need to debug its operation.
You can change which address:port combination the `/metrics` endpoint serves at using the `--address` flag. By default it serves the metrics at `0.0.0.0:8080/metrics`.
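As a minimal sketch, a Prometheus scrape configuration for that endpoint could look like the following. The job name and target address are assumptions for illustration and should match wherever your Escalator deployment is actually reachable:

```yaml
# Illustrative static scrape config; the target below assumes Escalator is
# exposed via a Service named "escalator" in kube-system on the default port 8080.
scrape_configs:
  - job_name: escalator
    metrics_path: /metrics
    static_configs:
      - targets:
          - "escalator.kube-system.svc.cluster.local:8080"
```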
These are the metrics that Escalator exposes, and are subject to change:
- `escalator_run_count`: Number of times the controller has checked for cluster state
- `escalator_node_group_untainted_nodes`: nodes considered by specific node groups that are untainted
- `escalator_node_group_tainted_nodes`: nodes considered by specific node groups that are tainted
- `escalator_node_group_cordoned_nodes`: nodes considered by specific node groups that are cordoned
- `escalator_node_group_nodes`: nodes considered by specific node groups
- `escalator_node_group_pods`: pods considered by specific node groups
- `escalator_node_group_pods_evicted`: pods evicted during a scale down
- `escalator_node_group_mem_percent`: percentage utilisation of memory
- `escalator_node_group_cpu_percent`: percentage utilisation of cpu
- `escalator_node_group_mem_request`: byte value of node memory requests
- `escalator_node_group_cpu_request`: milli value of node cpu requests
- `escalator_node_group_mem_capacity`: byte value of node memory capacity
- `escalator_node_group_cpu_capacity`: milli value of node cpu capacity
- `escalator_node_group_taint_event`: indicates a scale down event
- `escalator_node_group_untaint_event`: indicates a scale up event
- `escalator_node_group_scale_lock`: indicates whether the node group is locked from scaling; zero means unlocked, non-zero means locked
- `escalator_node_group_scale_delta`: indicates the current scale delta
- `escalator_node_group_scale_lock_duration`: histogram of scale lock durations, 60 second buckets from 1 … 30
- `escalator_node_group_scale_lock_check_was_locked`: counter of how many times the lock status was probed and found locked
- `escalator_node_group_node_registration_lag`: histogram of how long nodes take to become registered in Kubernetes after cloud provider instantiation, 60 second buckets from 1 … 30
- `escalator_cloud_provider_min_size`: current cloud provider minimum size
- `escalator_cloud_provider_max_size`: current cloud provider maximum size
- `escalator_cloud_provider_target_size`: current cloud provider target size
- `escalator_cloud_provider_size`: current cloud provider size
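As a sketch of how these metrics might be used for alerting, the following Prometheus alerting rule fires if a node group stays scale-locked for an extended period, based on the `escalator_node_group_scale_lock` gauge described above. The rule name, duration and severity label are illustrative assumptions:

```yaml
groups:
  - name: escalator
    rules:
      - alert: EscalatorScaleLockStuck
        # escalator_node_group_scale_lock is non-zero while a node group is
        # locked from scaling; alert if it remains locked for 30 minutes.
        expr: escalator_node_group_scale_lock != 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: Escalator node group has been scale-locked for more than 30 minutes
```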
Included is an example dashboard in `grafana-dashboard.json` for use within Grafana. It provides an overview of what Escalator is currently doing and what it has done over time.
This is exceptionally helpful when debugging issues with scaling up/down as it shows the overall utilisation for the
node group at a specific time or over time.
Below is a screenshot of the included dashboard:
It is highly recommended to monitor and graph the two utilisation metrics (`escalator_node_group_mem_percent` and `escalator_node_group_cpu_percent`) as this will let you see the utilisation that Escalator calculates. Ideally these values should stay below your scale up threshold; a sketch of an alert on that condition follows.
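For example, a Prometheus alerting rule over these two gauges could warn when calculated utilisation sits above the scale-up threshold for a sustained period. The 70% value below is an assumption for illustration and should be replaced with your node group's configured scale-up threshold:

```yaml
groups:
  - name: escalator-utilisation
    rules:
      - alert: EscalatorUtilisationAboveScaleUpThreshold
        # Fires when Escalator's calculated cpu or memory utilisation stays
        # above the assumed 70% scale-up threshold for 15 minutes.
        expr: escalator_node_group_cpu_percent > 70 or escalator_node_group_mem_percent > 70
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Escalator calculated utilisation is above the assumed 70% scale-up threshold
```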