From 8aa8285f4a58c147a1ab1da579a7bf750fc95d0f Mon Sep 17 00:00:00 2001
From: togashidm
Date: Mon, 24 Jan 2022 09:20:31 +0000
Subject: [PATCH] Add the Strategy Labeling example doc

---
 telemetry-aware-scheduling/README.md               |   2 +-
 .../docs/strategy-labeling-example.md              | 407 ++++++++++++++++++
 2 files changed, 408 insertions(+), 1 deletion(-)
 create mode 100644 telemetry-aware-scheduling/docs/strategy-labeling-example.md

diff --git a/telemetry-aware-scheduling/README.md b/telemetry-aware-scheduling/README.md
index 501f2fe0..4d1ef0cc 100644
--- a/telemetry-aware-scheduling/README.md
+++ b/telemetry-aware-scheduling/README.md
@@ -182,7 +182,7 @@ There can be four strategy types in a policy file and rules associated with each
 The above rules would create label `telemetry.aware.scheduling.scheduling-policy/foo=1` when `node_metric_1` is greater than `node_metric_2` and also greater than 100.
 If instead `node_metric_2` would be greater than `node_metric_1` and also greater than 100, the produced label would be `telemetry.aware.scheduling.scheduling-policy/foo=2`.
 If neither metric would be greater than 100, no label would be created. When there are multiple candidates with equal values, the resulting label is
- random among the equal candidates. Label cleanup happens automatically.
+ random among the equal candidates. Label cleanup happens automatically. An example of the labeling strategy can be found [here](docs/strategy-labeling-example.md).
 
 dontschedule and deschedule - which incorporate multiple rules - function with an OR operator. That is if any single rule is broken the strategy is considered violated.
 Telemetry policies are namespaced, meaning that under normal circumstances a workload can only be associated with a pod in the same namespaces.
diff --git a/telemetry-aware-scheduling/docs/strategy-labeling-example.md b/telemetry-aware-scheduling/docs/strategy-labeling-example.md
new file mode 100644
index 00000000..fcfbe131
--- /dev/null
+++ b/telemetry-aware-scheduling/docs/strategy-labeling-example.md
@@ -0,0 +1,407 @@
+# Labeling Strategy Example
+This guide shows how to implement a labeling strategy in a TAS policy and how to run an application under that policy.
+The [Health metric demo](https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/docs/health-metric-example.md#setting-the-health-metric) shows how the *deschedule* strategy works when policy rules are violated: it marks the node with the pre-defined label "violating". This allows the K8s Descheduler to evict all the pods on that node. The *deschedule* strategy, however, offers no flexibility in the labels that can be used to mark the node.
+Adding *labeling* to the available strategies gives the flexibility to go beyond a single fixed key:value pair such as "policyName: violating". The *labeling* strategy gives extra support to pods/workloads that require specific physical devices/resources on a node. This is achieved by linking the policy rules and the labeling capability to specific node resources and to the evaluation of their metric values.
+This demo shows a simple case in which a node is labeled with a customized label when a policy rule is broken. We also verify that Pods running on nodes whose metrics no longer obey the policy rules are evicted by the K8s Descheduler.
+This guide requires at least two worker nodes in a Kubernetes cluster, user-level access to that cluster, and running [TAS](https://github.com/intel/platform-aware-scheduling/tree/master/telemetry-aware-scheduling#deploy-tas) and [custom metrics](https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/docs/custom-metrics.md#quick-install) pipelines.
+
+### Setting the metrics
+The metrics scraped by node-exporter are set per worker node: add the metric name and value to /tmp/node-metrics/metrics.prom on each worker node.
+
+On node1:
+````node_metric_card0 200````
+
+On node2:
+````node_metric_card1 90````
+
+The values can be changed with a shell script like the one used in the [health metric demo](https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/docs/health-metric-example.md#setting-the-health-metric). Note that the script assumes a user with write access to /tmp/ and will require a password, or an SSH key, to log into the nodes and set the metric.
+Any change to the value in a node's file is read by the Prometheus Node Exporter and propagates through the metrics pipeline until it is accessible to TAS.
+If the custom metrics API is picking up the metric properly, the following command returns it for metric_card1:
+
+````kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/nodes/*/metric_card1" | jq .````
+
+Note that it may take some time for the metric to be scraped initially.
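+
+The following sketch assumes that node-exporter reads /tmp/node-metrics/ through its textfile collector, which only parses files whose names end in `.prom`. Under that assumption, the complete file on node1 could look like this. The `# HELP` text is a hypothetical placeholder, and both comment lines are optional:
+
+````
+# HELP node_metric_card0 Hypothetical description of the demo metric on node1.
+# TYPE node_metric_card0 gauge
+node_metric_card0 200
+````
+
+On node2, only the metric name and value differ (`node_metric_card1 90`). Under the same assumption, you can avoid the exporter reading a half-written file. Write the new content to a temporary file in the same directory, for example `metrics.prom.tmp`, which the collector ignores. Then rename it over `metrics.prom` with `mv`.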
+
+
+### Deploy a Telemetry Policy
+
+````
+cat <
+demo-app-label-68b4b587f9-jlt2v 0/1 Pending 12s
+demo-app-label-68b4b587f9-mtcjh 0/1 Pending 12s
+demo-app-label-68b4b587f9-629qf 0/1 Pending 12s
+demo-app-label-68b4b587f9-bcd5l 0/1 Pending 12s
+demo-app-label-68b4b587f9-sdzmc 1/1 Running 2m15s node1
+demo-app-label-68b4b587f9-8xgjd 1/1 Running 2m15s node1
+demo-app-label-68b4b587f9-99bwb 1/1 Running 2m15s node1
+demo-app-label-68b4b587f9-gftzs 1/1 Running 2m15s node1
+demo-app-label-68b4b587f9-2g96q 1/1 Running 2m15s node1
+demo-app-label-68b4b587f9-hw88m 1/1 Running 5m13s node2
+demo-app-label-68b4b587f9-gjh2n 1/1 Running 5m13s node2
+demo-app-label-68b4b587f9-8sxv7 1/1 Running 5m13s node2
+demo-app-label-68b4b587f9-w69hb 1/1 Running 5m24s node2
+demo-app-label-68b4b587f9-xjll4 1/1 Running 5m13s node2
+````
+
+Once the metric of a given node changes and the node returns to a schedulable condition, i.e., it no longer violates the labeling rules, the pending workloads will be scheduled to run on that node.
+
+
+### Descheduler
+[Kubernetes Descheduler](https://github.com/kubernetes-sigs/descheduler) allows control over the eviction of pods that are already bound to a node. Based on its policy, the Descheduler finds pods that can be moved and evicts them. There are many ways to install and run the K8s [Descheduler](https://github.com/kubernetes-sigs/descheduler#quick-start). Here, we run it as a [deployment](https://github.com/kubernetes-sigs/descheduler#run-as-a-deployment).
+In a shell terminal, deploy the Descheduler files:
+
+````
+kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/descheduler/master/kubernetes/base/rbac.yaml
+````
+````
+kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/descheduler/master/kubernetes/deployment/deployment.yaml
+````
+````
+cat <