
Endless nodes are created after expireAfter elapses on a node in some scenarios #1842

Open
otoupin-nsesi opened this issue Nov 25, 2024 · 6 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments


otoupin-nsesi commented Nov 25, 2024

Description

Observed Behavior:

After expireAfter elapses on a node, its pods start getting evicted, and endless new nodes are created to try to schedule those pods. Also, pods that don't have PDBs are NOT evicted.

Expected Behavior:

After expireAfter elapses on a node, its pods start getting evicted, and at most one new node is created to schedule them. Pods that don't have PDBs are evicted. There may be the odd pod whose PDB prevents the node from being recycled, but in that case we can set terminationGracePeriod.
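For reference, a minimal NodePool fragment showing where these knobs sit (a sketch only, assuming the Karpenter v1 API where expireAfter and terminationGracePeriod both live under spec.template.spec; the name and durations are illustrative):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: dev                      # illustrative
spec:
  template:
    spec:
      expireAfter: 24h           # nodes are forcefully expired after this duration
      terminationGracePeriod: 2h # upper bound on how long drain can stay blocked (e.g. by a PDB)
      # nodeClassRef and requirements omitted for brevity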

Reproduction Steps:

  1. Have one CloudNativePG database in the cluster (or a similar workload: a single replica with a PDB). A minimal manifest sketch follows this list.
  2. CloudNativePG will add a PDB to the primary.
  3. Have a NodePool with a relatively short expiry (expireAfter). In our case our dev environments are set to 24h, so we caught this early.
  4. Once a node expires, a weird behaviour is triggered.
    1. As expected, in v1 expiries are now forceful, so Karpenter begins to evict the pods.
    2. As expected, a new node is spun up to take up the slack.
    3. But then the problems start:
      1. Since there is a PDB on a single replica (there is only one PG primary at a time), eviction does not happen. So far so good (this is also the old behaviour: in v0.37.x the node simply can't expire until we restart the database manually or kill the primary).
      2. However, any other pods on this node are not evicted either, although the documentation and the log messages suggest they should be.
      3. The new node from earlier is nominated for those pods, but they never move to that node, because they are never evicted.
      4. Then, at the next batch of pod scheduling, we get "found provisionable pod(s)" again, and a new nodeclaim is added (for the same pods as before).
      5. And again
      6. And again
      7. And again
    4. So we end up with a lot of unused nodes, containing only daemonsets and new workloads.
  5. At that point I restart the database, the primary moves, the PDB no longer blocks eviction, and everything slowly heals. However, there was no sign of the "infinite nodeclaim creation" ever ending on its own.
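To make step 1 concrete, here is a minimal stand-in for the CloudNativePG primary (a sketch only: a single-replica Deployment plus a PDB with minAvailable: 1; all names and the image are illustrative), paired with a NodePool fragment like the one shown above under Expected Behavior:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: single-replica-db        # stand-in for the CNPG primary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: single-replica-db
  template:
    metadata:
      labels:
        app: single-replica-db
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: single-replica-db
spec:
  minAvailable: 1                # with a single replica, eviction is always blocked
  selector:
    matchLabels:
      app: single-replica-db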

We believe this is a bug. We couldn't find a workaround (aside from removing expireAfter), and have reverted to the v0.37.x series for now.

A few clues:
The state of the cluster 30-45 minutes after expiry: node 53-23 is the one that expired, and any node younger than 30 minutes is running mostly empty (aside from daemonsets).

[Screenshot: node-create-hell-clean]

On the expired node, the pods are nominated to be scheduled on a different node, but as you can see that never happens.

NOTE: I don't recall 100% whether this screenshot shows the CloudNativePG primary itself or one of its neighbouring pods, but I believe it's the primary.

[Screenshot: node-should-schedule]

And finally, the log line that appears after every scheduling event, saying it found provisionable pod(s); each of these precedes a new, unnecessary nodeclaim:

karpenter-5d967c944c-k8xb8 {"level":"INFO","time":"2024-11-13T22:47:24.148Z","logger":"controller","message":"found provisionable pod(s)","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"7c981fa7-3071-4de8-87b3-370a15664ba7","Pods":"monitoring/monitoring-grafana-pg-1, kube-system/coredns-58745b69fb-sd222, cnpg-system/cnpg-cloudnative-pg-7667bd696d-lrqvb, kube-system/aws-load-balancer-controller-74b584c6df-fckdn, harbor/harbor-container-webhook-78657f5698-kmmrz","duration":"87.726672ms"}

Versions:

  • Chart Version: 1.0.7
  • Kubernetes Version (kubectl version): v1.29.10

Extra:

  • I would like to build / modify a test case to prove / diagnose this behaviour; any pointers? I've looked at the source code, but I wanted to post this report first to gather feedback.
  • Is there any other workaround aside from disabling expireAfter on the NodePool? (A sketch of that workaround follows this list.)
  • Finally, in our context this bug is triggered by CloudNativePG primaries, but it would apply to any workload with a single replica and a PDB with minAvailable: 1.
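The only workaround we found so far, sketched here (this assumes the v1 NodePool field accepts the literal value Never to disable expiry):

spec:
  template:
    spec:
      expireAfter: Never         # disables node expiry entirely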
@otoupin-nsesi otoupin-nsesi added the kind/bug Categorizes issue or PR as related to a bug. label Nov 25, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 25, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@danielloader

@jonathan-innis
Member

Can you share the PDB that you are using and the StatefulSet/Deployment? From looking at the other thread, it sounds like there may be something else blocking Karpenter from performing the eviction that needs to be integrated with Karpenter's down-scaling logic.

@jonathan-innis
Member

/triage needs-information

@sidewinder12s

sidewinder12s commented Dec 10, 2024

From the linked issue it sounds like they configured a PDB which will forever block pod termination.

I think I am also seeing similar behaviour with the do-not-evict annotation on pods blocking pod termination. You can observe something similar by running Karpenter with a deployment that has a topologySpreadConstraint, around 15 replicas, and an expireAfter period of around 10m (sketched below).

I'm using v1.0.8
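Roughly the setup I mean, sketched here (all names and the image are illustrative; the karpenter.sh/do-not-disrupt pod annotation is, as far as I know, the v1 successor to the older do-not-evict annotation), combined with expireAfter: 10m on the NodePool:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-test
spec:
  replicas: 15
  selector:
    matchLabels:
      app: spread-test
  template:
    metadata:
      labels:
        app: spread-test
      annotations:
        karpenter.sh/do-not-disrupt: "true"   # v1 name; older releases used karpenter.sh/do-not-evict
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: spread-test
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9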

@sidewinder12s

Actually, I just tested this again: letting Karpenter run in that configuration, with 15 pods blocking node termination, put Karpenter into a bad state where it seemingly could not scale down nodes, logging a lot of this "waiting on cluster sync" message:

{"level":"DEBUG","time":"2024-12-10T23:30:54.287Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"c8f4afc7-527c-491f-bd3c-e73f119dcc30"}
{"level":"DEBUG","time":"2024-12-10T23:30:55.289Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"600f540f-60cb-498b-9b94-c72e4bf5a4d4"}
{"level":"DEBUG","time":"2024-12-10T23:30:56.292Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"7e5a4f3d-c985-40ef-b46b-eca1925ce2ee"}
{"level":"DEBUG","time":"2024-12-10T23:30:57.294Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"9719c8af-4cbd-4a01-bd81-8c57c5a8c482"}
{"level":"DEBUG","time":"2024-12-10T23:30:58.296Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"63b7771e-b55e-4b41-a313-cac4d9ebc53a"}
{"level":"DEBUG","time":"2024-12-10T23:30:58.644Z","logger":"controller","caller":"reconcile/reconcile.go:142","message":"deleting expired nodeclaim","commit":"a2875e3","controller":"nodeclaim.expiration","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"use1-test01-default-spot-kjsx9"},"namespace":"","name":"use1-test01-default-spot-kjsx9","reconcileID":"3224ba1b-82e6-4989-b77f-ea08e798ba2c"}
{"level":"DEBUG","time":"2024-12-10T23:30:59.080Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"provisioner","namespace":"","name":"","reconcileID":"222b8e98-b936-4e43-b696-f8a38ab4f78d"}
{"level":"DEBUG","time":"2024-12-10T23:30:59.297Z","logger":"controller","caller":"singleton/controller.go:26","message":"waiting on cluster sync","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"97b9d15d-133c-4d6e-991d-7b688a48e2ef"}
