
Ensure Karpenter Creates New NodeClaim Before Deleting Existing Node for Consolidation #1879

Open
EdwinPhilip opened this issue Dec 13, 2024 · 1 comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

EdwinPhilip commented Dec 13, 2024

Description

What problem are you trying to solve?
Currently, when consolidating underutilized or empty nodes, Karpenter does not create new NodeClaims before initiating the deletion or draining of existing nodes. This can lead to a temporary loss of capacity, causing disruptions to workloads, especially when there are no spare nodes available in the cluster.

This behavior poses challenges for workloads that require high availability or have strict scheduling constraints, as pods may remain in a pending state until new nodes are provisioned.

Proposed Behavior
When consolidating nodes, Karpenter should:

  • Preemptively create a new NodeClaim to ensure sufficient capacity is available.
  • Wait for the new node to reach the Ready state before initiating the deletion or draining of the existing node.
  • Provide a configurable option (e.g., waitForReadyBeforeConsolidation) to enable or disable this behavior, allowing users to choose between faster consolidation and safer capacity transitions.

Use Case

  • Workload Impact: High-availability applications or latency-sensitive workloads can experience disruptions during consolidation if nodes are deleted before replacements are ready.
  • Cluster Stability: In clusters with minimal buffer capacity, this behavior can lead to pending pods and degraded service availability.
  • Spot Instances: Spot interruptions are unpredictable; draining a node before a replacement is available leaves its pods unschedulable and can degrade the application.

Steps to Reproduce

  • Deploy a Karpenter-managed cluster with minimal buffer capacity (a minimal NodePool sketch is shown after this list).
  • Trigger a consolidation event where underutilized nodes are identified for deletion.
  • Observe that Karpenter deletes nodes before ensuring new nodes are provisioned and ready.
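
As a hedged illustration of the first step, here is a minimal NodePool sketch with consolidation enabled, assuming the Karpenter v1 API and an AWS EC2NodeClass; the names, requirements, and nodeClassRef values are placeholders, not part of this report:

```yaml
# Minimal NodePool sketch (Karpenter v1 API) with consolidation enabled.
# All names and values below are illustrative.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: example
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized  # consolidate both empty and underutilized nodes
    consolidateAfter: 30s                          # act shortly after nodes become consolidatable
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: example
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```

With a NodePool like this and little spare capacity in the cluster, a consolidation decision that removes a node before its replacement is Ready leaves the displaced pods pending.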

Expected Behavior

  • New nodes should be provisioned and become Ready before existing nodes are deleted during consolidation.
  • Workloads should remain unaffected during node consolidation events.

Potential Solutions

  • Introduce a configuration option like waitForReadyBeforeConsolidation: true at the NodePool level to enable this behavior (see the sketch below).
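
A hypothetical sketch of where such a field could live in the NodePool spec; waitForReadyBeforeConsolidation does not exist in Karpenter today, and its name and placement under spec.disruption are assumptions taken from this proposal:

```yaml
# Hypothetical sketch only: waitForReadyBeforeConsolidation is the option
# proposed in this issue, not an existing Karpenter API field.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: example
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    waitForReadyBeforeConsolidation: true  # proposed: replacement NodeClaim must be Ready before drain/delete
```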

How important is this feature to you?
This feature would enhance cluster stability and workload availability, making Karpenter a more robust solution for production environments.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@EdwinPhilip EdwinPhilip added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 13, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 13, 2024
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
