
Partitioned NodePool Multi-node Consolidation #853

Closed
jashandeep-sohi opened this issue Dec 8, 2023 · 9 comments · May be fixed by #1547
Labels
  • consolidation
  • deprovisioning: Issues related to node deprovisioning
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • performance: Issues relating to performance (memory usage, cpu usage, timing)
  • v1.x: Issues prioritized for post-1.0

Comments

@jashandeep-sohi

Description

Observed Behavior:

I have a few NodePools that I'm using in a "partitioned" manner. Basically, each NodePool is made independent with user-defined requirements and taints, and Pods in different namespaces use different NodePools.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: one
spec:
  template:
    spec:
      taints:
        - key: example.com/partition-key
          value: one
          effect: NoSchedule
      requirements:
        - key: example.com/partition-key
          operator: In
          values: ["one"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: two
spec:
  template:
    spec:
      taints:
        - key: example.com/partition-key
          value: two
          effect: NoSchedule
      requirements:
        - key: example.com/partition-key
          operator: In
          values: ["two"]

This works fine for the most part, but I'm observing issues with multi-node consolidation.

As far as I can tell, multi-node consolidation looks at all deprovisionable Nodes together:
https://github.com/kubernetes-sigs/karpenter/blob/cc54b340f630b46a26d19a3cbd49d90c8b3a6d45/pkg/controllers/disruption/multinodeconsolidation.go#L44C42-L44C42

I think this means there's effectively no multi-node consolidation happening (or it's sub-optimal at best). Shouldn't this be done independently on groups of compatible NodePools?

Another place where I think this is a problem is the scheduling simulation, which considers all Pending Pods:
https://github.com/kubernetes-sigs/karpenter/blob/cc54b340f630b46a26d19a3cbd49d90c8b3a6d45/pkg/controllers/disruption/helpers.go#L97C35-L97C35

But if one of those Pending Pods is not compatible with the firstN candidates chosen from across all Nodes, the simulation will always complain about unschedulable Pods (which becomes highly likely as the number of nodes/partitions increases).

Expected Behavior:

NodePools should be consolidated in groups, computed either from their requirements or from some configurable partition key.

Reproduction Steps (Please include YAML): See above

Versions:

  • Chart Version: v0.33.0
  • Kubernetes Version (kubectl version): 1.27

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
jashandeep-sohi added the kind/bug and needs-triage labels Dec 8, 2023
@jonathan-innis
Member

This works fine for the most part, but I'm observing issues with multi-node consolidation

We've talked about this a little bit for deprovisioning improvements, for the exact reason you called out: if we order candidates without considering NodePool boundaries, we can get stuck when those boundaries are independent of each other. We could perhaps try another form of multi-node consolidation where we perform n separate multi-node consolidations, where n is the number of NodePools on the cluster.

Also, I'm curious: are you still seeing single-node consolidation? I would expect that if we were failing to multi-node consolidate, we would still attempt to single-node consolidate if there are nodes available.

jonathan-innis added the performance, kind/feature, and v1.x labels and removed the needs-triage and kind/bug labels Dec 9, 2023
@jashandeep-sohi
Author

Also, I'm curious: are you still seeing single-node consolidation? I would expect that if we were failing to multi-node consolidate, we would still attempt to single-node consolidate if there are nodes available.

I remember this being an issue on an older version, but I'm not seeing it get blocked on single-node consolidation anymore. I think it previously would just short-circuit if there were any Pods in the Pending state.

We could perhaps try another form of multi-node consolidation where we perform n separate multi-node consolidations, where n is the number of NodePools on the cluster.

But what if there are 2+ NodePools in each "partition"? For example, with different NodePool.spec.weight values to achieve some kind of priority within a partition (see the sketch below). Would consolidating each NodePool independently still work? I would think it has to be done on groups of compatible NodePools.
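
For concreteness, a minimal sketch of what I mean (names and weights are illustrative): two NodePools serving the same partition, where spec.weight gives one of them priority during provisioning:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: one-primary    # illustrative
spec:
  weight: 100          # preferred NodePool within partition "one"
  template:
    spec:
      taints:
        - key: example.com/partition-key
          value: one
          effect: NoSchedule
      requirements:
        - key: example.com/partition-key
          operator: In
          values: ["one"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: one-fallback   # illustrative
spec:
  weight: 10           # lower-priority fallback for the same partition
  template:
    spec:
      taints:
        - key: example.com/partition-key
          value: one
          effect: NoSchedule
      requirements:
        - key: example.com/partition-key
          operator: In
          values: ["one"]

Consolidating either of these NodePools on its own would miss moves that are only valid when both are considered together.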

Has introducing the concept of NodePool groups as an explicit API ever been discussed? Something like a NodePool.spec.group key might make it easier to figure out those boundaries. Figuring them out from NodePool.spec.template.spec.requirements is probably computationally expensive.
Also, it would be nice if, when group is non-nil, the controller auto-magically injected taints, labels, and requirements like karpenter.sh/nodepool-group={group} (sketched below). I guess what I'm asking for is first-class support for partitioning.
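
Purely as a hypothetical sketch (spec.group and the karpenter.sh/nodepool-group key do not exist in the current API):

# Hypothetical API sketch; spec.group is not a real field today.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: one
spec:
  group: one    # hypothetical: declares the partition explicitly
  template:
    spec: {}    # no hand-written taints/requirements needed
# With group set, the controller would inject the equivalent of:
#   taints:
#     - key: karpenter.sh/nodepool-group
#       value: one
#       effect: NoSchedule
#   requirements:
#     - key: karpenter.sh/nodepool-group
#       operator: In
#       values: ["one"]
# and consolidation could use spec.group as the partition boundary.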

njtran added the consolidation and deprovisioning labels Dec 11, 2023
jonathan-innis added the kind/feature label and removed the kind/feature label Dec 26, 2023
@GnatorX

GnatorX commented Mar 7, 2024

#488 raises a similar thought and perf improvement, I think.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jun 5, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 5, 2024
@cnmcavoy
Contributor

I've been looking at multi-node consolidation and arrived at these findings independently. The consolidation occurs across all NodePools, which is not ideal if you have very different NodePool configurations in your cluster. If the NodePools are similar, I could see this being a positive, but that is not how our clusters are designed.

There is one other factor we found that was equally significant, if not more so: multi-node consolidation also does not take node architecture into account. So an amd64 node may be consolidated with an arm64 node. In theory this could succeed if the workloads were all multi-arch compatible, but that is not the case in our workload clusters, so this consolidation also always fails.

So the combination of nodepool mixing + architecture mixing means multi-node consolidation effectively never finds a successful simulation in our clusters.
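
For illustration, the incompatibility comes from the well-known kubernetes.io/arch node label; a sketch of two pools pinned to different architectures (names are illustrative):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: amd64-pool     # illustrative
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: arm64-pool     # illustrative
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]

Candidates drawn from both pools in the same firstN batch can never satisfy single-arch workloads in one simulation, so the attempt reliably fails.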

@cnmcavoy
Contributor

/remove-lifecycle rotten

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot closed this as not planned Aug 11, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
