Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected false Requeue in result.Min #1889

Open
ilyee opened this issue Dec 19, 2024 · 5 comments
Open

Unexpected false Requeue in result.Min #1889

ilyee opened this issue Dec 19, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@ilyee
Copy link

ilyee commented Dec 19, 2024

Description

Observed Behavior:
If nodeclaim's initialized condition is false and registered condition is true, here all the items in results will have a zero RequeueAfter, and this function will throw a result with Requeue is false and RequeueAfter is zero.

func Min(results ...reconcile.Result) (result reconcile.Result) {
min := time.Duration(math.MaxInt64)
for _, r := range results {
if r.IsZero() {
continue
}
if r.RequeueAfter < min {
min = r.RequeueAfter
result.RequeueAfter = min
result.Requeue = true
}
}
return
}

Expected Behavior:
maybe we should set Requeue to true

Reproduction Steps (Please include YAML):

Versions:

  • Chart Version: 0.36.0
  • Kubernetes Version (kubectl version): 1.30.0
@ilyee ilyee added the kind/bug Categorizes issue or PR as related to a bug. label Dec 19, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 19, 2024
@ilyee ilyee closed this as completed Dec 19, 2024
@ilyee ilyee reopened this Dec 19, 2024
@ilyee ilyee changed the title Unexpected MaxInt64 wait time in result.Min Unexpected false Requeue in result.Min Dec 19, 2024
@jmdeal
Copy link
Member

jmdeal commented Dec 19, 2024

Why should this change be made? If initialization is false, that means some condition on the Node has not yet been satisfied (e.g. missing resources, startup-taints present, etc). For those conditions to be satisfied, an update must be performed against the node (out-of-band of Karpenter) which will trigger the reconciler. Is there a case that this doesn't hold true for?

@jmdeal
Copy link
Member

jmdeal commented Dec 19, 2024

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 19, 2024
@ilyee
Copy link
Author

ilyee commented Dec 20, 2024

Is there a case that this doesn't hold true for?

Yes, in fact I'm using kwok as cloud provider to benchmark karpenter, when I'm trying to scale nodeclaims up to 2000 at one time this issue happened, it seems that controller missed node readiness event and the nodeclaim won't be reconciled again, so the nodeclaim will stay in not ready condition

@jmdeal
Copy link
Member

jmdeal commented Dec 20, 2024

Interesting, are you able to reproduce this consistently or was it a one off? I'm wondering if there's some other condition which caused the Node to not be considered initialized or if we really are missing the event. If we're missing the event, we should figure out if it's due to the predicates in the registration function, or an actual issue in controller-runtime / client-go (less likely). I don't think we want to jump ahead and just add a requeue though until we understand the root cause.

@ilyee
Copy link
Author

ilyee commented Dec 24, 2024

Interesting, are you able to reproduce this consistently or was it a one off?

It's reproducable, here is my reproduction process:

  1. running karpenter(v0.36.0) with kwok provider
  2. create a deploy with 0 replicas with large resource claim to ensure that for each node only one pod can be scheduled
  3. scale up the deploy to over 2000 replicas in one time

then up to 2000 node will be created and in ready status as expected, but some nodeclaims will stay in not ready forever

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it.
Projects
None yet
Development

No branches or pull requests

3 participants