Update to controller-runtime 0.19.1 / Kube 1.31 #293
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: xrstf The full list of commands accepted by this bot can be found here.
pkg/scheduler/reconciler_test.go
@@ -61,11 +61,11 @@ func (ft *fakeTracker) Get(gvr schema.GroupVersionResource, ns, name string, opt
	return ft.ObjectTracker.Get(gvr, ns, name, opts...)
}

func (ft *fakeTracker) Update(gvr schema.GroupVersionResource, obj runtime.Object, ns string, opts ...metav1.UpdateOptions) error {
Going to file this away in my mental bank of "reasons the fake client is not what you want".
test/integration/test/deck_test.go
// can rerun from.
// Horologium itself is pretty good at handling the configmap update, but
// not kubelet, according to
// https://github.com/kubernetes/kubernetes/issues/30189 kubelet syncs
n.b. the linked issue says some semantically useless annotation update should kick the kubelet
if !passed {
	t.Fatal("Expected updated job.")
// Wait for the first job to be created by horologium.
initialJob := getLatestJob(t, jobName, nil)
nit: `initial = getLatest()` is confusing - is it the initial or the latest?
}); err != nil {
	t.Logf("ERROR CLEANUP: %v", err)
}
})
ctx := context.Background()

getLatestJob := func(t *testing.T, jobName string, lastRun *v1.Time) *prowjobv1.ProwJob {
It seems like we're missing the meaning of what this function was originally written to do - is the issue that the function expected to sort these resources by the `resourceVersion` at which they were created, but when there are interspersed UPDATE calls, the objects' current `resourceVersion` no longer sorts them?
Can we sort by the job ID since we know that's monotonically increasing? Creation timestamp is an awkward choice as it can have ties.
Sorry for the long response delay.
Which "job ID" are you referring to exactly? The only numerical ID I can see is the `prow.k8s.io/build-id` and that is not unique per ProwJob. Is `prow.k8s.io/id` not just a random UUID, but monotonically increasing?
I don't see any other good value to use besides the creation timestamp when I want to search by, well, creation order:
apiVersion: prow.k8s.io/v1
kind: ProwJob
metadata:
  annotations:
    prow.k8s.io/context: ""
    prow.k8s.io/job: rerun-test-job-3a2c5361172244414edc254fc8d21de5
  creationTimestamp: "2024-12-14T15:54:56Z"
  generation: 3
  labels:
    created-by-prow: "true"
    foo: foo
    prow.k8s.io/build-id: "1867961480935116800"
    prow.k8s.io/context: ""
    prow.k8s.io/id: f48bd245-1335-4ef4-8556-d7b9d232014a
    prow.k8s.io/job: rerun-test-job-3a2c5361172244414edc254fc8d21de5
    prow.k8s.io/type: periodic
  name: f48bd245-1335-4ef4-8556-d7b9d232014a
  namespace: default
  resourceVersion: "9383"
  uid: 56b93b1c-4a53-4ba6-90e6-49ce918115cd
spec:
  agent: kubernetes
  cluster: default
  job: rerun-test-job-3a2c5361172244414edc254fc8d21de5
  namespace: test-pods
  pod_spec:
    containers:
    - args:
      - Hello World!
      command:
      - echo
      image: localhost:5001/alpine
      name: ""
      resources: {}
  prowjob_defaults:
    tenant_id: GlobalDefaultID
  report: true
  type: periodic
status:
  build_id: "1867961480935116800"
  completionTime: "2024-12-14T15:54:58Z"
  description: Job succeeded.
  pendingTime: "2024-12-14T15:54:56Z"
  pod_name: f48bd245-1335-4ef4-8556-d7b9d232014a
  startTime: "2024-12-14T15:54:56Z"
  state: success
}

// Prevent Deck from being too fast and recreating the new job in the same second
// as the previous one.
time.Sleep(1 * time.Second)
What's the downside of having the second job created in the same second? Can we fix that instead of adding a sleep?
Without the artificial delay, situations like this can happen:
pj[0] = name=3ee585ca-d93f-4fbc-9a09-b5c3baeeb2b6, created=2024-12-14 16:54:43 +0100 CET, id=1867961304296198144
pj[1] = name=b5578f85-ecb6-4741-b3d4-f9f33f099c59, created=2024-12-14 16:54:14 +0100 CET, id=1867961304296198144
pj[2] = name=094a01cf-782b-427a-b5e1-e02e551f604d, created=2024-12-14 16:54:43 +0100 CET, id=1867961425939402752
This leads to an unstable sorting order, making the test flake. :-/
test/integration/test/setup.go
}

ready := true &&
nit: in my experience, the moment this does not correctly happen within the timeout, `return ready` will hide the details from the engineer debugging this, which makes for an unpleasant set of next steps. Could we please format the conditions you're looking for as a string, log it out on state transitions (e.g. do not spam log when nothing has changed), and indicate whether the observed state is as expected or not?
Mostly looks great! Couple small comments.
This PR brings Prow up-to-speed with the latest Kubernetes and controller-runtime dependencies, plus a few more changes to make these new dependencies work.
controller-tools 0.16.4
Without this update, codegen would fail:
golangci-lint 1.58.0
After updating code-generator, staticcheck suddenly threw false positives like:
However, looking at the code, the `help == nil` check leads to a `t.Fatal`, which should be recognized by staticcheck. I have no idea why this suddenly happened, but updating to the next highest golangci-lint version fixes the issue.
Flakiness due to rate limiting
I noticed some tests flaking a lot and started digging. It turns out the issue wasn't actually from loops timing out or contexts getting cancelled, but from the client-side rate limiting that is enabled in the kube clients. I think rate limiting doesn't make much sense during integration tests, as a lot of code would potentially have to handle errors arising from it.
I have therefore disabled the rate limiter by setting `cfg.RateLimiter = flowcontrol.NewFakeAlwaysRateLimiter()` in the integration test utility code.
Deck re-run tests
These tests have been reworked quite a bit, as they were quite flaky. The issue ultimately boiled down to the old code sorting ProwJobs by ResourceVersion, but during testing I found that it happens quite a lot that ProwJobs are created/updated nearly simultaneously. This has been resolved by sorting the ProwJobs by CreationTimestamp instead, which is unaffected by update calls.
However, that is nearly the smallest change in the refactoring.
The tests now poll with `wait.PollUntilContextTimeout`. It's IMO unnecessary to have a back-off mechanism in integration tests like this. It just needlessly slows down the test.
The "rotate Deployment instead of deleting Pods manually" method has been applied to all other integration tests.