Operator-sdk scorecard basic spec check occasionally produces timeout with no logs #5452
Comments
FTR, I encountered a failure recently with verbose logging turned on. AFAICT the interesting part (i.e. what it is that's actually timing out) is not getting logged, but posting here in case it's useful to someone investigating this.
|
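Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen. If this issue is safe to close now please do so with /close. /lifecycle stale |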
/remove-lifecycle stale |
Hello team, that's an old issue but we keep seeing it regularly while running Preflight which currently runs operator-sdk 1.22.2. Here are the logs from preflight.log (BasicSpecCheck.json is not created at all):
Now we have more logs (thank you team!) and see that the timeout happens here: operator-sdk/internal/cmd/operator-sdk/scorecard/cmd.go Lines 96 to 98 in 5d541d0
@asmacdo @theishshah do you have any clues? cc: @acornett21 |
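For readers trying to narrow down such a timeout, here is a minimal Go sketch of how a single shared deadline produces the bare "context deadline exceeded" message, and how wrapping the error with a stage name would show what was running when the deadline expired. This is not the operator-sdk code referenced above: the `stage` type, `runStages`, `sleepOrCancel`, the stage names, and the durations are all hypothetical and only illustrate the idea.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stage is a hypothetical unit of work inside a test run.
type stage struct {
	name string
	run  func(ctx context.Context) error
}

// runStages executes the stages in order under one shared context and,
// on failure, reports which stage was active.
func runStages(ctx context.Context, stages []stage) error {
	for _, s := range stages {
		if err := s.run(ctx); err != nil {
			if errors.Is(err, context.DeadlineExceeded) {
				return fmt.Errorf("stage %q timed out: %w", s.name, err)
			}
			return fmt.Errorf("stage %q failed: %w", s.name, err)
		}
	}
	return nil
}

// sleepOrCancel simulates work that lasts d but aborts when ctx is done.
func sleepOrCancel(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo runs a fast stage and a slow stage under a 100ms deadline
// (assumed values, chosen so that the second stage always times out).
func demo() error {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	return runStages(ctx, []stage{
		{name: "create test pod", run: sleepOrCancel(time.Millisecond)},
		{name: "wait for pod completion", run: sleepOrCancel(5 * time.Second)},
	})
}

func main() {
	fmt.Println(demo())
}
```

Because the error is wrapped with `%w`, callers can still test for `context.DeadlineExceeded` with `errors.Is` while the message names the stage that hit the deadline.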
Upd. In about half of these cases, the scorecard-test pod gets stuck in the namespace and fails to terminate. That part should be easy to fix, whatever the cause of the initial issue.
|
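One general pitfall that can leave resources behind after a timeout is running cleanup with the same context that has already expired. Whether that is what happens to the stuck scorecard-test pod is not confirmed anywhere in this thread; the Go sketch below only illustrates the pattern. `deletePod` is a hypothetical stand-in for an API call, not a real client function, and the durations are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deletePod is a hypothetical stand-in for an API call that removes a pod.
// Like most context-aware clients, it refuses to proceed once ctx is done.
func deletePod(ctx context.Context, name string) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("delete pod %s: %w", name, err)
	}
	return nil
}

// cleanupAfterTimeout lets the run context expire, then attempts the delete
// twice: once with the expired context and once with a fresh one.
func cleanupAfterTimeout() (withExpired, withFresh error) {
	runCtx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	<-runCtx.Done() // simulate the test run hitting its deadline

	withExpired = deletePod(runCtx, "scorecard-test")

	// A separate context gives cleanup its own time budget.
	cleanupCtx, cancelCleanup := context.WithTimeout(context.Background(), time.Second)
	defer cancelCleanup()
	withFresh = deletePod(cleanupCtx, "scorecard-test")

	return withExpired, withFresh
}

func main() {
	expired, fresh := cleanupAfterTimeout()
	fmt.Println("expired context:", expired)
	fmt.Println("fresh context:", fresh)
}
```

The design point is that cleanup derives its context from `context.Background()` rather than from the run context, so it still has time to finish after the run itself has timed out.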
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
/remove-lifecycle stale |
@tkrishtop Recently operator-sdk switched to using their own images for this cmd. So instead of using an image from |
Hi @acornett21, thank you for the information! Typically, we run the scorecard-sdk from preflight, which currently uses version 1.26.0. However, I do have the option to run the standalone scorecard-sdk, so perhaps I'll activate it for our daily runs using version 1.28.0. |
Bug Report
I'm from the DCI team. As part of our daily cert suite, we regularly run
operator-sdk scorecard --selector=test=basic-check-spec-test
for two operators, simple-demo-operator and testpmd-operator, 10 tests per day.
Normally, basic-check-spec-test should be green for both. But in about 20% of cases it fails for both operators in a row with the timeout error "error running tests context deadline exceeded". Increasing the timeout with wait-time up to 300s doesn't help. Because the test always fails for both operators in a row, it looks like a 10-20 minute problem with some internal API.
To have more information, could you please add more detail to the logs about where exactly the timeout happened?
What did you do?
What did you expect to see?
The results of basic-check-spec-test should be stable. In the case of a timeout, it would be nice to have logs to identify the reason for the timeout.
What did you see instead? Under which circumstances?
Timeout in 20% of cases, "error running tests context deadline exceeded", with no detailed logs.
Environment
Operator type:
Kubernetes cluster type:
Happens randomly for the latest stable OCP 4.7, OCP 4.8, OCP 4.9, OCP 4.10
$ operator-sdk version
Possible Solution
It would be nice to have more detailed logs to identify what is the reason for this timeout.