fix: Guardrail to avoid downtime #3878

Open
wants to merge 2 commits into
base: master


Abhish2702 commented Oct 8, 2024

This PR introduces a guardrail in the rollouts controller to help prevent unexpected downtime during rollouts. The guardrail adds an additional check to ensure that the number of replicas in the stable ReplicaSet is sufficient before any traffic switch occurs.

For example, if the desired weight for canary replicas is set to 60%, and there are 10 replicas in total, the code will verify that, before diverting 60% of the traffic to the canary replicas, at least 40% of the replicas (i.e., 4 in this case) are available in the stable ReplicaSet.
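The arithmetic in this example can be sketched as follows. This is only an illustration under stated assumptions, not the code in this PR: the helper names `minStableReplicas` and `stableHasCapacity` are hypothetical, and rounding the required count up is an assumption.

```go
package main

import "fmt"

// minStableReplicas returns how many stable replicas should be available
// before canaryWeight percent of traffic is diverted to the canary.
// Hypothetical helper: rounding up is an assumption, not taken from the PR.
func minStableReplicas(totalReplicas, canaryWeight int32) int32 {
	stableWeight := 100 - canaryWeight
	// Integer ceiling of totalReplicas * stableWeight / 100.
	return (totalReplicas*stableWeight + 99) / 100
}

// stableHasCapacity reports whether the stable ReplicaSet has enough
// available replicas to serve the traffic that stays on it.
func stableHasCapacity(availableStable, totalReplicas, canaryWeight int32) bool {
	return availableStable >= minStableReplicas(totalReplicas, canaryWeight)
}

func main() {
	// 10 replicas, 60% canary weight: 40% of 10 = 4 stable replicas needed.
	fmt.Println(minStableReplicas(10, 60))
	// Only 3 stable replicas available: the traffic switch should wait.
	fmt.Println(stableHasCapacity(3, 10, 60))
}
```

With 10 replicas and a 60% canary weight, the sketch requires 4 available stable replicas, matching the example above.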

This check is crucial to prevent potential downtime. A common scenario where downtime can occur is when a rollout is already in progress and a new deployment is triggered. In such cases, if the stable ReplicaSet does not have enough available replicas, service can be disrupted.

Resolves #3372
Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

Abhish2702 changed the title from "Guardrail to avoid downtime" to "fix: Guardrail to avoid downtime" on Oct 8, 2024
Contributor

github-actions bot commented Oct 8, 2024

Published E2E Test Results

  4 files    4 suites   3h 13m 37s ⏱️
113 tests 104 ✅  7 💤 2 ❌
454 runs  424 ✅ 28 💤 2 ❌

For more details on these failures, see this check.

Results for commit c3f42e3.

♻️ This comment has been updated with latest results.


codecov bot commented Oct 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.73%. Comparing base (5f59344) to head (c3f42e3).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3878      +/-   ##
==========================================
+ Coverage   82.69%   82.73%   +0.03%     
==========================================
  Files         163      163              
  Lines       22895    22911      +16     
==========================================
+ Hits        18934    18956      +22     
+ Misses       3087     3083       -4     
+ Partials      874      872       -2     


Contributor

github-actions bot commented Oct 8, 2024

Published Unit Test Results

2 281 tests   2 281 ✅  2m 59s ⏱️
  128 suites      0 💤
    1 files        0 ❌

Results for commit c3f42e3.

♻️ This comment has been updated with latest results.

Signed-off-by: Abhishek Bansal <[email protected]>
@zachaller
Collaborator

zachaller commented Nov 20, 2024

Very similar logic is already defined via

func (c *rolloutContext) ensureSVCTargets(svcName string, rs *appsv1.ReplicaSet, checkRsAvailability bool) error {
I do not think it makes sense to add another check

@Abhish2702
Author

> Very similar logic is already defined via
>
> func (c *rolloutContext) ensureSVCTargets(svcName string, rs *appsv1.ReplicaSet, checkRsAvailability bool) error {
>
> I do not think it makes sense to add another check

@zachaller But this check does not prevent downtime if we trigger a new rollout while a deployment is still in progress.

@KarimTarek

Just want to double down: this is critical for us. If a new rollout is triggered in the middle of an ongoing one, traffic must not shift back to the stable ReplicaSet before it has fully scaled back up to the desired count, especially with dynamicStableScale. (In the late stages of a canary, stable might be at 25% and would suddenly receive 100% of the traffic if you triggered a new rollout.) We hope this makes 1.8 so that we are unblocked from going to prod.
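To make the scenario concrete, here is a rough sketch of how much extra load each stable pod would take if all traffic shifted back before the stable ReplicaSet had scaled up. The function name `perPodLoadFactor` and the replica counts are hypothetical, and the sketch assumes traffic is spread evenly across available pods.

```go
package main

import (
	"errors"
	"fmt"
)

// perPodLoadFactor estimates the load on each stable pod relative to the
// load it would carry if all desiredReplicas pods were serving all traffic.
// Hypothetical helper: assumes traffic is spread evenly across pods.
func perPodLoadFactor(desiredReplicas, availableStable int32, stableTrafficPercent float64) (float64, error) {
	if desiredReplicas <= 0 || availableStable <= 0 {
		return 0, errors.New("replica counts must be positive")
	}
	// Share of traffic one pod handles when fully scaled.
	normalPerPod := 100.0 / float64(desiredReplicas)
	// Share of traffic one pod handles right now.
	actualPerPod := stableTrafficPercent / float64(availableStable)
	return actualPerPod / normalPerPod, nil
}

func main() {
	// Stable scaled down to 25% of 8 desired replicas (2 pods),
	// then suddenly receiving 100% of the traffic.
	factor, err := perPodLoadFactor(8, 2, 100)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("each stable pod carries %.1fx its normal load\n", factor)
}
```

Under these assumed numbers, each remaining stable pod would carry four times its normal share until the ReplicaSet scales back up.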

@heshamelsherif97

@zachaller Is this planned for 1.8?

@zachaller
Collaborator

This would be a bug fix so we could put it into 1.8 but I need time to go through the PR and give it a good review.

Labels
None yet
Projects
None yet
4 participants