fix: Guardrail to avoid downtime #3878
base: master
Conversation
Published E2E Test Results: 4 files, 4 suites, 3h 13m 37s ⏱️. For more details on these failures, see this check. Results for commit c3f42e3.
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #3878      +/-   ##
==========================================
+ Coverage   82.69%   82.73%   +0.03%
==========================================
  Files         163      163
  Lines       22895    22911      +16
==========================================
+ Hits        18934    18956      +22
+ Misses       3087     3083       -4
+ Partials      874      872       -2

☔ View full report in Codecov by Sentry.
Published Unit Test Results: 2,281 tests, 2,281 passed ✅, 2m 59s ⏱️. Results for commit c3f42e3.
force-pushed from b337b49 to 34878b6
force-pushed from 7d87a76 to fda1eba
Signed-off-by: Abhishek Bansal <[email protected]>
force-pushed from 39898f7 to 3ef7a60
Quality Gate passed.
Very similar logic is already defined in argo-rollouts rollout/service.go, line 292 (commit 5f59344).
@zachaller but this check does not prevent downtime if a new rollout is triggered in the middle of an ongoing deployment.
Just want to double down that this is super critical for us: if a new rollout is triggered in the middle of an already ongoing one, we must ensure that traffic does not get shifted to the stable ReplicaSet before it has been fully scaled back to the desired replica count, especially with dynamicStableScale (talking about late stages in the canary process, where stable might be at 25% and would suddenly get 100% of the traffic if a new rollout were triggered). Hope it makes 1.8 to unblock us going into prod.
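To make that scenario concrete, here is a minimal Go sketch of the arithmetic; the replica counts and names are illustrative assumptions, not taken from the controller:

```go
package main

import "fmt"

func main() {
	totalReplicas := 12
	canaryWeight := 75 // late stage of the canary process

	// With dynamicStableScale, the stable ReplicaSet is scaled down to
	// match its remaining traffic share (25% here).
	stableReplicas := totalReplicas * (100 - canaryWeight) / 100 // 3 pods

	// If a new rollout now shifts 100% of traffic back to stable before
	// it scales back up, those 3 pods take the whole load.
	overload := totalReplicas / stableReplicas
	fmt.Printf("stable holds %d/%d pods; each would absorb ~%dx its normal traffic\n",
		stableReplicas, totalReplicas, overload)
}
```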
@zachaller Is this planned for 1.8?
This would be a bug fix, so we could put it into 1.8, but I need time to go through the PR and give it a good review.
This PR introduces a guardrail in the rollouts controller to help prevent unexpected downtime during rollouts. The guardrail adds a check that the stable ReplicaSet has enough available replicas before any traffic switch occurs.
For example, if the desired canary weight is 60% and there are 10 replicas in total, the code verifies that at least 40% of the replicas (4 in this case) are available in the stable ReplicaSet before diverting 60% of the traffic to the canary.
This check is crucial for preventing downtime. A common scenario is a rollout that is already in progress when a new deployment is triggered: if the stable ReplicaSet has too few available replicas at that point, shifting traffic to it can cause a service disruption.
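A minimal sketch of what such a check can look like, in Go; the names (minStableAvailable, canShiftTraffic) and the rounding choice are illustrative assumptions, not the PR's actual code:

```go
package main

import "fmt"

// minStableAvailable returns how many stable replicas must be available
// before desiredWeight percent of traffic may be shifted to the canary.
// It rounds up so the stable side is never under-provisioned.
func minStableAvailable(totalReplicas, desiredWeight int32) int32 {
	stableWeight := 100 - desiredWeight
	return (totalReplicas*stableWeight + 99) / 100
}

// canShiftTraffic is the guardrail: the weight change is blocked until
// the stable ReplicaSet can absorb its remaining traffic share.
func canShiftTraffic(stableAvailable, totalReplicas, desiredWeight int32) bool {
	return stableAvailable >= minStableAvailable(totalReplicas, desiredWeight)
}

func main() {
	// The example from the description: 10 replicas total, 60% canary weight.
	fmt.Println(minStableAvailable(10, 60)) // 4
	fmt.Println(canShiftTraffic(3, 10, 60)) // false: wait for stable to scale up
	fmt.Println(canShiftTraffic(4, 10, 60)) // true: safe to shift traffic
}
```

Rounding up errs on the side of availability: the traffic switch waits rather than proceeding with an under-provisioned stable ReplicaSet.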
Resolves #3372