Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail fast on cloud node removal #372

Merged
merged 14 commits into from
May 15, 2024

Conversation

Vlatombe
Copy link
Member

@Vlatombe Vlatombe commented May 14, 2024

This cancels the node block immediately instead of waiting for the node to come in certain conditions when we are sure the node can't come back: using a cloud node, using OnceRetentionStrategy.

  • Fail fast after cloud node removal
  • Remove useless assertion

Testing done

Submitter checklist

Preview Give feedback

Vlatombe added 2 commits May 14, 2024 14:51
The agent is never named `ghost`, rather `slave0`
@Vlatombe Vlatombe changed the title fail fast on cloud node removal Fail fast on cloud node removal May 14, 2024
@@ -114,14 +119,17 @@ public class ExecutorStepDynamicContextTest {
sessions.then(j -> {
// Start up a build and then reboot and take the node offline
assertEquals(0, j.jenkins.getLabel("ghost").getNodes().size()); // Make sure test impl is correctly deleted
assertNull(j.jenkins.getNode("ghost")); // Make sure test impl is correctly deleted
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useless assertion, since the node would just have the label ghost but usually named slave0 based on generation.

@@ -207,4 +215,36 @@ public class ExecutorStepDynamicContextTest {
});
}

@Test public void onceRetentionStrategyNodeDisappearance() throws Throwable {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially a copy of normalNodeDisappearance checking some behavioural differences.

@jglick jglick added the bug label May 14, 2024
@Vlatombe
Copy link
Member Author

Looks like some race condition, the Run with InterruptedBuildAction gets saved to disk properly, but somehow it's not visible in the Run object in the next session.

@Vlatombe Vlatombe marked this pull request as ready for review May 14, 2024 15:01
@Vlatombe Vlatombe requested a review from a team as a code owner May 14, 2024 15:01
@Vlatombe
Copy link
Member Author

Re-launching CI

@Vlatombe Vlatombe closed this May 14, 2024
@Vlatombe Vlatombe reopened this May 14, 2024
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused by the doubled-up cause of interruption.

s.setRetentionStrategy(new OnceRetentionStrategy(0));
var run = p.scheduleBuild2(0).waitForStart();
j.waitForMessage("+ sleep infinity", run);
j.jenkins.removeNode(s);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC this is what Reaper in kubernetes would do as soon as an agent pod is deleted.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I do not understand why there is a new cause of interruption at all. The only change that should need to be made is for RemovedNodeListener to interrupt the build immediately rather than after a delay, right?

Vlatombe added 5 commits May 15, 2024 09:24
* Use only one cause
* When build is cancelled immediately, use RemovedNodeCause
* When build is cancelled after observing timeout, use RemovedNodeTimeoutCause
* Introduce a marker interface to simplify matching in AgentErrorCondition
@Vlatombe Vlatombe requested a review from jglick May 15, 2024 08:52
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler now, thanks. Some optional suggestions.

Comment on lines +357 to +359
if (isOneShotAgent(node)) {
LOGGER.fine(() -> "Cancelling owner run for one-shot agent " + node.getNodeName() + " immediately");
cancelOwnerExecution(node, new RemovedNodeCause());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The crucial part FTR.)

@jglick
Copy link
Member

jglick commented May 15, 2024

ShellStepTest.removingAgentIsFatal needs a change to expected message.

@jglick jglick merged commit 1891ab0 into jenkinsci:master May 15, 2024
14 checks passed
@Vlatombe Vlatombe deleted the fail-fast-on-cloud-node-removal branch May 15, 2024 15:08
Comment on lines +362 to +363
return node instanceof AbstractCloudSlave ||
(node instanceof Slave && ((Slave) node).getRetentionStrategy() instanceof OnceRetentionStrategy);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this heuristic does not match EC2AbstractSlave extends Slave nor EC2RetentionStrategy extends RetentionStrategy. I guess we need to hard-code support for those nonstandard implementations. CC @car-roll

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we start introducing a marker interface for this usage?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, though I think it would suffice for EC2AbstractSlave to extend AbstractCloudSlave and EC2Computer to extend AbstractCloudComputer, with some minor refactoring to delete then-redundant logic. CloudBees-internal reference

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants