Implement agent reuse, toggled by a cluster profile config option #355

Open — brandonvin wants to merge 2 commits into master from add-option-to-enable-agent-reuse

Conversation

@brandonvin (Contributor) commented May 20, 2023

Description

This implements agent reuse following the approach outlined in these comments: #53 (comment) #53 (comment). Resolves #53

Cluster profile has an option to enable agent reuse, defaulting to false:

[Screenshot: GoCD cluster profile option, edit view]

[Screenshot: GoCD cluster profile option]

The naming and presentation of this option are open to feedback! 🙂

Changes

When pods are created, they are annotated with a hash of the elastic config. This ensures that when elastic config is changed, agents created from the old config will not be reused, and will eventually expire after the timeout.
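As a rough sketch of the idea (the class and method names here are hypothetical, not the plugin's actual code), the annotation value can be a stable digest of the elastic config properties, computed so that the same properties always produce the same hash regardless of map ordering:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: derive a stable hash of the elastic config so pods can
// be annotated with it at creation and compared against the current config.
public class ElasticConfigHash {
    public static String hashOf(Map<String, String> properties) {
        // Sort keys first so the hash is independent of map iteration order.
        StringBuilder canonical = new StringBuilder();
        for (Map.Entry<String, String> e : new TreeMap<>(properties).entrySet()) {
            canonical.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available on the JVM
        }
    }
}
```

Because the digest changes whenever any property changes, agents annotated with an old hash simply stop matching and age out via the timeout.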

When agent reuse is enabled, the main behavior changes are:

  • Job completion request -> instead of terminating the agent, mark the agent as idle so it may be considered for reuse
  • Should assign work request -> assign work only if the agent is available for work and the elastic config hash matches the agent
  • Create agent request -> only create an agent if none are available for reuse
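
The three decision points above can be sketched as follows (all names here are illustrative, not the plugin's actual classes):

```java
import java.util.List;

// Hypothetical sketch of the three behavior changes when reuse is enabled.
public class ReuseDecisions {
    public enum AgentState { IDLE, BUILDING }

    public record PodRecord(String podName, AgentState state, String configHash) {}

    // Job completion: keep the pod, mark it idle instead of terminating it.
    public static PodRecord onJobCompleted(PodRecord pod) {
        return new PodRecord(pod.podName(), AgentState.IDLE, pod.configHash());
    }

    // Should-assign-work: only if the agent is idle and its recorded config
    // hash matches the hash of the requesting job's elastic config.
    public static boolean shouldAssignWork(PodRecord pod, String requestedHash) {
        return pod.state() == AgentState.IDLE && pod.configHash().equals(requestedHash);
    }

    // Create-agent: spawn a new pod only when nothing is eligible for reuse.
    public static boolean shouldCreateAgent(List<PodRecord> pods, String requestedHash) {
        return pods.stream().noneMatch(p -> shouldAssignWork(p, requestedHash));
    }
}
```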

Other supporting changes:

  • KubernetesInstance is an immutable data class constructed by a builder, and no longer contains a Kubernetes client.
  • Improved field names (e.g. properties) to clarify which are cluster profile properties and which are elastic profile properties.
  • Moved instance refreshing inside the ServerPingRequestExecutor instead of "inline" in the KubernetesPlugin request handling block.
  • Added a request handler for REQUEST_PLUGIN_SETTINGS that returns an empty map - just to silence some warnings about this request not being implemented.
  • Some classes are refactored to allow testing with less mocking.
  • Removed an unnecessary semaphore.
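
The immutable-with-builder shape of the first point looks roughly like this (field names are illustrative; the real class carries more state):

```java
// Hypothetical sketch: an immutable KubernetesInstance built via a builder.
// It no longer holds a Kubernetes client, so it is safe to share and easy to
// construct in tests without mocking.
public final class KubernetesInstance {
    private final String podName;
    private final String environment;
    private final String elasticConfigHash;

    private KubernetesInstance(Builder b) {
        this.podName = b.podName;
        this.environment = b.environment;
        this.elasticConfigHash = b.elasticConfigHash;
    }

    public String podName() { return podName; }
    public String environment() { return environment; }
    public String elasticConfigHash() { return elasticConfigHash; }

    // All mutation happens on the builder; the built instance never changes.
    public static final class Builder {
        private String podName;
        private String environment;
        private String elasticConfigHash;

        public Builder podName(String v) { this.podName = v; return this; }
        public Builder environment(String v) { this.environment = v; return this; }
        public Builder elasticConfigHash(String v) { this.elasticConfigHash = v; return this; }
        public KubernetesInstance build() { return new KubernetesInstance(this); }
    }
}
```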

This ended up being a pretty large set of changes. Happy to explain in more detail if needed!

Testing

Many unit tests updated and added, and run with ./gradlew test. I've also tested this running GoCD on my machine with a Kubernetes cluster in kind.

@brandonvin brandonvin marked this pull request as ready for review May 20, 2023 19:57
@chadlwilson (Member)

Thanks for this. Appreciate all the work here. Unfortunately I have limited capacity to review, try out, and give useful insight, so it might take me some time 🙏

@woopstar

Nice work +1 on this

@brandonvin (Contributor, Author) commented Oct 10, 2024

Hi there @chadlwilson! Is there any chance this PR could move forward in the next month or so? Is there anything I can do to help, or to make you more comfortable merging this? If not - no worries - I would probably take a stab at forking the repo, building from the fork, and testing this out more in our production GoCD server. If I find any issues from that, I'll push them up to this PR.

@chadlwilson (Member)

Thanks for the ping.

Honestly, the challenge with these elastic agent plugins is my concern about regressions. Most don't have any particular "integration" tests that exercise them against a real GoCD server, and a lot of weird timing and other scenarios can arise with elastic agent plugins (e.g. what happens if the GoCD server dies? When it comes back, does it still correctly adopt the agents, or do they end up orphaned? What happens if a pod that was expecting to be reused dies? What if the container inside restarts?). This plugin also happens to be probably the most actively used elastic agent approach.

But yeah, the real limitation has been time and energy, distraction, and the extra 'friction' of validating this one. It is still on my radar; I've just been kicking the can down the road. It's a bit easier for me to validate locally now that we have smaller non-Alpine arm64 container images to use on elastic agents.

I naively expected you were probably running this off your own build anyway, is that not the case? :-)

@chadlwilson force-pushed the add-option-to-enable-agent-reuse branch from a7e1761 to 148c342 on October 11, 2024 02:57
@chadlwilson force-pushed the add-option-to-enable-agent-reuse branch from 26fc03a to f7da6b9 on November 14, 2024 05:40
@chadlwilson (Member) commented Dec 7, 2024

After an inordinate amount of time, I gave this a quick whirl locally in a colima/k3s single node cluster. Certainly not tested in anger, nor fully, so I only have a few observations so far.

  • When do idle agents get terminated? I couldn't quite figure out the semantics here, what was going on, where the logic for terminating them was, or why they were being terminated. I did see one of my agents go away approximately 1 minute after it was last used, but others stayed alive much longer. It probably also needs to log something when it's doing this, as currently all you see from server pings is that the pod count has suddenly decreased.

    • is it reusing the auto-register timeout and cleaning up agents that are idle 10 minutes after creation, perhaps? If so, I'm not sure about reusing this configuration - it's intended for agents that get stuck registering, pods that can't be scheduled due to cluster resource limits, that type of thing. It would probably be better to separate this out as an "idle timeout", similar to how EC2 instances are managed with the ECS plugin.
  • At one point the plugin created 1 new agent container, and then queued up jobs for that same agent rather than spawning new containers to serve the queued jobs. Each job was processed by the single agent 1-by-1. Why might this have happened? Perhaps a race condition: 'n' jobs see the agent as idle, but then the work is allocated to the sole agent and the other jobs aren't reassessed as needing new agents to be spawned? I found it difficult to replicate this, but it was strange.

  • (not necessarily a problem, just a marker for clarification) Similarly, I terminated 5 idle agents with kubectl delete pod and then scheduled new jobs. Since the GoCD server still thought all 5 agents were there and idle, it did not create any new agents/pods. Probably expected. It eventually figured things out and created pods after the server-defined 2-minute jobStarvationTimeout. We probably need to check whether behaviour is similar with other plugins.

    2024-12-07 08:36:00,649 INFO  [318@MessageListener for CreateAgentListener] KubernetesPlugin:72 - [reuse] Not creating a new pod - found 4 eligible for reuse.
    2024-12-07 08:36:08,305 INFO  [323@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
    2024-12-07 08:36:08,318 WARN  [323@MessageListener for ServerPingListener] KubernetesPlugin:97 - [Server Ping] Was expecting a containers with IDs [gocd-agent-34c87296-7ddb-4258-8991-78fb79b4e901, gocd-agent-4556df54-1341-4e2d-b16f-02d64a1200cd, gocd-agent-f891d9a2-1111-45d4-bcaa-b68cbc34aaef, gocd-agent-b917d30a-f23b-4916-94e0-285bc12eb812, gocd-agent-343a8567-40cb-47bd-acb3-4ad03d1bd15e], but it was missing! Removing missing agents from config.
    2024-12-07 08:37:08,312 INFO  [323@MessageListener for ServerPingListener] KubernetesPlugin:72 - [refresh-pod-state] Pod information successfully synced. All(Running/Pending) pod count is 0.
    2024-12-07 08:37:55,839 INFO  [320@MessageListener for CreateAgentListener] KubernetesPlugin:92 - [reuse] Found 0 pods eligible for reuse for CreateAgentRequest for job 84: []
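
The dedicated "idle timeout" suggested above could be as simple as tracking when each agent last went idle and terminating it once a limit distinct from the auto-register timeout is exceeded. A minimal sketch, with hypothetical names:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of a dedicated idle timeout, separate from the
// auto-register timeout: terminate an idle pod only once it has been idle
// longer than the configured limit.
public class IdleTimeoutCheck {
    public static boolean shouldTerminate(Instant idleSince, Instant now, Duration idleTimeout) {
        return Duration.between(idleSince, now).compareTo(idleTimeout) > 0;
    }
}
```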
    

Generally speaking I'm finding this hard to review due to the other "supporting" and style-oriented changes in here which distract me a bit from the core feature. Largely this is probably because I didn't write any of the original code, and so it's not always obvious to me why the supporting changes were needed, or what the implications are.

  • when you say "Removed an unnecessary semaphore." it's difficult for me to know whether it is in fact unnecessary, and I have to dig back through and see why they added it in the first place.
  • there are probably good reasons the instances were refreshed before assessing pending pod limits, rather than waiting for 60-second pings? I don't know the implications of refreshing instances only during ping; it possibly leads to making decisions on pending pod limits with out-of-date information.

Basically, this might be too complex for me to review. My main concern is to not break existing things, as I'm basically the only maintainer who looks at anything across all of GoCD server plus plugins and this is one of the most heavily used plugins. If we can't get it so it's easier for me to reason about and figure out how it might change the as-is behaviour or some other help to review/test so it's not all on me, it might be better to maintain a fork.

@chadlwilson chadlwilson removed their assignment Dec 12, 2024
Successfully merging this pull request may close these issues: Elastic Agent Reuse