Elastic Agent Reuse #53

Open
saifally opened this issue May 8, 2018 · 11 comments · May be fixed by #355

Comments

@saifally

saifally commented May 8, 2018

Hi Guys,

I have noticed that elastic agents do not get reused across builds.
When I start a build, the plugin spins up another pod with a new agent even if an idle one already exists.

I am wondering if this is intentional.

I can contribute a change for this if it's a worthwhile feature.

Thanks,
Saif

@varshavaradarajan
Contributor

@saifally - with the introduction of v3 of the elastic agent extension, this change around running exactly one job per agent was made. This issue talks about why we introduced the change.

We are thinking of providing an option at the elastic profile level, called Reuse agent, that would allow one elastic agent to be used for multiple jobs.

@saifally
Author

saifally commented May 8, 2018

The feature would be really useful for our use case. Let me know if you would like a contribution.

@jyotisingh

@varshavaradarajan - the issue that you point to was about avoiding reuse of the go-agent, which in this case is a container. I assume none of us are talking about reusing containers. It's about reusing pods, which shouldn't have anything to do with the specific issue you point out. Correct?

@Evesy

Evesy commented Aug 16, 2018

This would be useful for us too. Spinning up a new pod for every single pipeline/stage/job ends up adding quite a bit of time to pipelines (circa 30s for the agent to start, connect to the GoCD server, etc.). It would be great to be able to reuse these pods and only scale them down if they haven't been requested within some grace period (20 minutes, for example).

@skloss

skloss commented Mar 12, 2019

Would it not be more efficient, instead of reusing pods, to start the next stage's pod while the previous stage is still running? For example, take a pipeline with 4 stages (build, publish, infrastructure, deploy): while the build stage runs, the pod for the publish stage is started, so it is ready as soon as it is needed, instead of the stage waiting up to 30s for its pod to become ready. In our case this would cut build times by at least 2 minutes.

@arvindsv
Member

@skloss Possibly. But, GoCD doesn't schedule the next stage while the previous stage is running. So, the plugin won't know that it needs to bring up a different pod. It's possible that the previous stage fails and the next stage pod is not needed. It's also possible that the next stage needs to start with a completely different profile too.

@gocd-matrix

Hi, I can just add that I'd really love to see some improvement here. If we have many quick jobs, the "spin up penalty" of approx. 25s is a pity. So currently I can choose to have:

a) "pre-forked workers" (the classic agents that allow reuse)
b) "on-demand workers" (elastic agents that don't allow reuse)

The "reuse agent" functionality should address this nicely and give us the best of both worlds.

@brandonvin
Contributor

brandonvin commented Jan 22, 2023

It sounds like this would imply a few related changes to the plugin:

  • JobCompletion: It could no-op, and allow pods to expire based on their idle timeout.

  • CreateNewAgent: Survey existing pods created by the plugin. If there are already pods booting up or idle, for the same cluster profile and elastic profile, skip creating a new pod. Also make sure not to create too many pods (keep using the max number of pods).

  • ShouldAssignWork: Return true only if the job matches the cluster profile and elastic profile of the agent.

Pods can be configured with the full power of K8s YAML - including volumes, secrets, resources, multiple containers, labels etc. If a user updates a pod configuration in the plugin, any existing pods using the old config should be retired and not used again. So I think in order to allow a pod to be reused, the plugin would have to require that the cluster profile and elastic profile associated with the job are identical to what that pod has.
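
To make the CreateNewAgent and ShouldAssignWork rules above concrete, here is a minimal sketch of the two decisions. It is not the plugin's code: `PodInfo`, `ReuseDecisions`, the `State` values, and the representation of profiles as plain string maps are all hypothetical simplifications chosen for illustration.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, simplified view of a pod created by the plugin (not the plugin's real class).
final class PodInfo {
    enum State { BOOTING, IDLE, BUILDING }

    final String name;
    final State state;
    final Map<String, String> clusterProfile;
    final Map<String, String> elasticProfile;

    PodInfo(String name, State state,
            Map<String, String> clusterProfile, Map<String, String> elasticProfile) {
        this.name = name;
        this.state = state;
        this.clusterProfile = clusterProfile;
        this.elasticProfile = elasticProfile;
    }

    // A pod is a candidate only for jobs whose profiles are identical to the ones it was created with.
    boolean matches(Map<String, String> jobCluster, Map<String, String> jobElastic) {
        return clusterProfile.equals(jobCluster) && elasticProfile.equals(jobElastic);
    }
}

final class ReuseDecisions {
    // CreateNewAgent: skip creation when the pod cap is reached, or when a matching
    // pod is already booting or idle.
    static boolean shouldCreatePod(List<PodInfo> pods, int maxPods,
                                   Map<String, String> jobCluster, Map<String, String> jobElastic) {
        if (pods.size() >= maxPods) {
            return false;
        }
        for (PodInfo pod : pods) {
            if (pod.state != PodInfo.State.BUILDING && pod.matches(jobCluster, jobElastic)) {
                return false;
            }
        }
        return true;
    }

    // ShouldAssignWork: accept the job only if its profiles match those of the agent's pod.
    static boolean shouldAssignWork(PodInfo pod,
                                    Map<String, String> jobCluster, Map<String, String> jobElastic) {
        return pod != null && pod.matches(jobCluster, jobElastic);
    }
}
```

One simplification to be aware of: a single booting pod suppresses creation for every pending job with the same profiles, so a real implementation would probably need to compare the number of pending jobs against the number of available pods.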

Does that sound about right? If so, I may start a branch for this and see how it goes.

@chadlwilson
Member

@brandonvin While I of course support any innovative thinking/work here, I suspect (but do not know) that the server side, the scheduling and the elastic agent plugin model itself may in some way assume that an elastic agent can only complete one job, even if that wasn't the original design intention. This is just based on my wondering why none of the elastic agent plugins seem to work this way already 😅 So I would suggest trying to validate the whole idea in as quick-and-dirty a way as possible before going too deep into perfection 🙏

I do however note in the plugin API that https://plugin-api.gocd.org/current/elastic-agents/#job-completion specifically notes that plugins might want to keep agents running longer and that https://plugin-api.gocd.org/current/elastic-agents/#server-ping would be able to use it as a trigger for "kill idle agents" (one minor bit possibly missing from your list above).
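
As a rough sketch of the "kill idle agents" idea, a server ping handler could consult a small tracker like the one below to find pods that have been idle longer than a timeout. `IdlePodTracker` and its method names are hypothetical and not part of the plugin or the plugin API; the caller is assumed to pass in the current time and to do the actual termination.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers when each pod became idle and reports the ones past the timeout.
final class IdlePodTracker {
    private final Duration idleTimeout;
    private final Map<String, Instant> idleSince = new ConcurrentHashMap<>();

    IdlePodTracker(Duration idleTimeout) {
        this.idleTimeout = idleTimeout;
    }

    // Call when a job completes on the pod (instead of terminating it).
    void markIdle(String podName, Instant now) {
        idleSince.put(podName, now);
    }

    // Call when work is assigned to the pod again.
    void markBusy(String podName) {
        idleSince.remove(podName);
    }

    // Call from the server ping handler; returns pods idle for at least the timeout.
    List<String> expired(Instant now) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : idleSince.entrySet()) {
            if (Duration.between(entry.getValue(), now).compareTo(idleTimeout) >= 0) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Passing the clock in as a parameter keeps the tracker easy to test. Whether the plugin has enough signals to call `markIdle` and `markBusy` reliably is exactly the open question about event hooks.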

As you've probably figured out, the current termination happens at:

    ClusterProfileProperties clusterProfileProperties = jobCompletionRequest.clusterProfileProperties();
    String elasticAgentId = jobCompletionRequest.getElasticAgentId();
    Agent agent = new Agent();
    agent.setElasticAgentId(elasticAgentId);
    LOG.info(format("[Job Completion] Terminating elastic agent with id {0} on job completion {1}.", agent.elasticAgentId(), jobCompletionRequest.jobIdentifier()));
    List<Agent> agents = Arrays.asList(agent);
    // Disable the agent on the server, terminate its pod, then delete the agent from the server.
    pluginRequest.disableAgents(agents);
    agentInstances.terminate(agent.elasticAgentId(), clusterProfileProperties);
    pluginRequest.deleteAgents(agents);
    return DefaultGoPluginApiResponse.success("");

I would suggest that, if the plugin is configured to allow reuse, you comment out the termination (forgetting about idle time and eventual killing for now) and see how much of it "just works".

I do suspect dealing with possible race conditions and interactions with https://plugin-api.gocd.org/current/elastic-agents/#should-assign-work might be challenging/interesting, but I have not personally looked in detail at the elastic agent area (other than trying to fix some bugs with the ECS elastic agent). I wonder if the plugin will have enough event hooks to know whether an agent is truly idle rather than "working on a long build", or "taking a long time to boot up and register with the server".

ShouldAssignWork is quite dumb right now; it seems to keep the jobId in the pod metadata, implying it assumes a pod is only used for a single job.

    KubernetesInstance pod = agentInstances.find(request.agent().elasticAgentId());
    if (pod == null) {
        return DefaultGoPluginApiResponse.success("false");
    }
    // Work is accepted only when the job id matches the one stored with the pod.
    if (request.jobIdentifier().getJobId().equals(pod.jobId())) {
        LOG.debug(format("[should-assign-work] Job with identifier {0} can be assigned to an agent {1}.", request.jobIdentifier(), pod.name()));
        return DefaultGoPluginApiResponse.success("true");
    }

@arvindsv
Member

At least as envisioned originally, there was no expectation of each job being run on a separate agent. But, the proof will be in the code. :) So, I support and agree with @chadlwilson's suggestion of a quick-and-dirty validation.

@brandonvin
Contributor

brandonvin commented Jan 27, 2023

Alright, thanks for your input @arvindsv and @chadlwilson! I really appreciate your responsiveness and openness on this.

So far, I've used a sort of quick and dirty approach, and I believe I've verified that K8s pods can be reused for multiple jobs. At the moment, I've only looked at handling the happy path. For example:

  1. Set up a few different jobs to use the same K8s elastic profile.
  2. Run the jobs - this will spawn new K8s pods.
  3. Once those jobs finish, rerun the jobs.
  4. Instead of creating new pods, these jobs get assigned to the existing pods.

In this happy path, I do see a big improvement in the time waiting for an agent, as the pod creation and agent bootstrap are skipped.

However, I'm noticing that having long-lived reusable agents may open up some additional changes needed in the plugin in the "non-happy path" cases. For example, if a job running on a K8s agent is canceled, the plugin needs to become aware that the pod is now ready to accept new work. EDIT: after some more testing, I'm finding that the job completion request is sufficient to handle the specific case of canceled jobs. I suppose querying the GoCD server for agent statuses would still be worthwhile, but more of an optimization for quicker recovery in odd cases where pods lose contact with the GoCD server.

I've started some initial design on how to handle these cases. As a strawman, the plugin could piggyback on the server ping request (or a background thread) to query the GoCD server for agent statuses and take action on some of those statuses.
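
To illustrate the strawman, the reconciliation step could look roughly like the sketch below: given agent statuses already fetched from the server, decide what to do with each pod. Everything here is an assumption for discussion. `AgentReconciler`, the `AgentStatus` values, and the `Action` values are hypothetical, the real GoCD agent states may differ, and how the statuses are fetched is left out entirely.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reconciliation step, intended to run on server ping or from a background thread.
final class AgentReconciler {
    // Hypothetical status values; not necessarily the states the GoCD server reports.
    enum AgentStatus { IDLE, BUILDING, LOST_CONTACT, MISSING }

    enum Action { MARK_AVAILABLE, LEAVE_ALONE, TERMINATE_POD }

    static Action actionFor(AgentStatus status) {
        switch (status) {
            case IDLE:
                return Action.MARK_AVAILABLE; // pod can accept new work
            case BUILDING:
                return Action.LEAVE_ALONE;    // a job is still running on the pod
            default:
                return Action.TERMINATE_POD;  // agent is unreachable, so retire the pod
        }
    }

    // Maps each elastic agent id to the action the plugin would take for its pod.
    static Map<String, Action> reconcile(Map<String, AgentStatus> statusesByAgentId) {
        Map<String, Action> actions = new LinkedHashMap<>();
        for (Map.Entry<String, AgentStatus> entry : statusesByAgentId.entrySet()) {
            actions.put(entry.getKey(), actionFor(entry.getValue()));
        }
        return actions;
    }
}
```

Keeping the decision a pure function of the statuses means it can be unit tested without a GoCD server or a cluster.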

Since these are some significant design changes, and I do want to move forward to a PR eventually - is there a preferred way to propose this kind of larger design change (for example, an "enhancement proposal" doc) and get input on it, ahead of a PR?

Thank you!

9 participants