
Elastic Agent Plugin can no longer create new agents after pod restart (or 1 hour) #98

Closed
andcri opened this issue Jan 5, 2024 · 6 comments
Labels
bug Something isn't working

@andcri

andcri commented Jan 5, 2024

Hello,
I'm using the latest version of this plugin through the GoCD Helm chart. When I deploy the Helm chart for the first time everything works fine, but when I restart the pod (or the pod gets restarted by a Helm upgrade) I get the following error:

2024-01-05 10:36:54,589  INFO [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:72 [plugin-cd.go.contrib.elasticagent.kubernetes] - Did not find any running kubernetes pods.
2024-01-05 10:37:54,579 ERROR [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:127 [plugin-cd.go.contrib.elasticagent.kubernetes] - Error occurred while trying to list kubernetes pods:
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1:443/api/v1/namespaces/gocd/pods. Message: Unauthorized. Received status: Status(apiVersion=v1, code=401, details=null, kind=Status, message=Unauthorized, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Unauthorized, status=Failure, additionalProperties={}).

Is this the expected behavior when running in a cluster? Do I need to add extra configuration to avoid this? Thanks

@chadlwilson
Member

It's not expected that something which works fine via the Helm chart the first time stops working thereafter with no other changes made on your side of things. I assume by "works fine" you mean it can create new pods etc, and then after the restart nothing works at all and it cannot create new pods?

There should be many more errors in the logs than that if it cannot create pods etc, I'd have thought.

Which Kubernetes version, Kubernetes type (EKS? GKE? a local single-node deployment? etc?) and GoCD Helm chart version are you using?

@chadlwilson chadlwilson changed the title from "Plugin doenst work after pod restart" to "Plugin doesn't work after pod restart" Jan 5, 2024
@andcri
Author

andcri commented Jan 5, 2024

Thanks for the reply.

By "works fine" I mean that it creates agent pods whenever I trigger the example pipeline in GoCD; the pipeline completes and then the agent pod shuts down.
When I restart the gocd-server pod and trigger the example pipeline in GoCD, nothing happens and I get the logs shown.

I'm using EKS with Kubernetes 1.25, and the chart version I'm using is the latest one: 2.5.1.

These are the full logs I get from the gocd-server pod; after this it repeats the same error:

jvm 1    | 2024-01-05 10:36:54,589  INFO [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:72 [plugin-cd.go.contrib.elasticagent.kubernetes] - Did not find any running kubernetes pods.
jvm 1    | 2024-01-05 10:37:54,579 ERROR [167@MessageListener for ServerPingListener] p.c.g.c.e.k.c.g.c.e.KubernetesPlugin:127 [plugin-cd.go.contrib.elasticagent.kubernetes] - Error occurred while trying to list kubernetes pods:
jvm 1    | io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1:443/api/v1/namespaces/gocd/pods. Message: Unauthorized. Received status: Status(apiVersion=v1, code=401, details=null, kind=Status, message=Unauthorized, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Unauthorized, status=Failure, additionalProperties={}).
jvm 1    |     at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:450)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:418)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:97)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesAgentInstances.refreshAll(KubernetesAgentInstances.java:156)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesPlugin.refreshInstancesForCluster(KubernetesPlugin.java:140)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesPlugin.refreshInstancesForAllClusters(KubernetesPlugin.java:122)
jvm 1    |     at cd.go.contrib.elasticagent.KubernetesPlugin.handle(KubernetesPlugin.java:90)
jvm 1    |     at com.thoughtworks.go.plugin.infra.DefaultPluginManager.lambda$submitTo$0(DefaultPluginManager.java:134)
jvm 1    |     at com.thoughtworks.go.plugin.infra.FelixGoPluginOSGiFramework.executeActionOnTheService(FelixGoPluginOSGiFramework.java:205)
jvm 1    |     at com.thoughtworks.go.plugin.infra.FelixGoPluginOSGiFramework.doOn(FelixGoPluginOSGiFramework.java:164)
jvm 1    |     at com.thoughtworks.go.plugin.infra.DefaultPluginManager.submitTo(DefaultPluginManager.java:131)
jvm 1    |     at com.thoughtworks.go.plugin.access.PluginRequestHelper.submitRequest(PluginRequestHelper.java:49)
jvm 1    |     at com.thoughtworks.go.plugin.access.elastic.v5.ElasticAgentExtensionV5.serverPing(ElasticAgentExtensionV5.java:152)
jvm 1    |     at com.thoughtworks.go.plugin.access.elastic.ElasticAgentExtension.serverPing(ElasticAgentExtension.java:63)
jvm 1    |     at com.thoughtworks.go.plugin.access.elastic.ElasticAgentPluginRegistry.serverPing(ElasticAgentPluginRegistry.java:51)
jvm 1    |     at com.thoughtworks.go.server.messaging.elasticagents.ServerPingListener.onMessage(ServerPingListener.java:31)
jvm 1    |     at com.thoughtworks.go.server.messaging.elasticagents.ServerPingListener.onMessage(ServerPingListener.java:21)
jvm 1    |     at com.thoughtworks.go.server.messaging.activemq.JMSMessageListenerAdapter.runImpl(JMSMessageListenerAdapter.java:83)
jvm 1    |     at com.thoughtworks.go.server.messaging.activemq.JMSMessageListenerAdapter.run(JMSMessageListenerAdapter.java:63)
jvm 1    |     at java.base/java.lang.Thread.run(Unknown Source)
jvm 1    | Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1:443/api/v1/namespaces/gocd/pods. Message: Unauthorized. Received status: Status(apiVersion=v1, code=401, details=null, kind=Status, message=Unauthorized, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Unauthorized, status=Failure, additionalProperties={}).
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:660)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:640)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:589)
jvm 1    |     at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
jvm 1    |     at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:140)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
jvm 1    |     at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.CompletableFuture.complete(Unknown Source)
jvm 1    |     at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:136)
jvm 1    |     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
jvm 1    |     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
jvm 1    |     ... 1 common frames omitted

@chadlwilson
Member

It kind of looks like the service account token you're using somehow doesn't have the permissions for the namespace it's trying to look for agent pods within (a quick way to check is sketched after the list below).

  • Are there any values from the chart that you override on installation (for the server, that is)? Especially anything related to the storage mounts/volumes.

  • Did you go in and modify any of the values in the Elastic Agents Configurations (either the main cluster profile, or the individual elastic agent profile)?

  • How did you go about granting the right permissions for the service account? As noted in Better document how to get service account tokens on modern Kubernetes (kubernetes-elastic-agents#328) there are a few ways to do this. I can't recall whether 1.25 still does all the magic to allow this to work without k8s admin work, so anything you did here would be good to know.

  • Are there any weird changes being shown to the GoCD server config after the restart?

    The Helm chart tries to bootstrap the server with an initial config that makes sense; however, something may have been changed after the first run, or the restart may somehow be removing some important config or not using the right service account token. You may be able to see these changes made to the server config from the Stage Details view for a build you are trying to run - when logged in as an admin, look for the "Config Changed" links on the Stage history:

    [screenshot: "Config Changed" links on the Stage history]
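
If it does turn out to be a permissions problem, a quick way to check from outside the pod is something like the below (the "gocd" namespace and service account names are assumptions for illustration; substitute whatever your chart install actually created):

    # Ask the API server whether the plugin's service account is allowed to
    # list pods in the namespace it watches (names are assumptions).
    kubectl -n gocd auth can-i list pods --as=system:serviceaccount:gocd:gocd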

@chadlwilson
Member

chadlwilson commented Jan 6, 2024

OK, I have confirmed that this is because of changes in the way "bound service account tokens" work in Kubernetes 1.24+. They are now bound to the pod lifecycle rather than to the service account itself, so where they used to basically last forever, they now do not.

What this means is that the "magic" the GoCD Helm chart does on first start (copying the token Kubernetes auto-mounts into the pod into the "cluster profile") only happens once. When the pod restarts, that token is invalidated by the Kubernetes API and is no longer accepted. It may also not work for longer than 1 hour, since the plugin and GoCD cannot re-read the value.
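
For illustration, you can see the new time-bound behaviour by explicitly requesting a token on a 1.24+ cluster - it always comes back with an expiry now (the namespace and service account names below are assumptions):

    # Bound service account tokens are time-limited on Kubernetes 1.24+;
    # explicitly requesting one shows this (names are assumptions for illustration).
    kubectl -n gocd create token gocd --duration=1h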

[screenshot]

Essentially, this makes the automation in the Helm chart deceptive as it won't work for long.

To fix this right now, you need to do something like what's mentioned in gocd/kubernetes-elastic-agents#328: create a long-lived token, extract it from Kubernetes, and then replace the value in your cluster profile with it.
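
For reference, a minimal sketch of what that looks like, assuming the chart's service account and namespace are both named "gocd" (see kubernetes-elastic-agents#328 for the authoritative steps):

    # Create a long-lived token Secret bound to the service account
    # (all names here are assumptions; adjust to your install).
    kubectl -n gocd apply -f - <<'EOF'
    apiVersion: v1
    kind: Secret
    metadata:
      name: gocd-sa-token
      annotations:
        kubernetes.io/service-account.name: gocd
    type: kubernetes.io/service-account-token
    EOF

    # Extract the token value to paste into the cluster profile.
    kubectl -n gocd get secret gocd-sa-token -o jsonpath='{.data.token}' | base64 -d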

Additionally, if you don't manually set the CA Certificate Data, you might experience a different bug I have fixed in a release included in the latest Helm chart version (2.5.2).
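
If you do need to fill that in by hand, one way to get the cluster CA (on reasonably recent clusters) is from the kube-root-ca.crt ConfigMap that Kubernetes publishes into every namespace:

    # Print the cluster CA certificate for the "CA Certificate Data" field
    # (the "gocd" namespace is an assumption; any namespace has this ConfigMap).
    kubectl -n gocd get configmap kube-root-ca.crt -o jsonpath='{.data.ca\.crt}'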

In any case, this needs a better fix, as the Helm chart is currently misleading. Will move this issue to the Helm chart repo, since the main problem is the misleading behaviour of the chart: it makes you think you don't need to do anything with service account tokens to keep it working, since on first install it will work for a while without the user doing anything.

@chadlwilson chadlwilson transferred this issue from gocd/kubernetes-elastic-agents Jan 6, 2024
@chadlwilson chadlwilson changed the title from "Plugin doesn't work after pod restart" to "Elastic Agent Plugin can no longer create new agents after pod restart (or 1 hour)" Jan 6, 2024
@chadlwilson chadlwilson self-assigned this Jan 6, 2024
@chadlwilson chadlwilson added the bug Something isn't working label Jan 6, 2024
@chadlwilson
Member

chadlwilson commented Jan 7, 2024

The root of this problem is now fixed with chart 2.6.0 (which defaults to plugin v4.0.0-505) if you want to give it a go.

Note that the server configuration only happens once, so you'd either need to start with a fresh server (probably not ideal), or upgrade the plugin via the Helm chart and then clear out the older configuration values from the Cluster Profile, so that the new plugin version falls back to the defaults the Helm chart has always ensured are set up (see the sketch below).
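
If you go the upgrade route, that might look something like this (the release name, namespace, repo alias and values file below are all assumptions - use whatever you installed with):

    # Upgrade the existing install to the chart version carrying the fixed plugin
    # (release/namespace/repo names and values file are assumptions for illustration).
    helm repo update
    helm upgrade gocd gocd/gocd --version 2.6.0 --namespace gocd -f my-values.yaml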

@andcri
Author

andcri commented Jan 8, 2024

Thank you very much for the quick fix and the explanation, it works perfectly now 👍
