Runner pod ephemerality with emptyDir #481

Open
joshrichards37 opened this issue Aug 31, 2022 · 7 comments
Labels
question Further information is requested

Comments

@joshrichards37

joshrichards37 commented Aug 31, 2022

Hi there,

I am in the process of implementing the operator in our k8s cluster, and everything has been great and straightforward so far.

I just have a question about the ephemerality of the pods. I have tried using the myoung34 derivate of the container image and passing the EPHEMERAL env var through. This does restart the runner container, which is great, but it does not restart the pod, so the emptyDir volumes are not recreated and persist on the cluster node.

Using the myoung34 derivate also doesn't seem to work with the runner reconciliation, so autoscaling isn't working for me right now. Here are some operator logs from when I was using the derivate:

2022-08-31T11:50:17.994Z	INFO	controllers.GithubActionRunner	Registration token expired, updating	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:50:18.236Z	INFO	controllers.GithubActionRunner	Unregistering runner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox", "name": "runner-poolsandbox-pod-hh2bv", "id": 9895}
2022-08-31T11:50:18.613Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:50:18.843Z	INFO	controllers.GithubActionRunner	Scaling up	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox", "numInstances": 1}
2022-08-31T11:50:18.868Z	INFO	controllers.GithubActionRunner	Creating a new Pod	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox", "Pod.Namespace": "github-actions-runner-operator", "Pod.Name": "runner-poolsandbox-pod-9ll4b", "result": "created"}
2022-08-31T11:50:18.869Z	DEBUG	events	Normal	{"object": {"kind":"GithubActionRunner","namespace":"github-actions-runner-operator","name":"runner-poolsandbox","uid":"0f373e1e-2712-45ea-9a1a-d7dc974533f7","apiVersion":"garo.tietoevry.com/v1alpha1","resourceVersion":"1775143"}, "reason": "Scaling", "message": "Created pod github-actions-runner-operator/runner-poolsandbox-pod-9ll4b"}
2022-08-31T11:50:18.876Z	DEBUG	events	Warning	{"object": {"kind":"GithubActionRunner","namespace":"github-actions-runner-operator","name":"runner-poolsandbox","uid":"0f373e1e-2712-45ea-9a1a-d7dc974533f7","apiVersion":"garo.tietoevry.com/v1alpha1","resourceVersion":"1775143"}, "reason": "ProcessingError", "message": "Operation cannot be fulfilled on githubactionrunners.garo.tietoevry.com \"runner-poolsandbox\": the object has been modified; please apply your changes to the latest version and try again"}
2022-08-31T11:50:18.884Z	ERROR	util.api	unable to update status	{"error": "Operation cannot be fulfilled on githubactionrunners.garo.tietoevry.com \"runner-poolsandbox\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/evryfs/github-actions-runner-operator/controllers.(*GithubActionRunnerReconciler).manageOutcome
	/workspace/controllers/githubactionrunner_controller.go:181
github.com/evryfs/github-actions-runner-operator/controllers.(*GithubActionRunnerReconciler).handleScaling
	/workspace/controllers/githubactionrunner_controller.go:137
github.com/evryfs/github-actions-runner-operator/controllers.(*GithubActionRunnerReconciler).Reconcile
	/workspace/controllers/githubactionrunner_controller.go:97
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-08-31T11:50:18.884Z	ERROR	controller.githubactionrunner	Reconciler error	{"reconciler group": "garo.tietoevry.com", "reconciler kind": "GithubActionRunner", "name": "runner-poolsandbox", "namespace": "github-actions-runner-operator", "error": "Operation cannot be fulfilled on githubactionrunners.garo.tietoevry.com \"runner-poolsandbox\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-08-31T11:50:18.884Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:50:19.118Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:50:19.131Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:50:19.460Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:50:19.469Z	ERROR	util.api	unable to update status	{"error": "Operation cannot be fulfilled on githubactionrunners.garo.tietoevry.com \"runner-poolsandbox\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/evryfs/github-actions-runner-operator/controllers.(*GithubActionRunnerReconciler).manageOutcome
	/workspace/controllers/githubactionrunner_controller.go:181
github.com/evryfs/github-actions-runner-operator/controllers.(*GithubActionRunnerReconciler).handleScaling
	/workspace/controllers/githubactionrunner_controller.go:122
github.com/evryfs/github-actions-runner-operator/controllers.(*GithubActionRunnerReconciler).Reconcile
	/workspace/controllers/githubactionrunner_controller.go:97
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-08-31T11:50:19.469Z	ERROR	controller.githubactionrunner	Reconciler error	{"reconciler group": "garo.tietoevry.com", "reconciler kind": "GithubActionRunner", "name": "runner-poolsandbox", "namespace": "github-actions-runner-operator", "error": "Operation cannot be fulfilled on githubactionrunners.garo.tietoevry.com \"runner-poolsandbox\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
2022-08-31T11:50:19.474Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:50:19.705Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:51:19.716Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:51:19.949Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:52:19.963Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:52:20.190Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:53:20.203Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:53:20.434Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:54:20.450Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:54:20.684Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:55:20.702Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:55:20.935Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:56:20.954Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T11:56:21.209Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}

When I ran some tests using the master image, the behaviour seemed to be:

  • Scale pod up
  • Schedule workload on pod
  • Scale up additional pod to pick up work
  • Remove original pod once work is complete and no jobs are pending
  • Additional pod remains waiting to pick up work

This is great if we don't have many jobs waiting to be processed. However, we sometimes have tens of jobs waiting, and we don't want to risk running out of disk space on our cluster nodes. We are looking at implementing Karpenter in the future to handle the scaling of cluster nodes, but don't have the time to do so right now.

Is there a way right now to make the master image behave in an ephemeral way by recreating the pod and emptyDirs when the job has finished?

Thanks in advance

@davidkarlsen
Collaborator

davidkarlsen commented Aug 31, 2022

The right way to have ephemeral pods is to set the ephemeral flag on the pod (see https://github.com/myoung34/docker-github-actions-runner#environment-variables). These pods will then start up, run their job, and once they finish they should reach status Completed and eventually be deleted.

To control the scaling (and thus avoid running out of resources), set the following fields in the CR:

  maxRunners: 18
  minRunners: 0
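
To illustrate what these two fields bound, here is a minimal sketch of clamping a desired runner count to the configured range. How the operator actually derives the desired count is not shown in this thread, so `clampRunners` and its inputs are hypothetical.

```go
package main

import "fmt"

// clampRunners bounds a desired runner count to [minRunners, maxRunners].
// Hypothetical helper for illustration; not the operator's implementation.
func clampRunners(desired, minRunners, maxRunners int) int {
	if desired < minRunners {
		return minRunners
	}
	if desired > maxRunners {
		return maxRunners
	}
	return desired
}

func main() {
	// With maxRunners: 18 and minRunners: 0, a backlog of 30 jobs
	// can never produce more than 18 runner pods.
	fmt.Println(clampRunners(30, 0, 18))
	// With no pending work, the pool may scale down to zero.
	fmt.Println(clampRunners(0, 0, 18))
}
```

Since each runner pod brings its own emptyDir volumes, capping the number of pods also caps the number of those volumes that can exist at once.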

@davidkarlsen davidkarlsen added the question Further information is requested label Aug 31, 2022
@joshrichards37
Author

joshrichards37 commented Aug 31, 2022

Hi @davidkarlsen, thanks for your response.

I have the scaling configured in my deployment, and it works fine when using the quay.io/evryfs/github-actions-runner:master image. Here is my deployment file:

apiVersion: garo.tietoevry.com/v1alpha1
kind: GithubActionRunner
metadata:
  name: runner-poolsandbox
  namespace: github-actions-runner-operator
spec:
  minRunners: 1
  maxRunners: 6
  organization: jugo-io
  reconciliationPeriod: 1m
  tokenRef:
    key: GH_TOKEN
    name: actions-runner
  podTemplateSpec:
    metadata:
      annotations:
        prometheus.io/scrape: 'false'
        prometheus.io/port: '3903'
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchExpressions:
                    - key: garo.tietoevry.com/pool
                      operator: In
                      values:
                        - runner-poolsandbox
      containers:
        - name: runner
          env:
            - name: RUNNER_DEBUG
              value: 'true'
            - name: DOCKER_TLS_CERTDIR
              value: /certs
            - name: DOCKER_HOST
              value: 'tcp://localhost:2376'
            - name: DOCKER_TLS_VERIFY
              value: '1'
            - name: DOCKER_CERT_PATH
              value: /certs/client
            - name: GH_ORG
              value: jugo-io
            - name: RUNNER_SCOPE
              value: org
            - name: ORG_NAME
              value: jugo-io
            - name: ACCESS_TOKEN
              valueFrom:
                secretKeyRef:
                  name: actions-runner
                  key: GH_TOKEN
            - name: ACTIONS_RUNNER_INPUT_LABELS
              value: sandbox
            - name: LABELS
              value: 'self-hosted,sandbox'
            - name: ACTIONS_RUNNER_INPUT_EPHEMERAL
              value: 'true'
            - name: EPHEMERAL
              value: 'true'
          envFrom:
            - secretRef:
                name: runner-poolsandbox-regtoken
          image: 'quay.io/evryfs/github-actions-runner:myoung34-derivate'
          imagePullPolicy: IfNotPresent
          resources: {}
          volumeMounts:
            - mountPath: /certs
              name: docker-certs
            - mountPath: /home/runner/_diag
              name: runner-diag
            - mountPath: /home/runner/_work
              name: runner-work
        - name: docker
          env:
            - name: DOCKER_TLS_CERTDIR
              value: /certs
          image: 'docker:stable-dind'
          imagePullPolicy: Always
          args:
            - '--mtu=1430'
          resources: {}
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /var/lib/docker
              name: docker-storage
            - mountPath: /certs
              name: docker-certs
            - mountPath: /home/runner/_work
              name: runner-work
        - name: exporter
          image: 'quay.io/evryfs/github-actions-runner-metrics:v0.0.3'
          ports:
            - containerPort: 3903
              protocol: TCP
          volumeMounts:
            - name: runner-diag
              mountPath: /_diag
              readOnly: true
      volumes:
        - emptyDir: {}
          name: runner-work
        - emptyDir: {}
          name: runner-diag
        - emptyDir: {}
          name: mvn-repo
        - emptyDir: {}
          name: docker-storage
        - emptyDir: {}
          name: docker-certs

With this config, the pod starts, the job runs, and the runner container restarts, but the pod remains. It never enters the Completed state.

I think it's because of the problems shown in the operator logs: for some reason, the operator doesn't seem to be able to scale or reconcile the pod when using the myoung34 derivate image.

@joshrichards37
Author

joshrichards37 commented Aug 31, 2022

Here's the behaviour captured from the runner when using the myoung34 image and setting EPHEMERAL in the env vars:

runner-poolsandbox-pod-xvczz                                    3/3     Running       0          3m21s
runner-system-github-actions-runner-operator-57b65d6d6c-xp9vx   1/1     Running       0          23m




runner-poolsandbox-pod-xvczz                                    2/3     NotReady      0          5m6s
runner-poolsandbox-pod-xvczz                                    3/3     Running       1 (1s ago)   5m7s

As you can see, it restarts the runner container, but the pod neither goes into the Completed state nor restarts.
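
One possible explanation, sketched under assumptions: the manifest above does not set `restartPolicy`, and the Kubernetes default is `Always`, so the kubelet restarts the runner container when it exits. A pod only reaches the succeeded phase (shown as Completed) once every container has exited successfully and none will be restarted, and this pod also has long-running `docker` and `exporter` containers. The code below is a simplified model of that rule, not Kubernetes source, and whether the operator overrides `restartPolicy` is not shown in this thread.

```go
package main

import "fmt"

// container is a simplified view of one container's state in a pod.
type container struct {
	name     string
	exited   bool
	exitCode int
}

// willRestart reports whether an exited container would be restarted under
// the given pod-level restartPolicy ("Always", "OnFailure" or "Never").
func willRestart(c container, restartPolicy string) bool {
	switch restartPolicy {
	case "Always":
		return true
	case "OnFailure":
		return c.exitCode != 0
	default: // "Never"
		return false
	}
}

// podSucceeded is true only when every container has exited with code 0
// and none of them will be restarted.
func podSucceeded(cs []container, restartPolicy string) bool {
	for _, c := range cs {
		if !c.exited || c.exitCode != 0 || willRestart(c, restartPolicy) {
			return false
		}
	}
	return true
}

func main() {
	// The runner has finished its job; the two sidecars keep running.
	pod := []container{
		{name: "runner", exited: true},
		{name: "docker"},
		{name: "exporter"},
	}
	fmt.Println("Always:", podSucceeded(pod, "Always"))
	fmt.Println("Never:", podSucceeded(pod, "Never"))

	// A pod with only the runner container, for comparison.
	solo := []container{{name: "runner", exited: true}}
	fmt.Println("solo, Never:", podSucceeded(solo, "Never"))
}
```

In this model, the three-container pod is never reported as succeeded under either policy, because the sidecars have not exited; only the single-container pod with `Never` is.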

The whole time, I was getting these log messages on the operator:

2022-08-31T16:09:44.778Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-08-31T16:09:45.009Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}

Let me know if you need anything else

@joshrichards37
Author

joshrichards37 commented Sep 1, 2022

So I have reprovisioned the cluster, as there seemed to be some lingering resources with bad configuration breaking things. The runner container is restarting, and that seems to clear down the _work directory, which is an emptyDir, so that's fine. The only remaining issue now is that the operator isn't scaling the pods because the runner API and the pod count are out of sync.

2022-09-01T13:28:07.304Z	INFO	controllers.GithubActionRunner	Reconciling GithubActionRunner	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}
2022-09-01T13:28:07.546Z	INFO	controllers.GithubActionRunner	Pods and runner API not in sync, returning early	{"githubactionrunner": "github-actions-runner-operator/runner-poolsandbox"}

I read through the code briefly (I'm not very experienced with Go), and it looks like it's because this isn't returning true:

func (r podRunnerPairList) inSync() bool {
	return r.numPods() == r.numRunners()
}

We have one pod running at the moment, but I'm wondering whether the myoung34 derivate image is missing something that stops the operator from recognising it as a runner pod.
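
To make the check concrete, here is a minimal sketch of how a count mismatch would produce the `Pods and runner API not in sync, returning early` message. Only the `inSync`, `numPods` and `numRunners` names come from the snippet above; the struct layout and the way the counts are derived are assumptions made for illustration and do not reproduce the operator's real types.

```go
package main

import "fmt"

// podRunnerPairList is a hypothetical simplification: it holds the pod names
// found in the cluster and the runner names reported by the GitHub API.
type podRunnerPairList struct {
	pods    []string
	runners []string
}

func (r podRunnerPairList) numPods() int    { return len(r.pods) }
func (r podRunnerPairList) numRunners() int { return len(r.runners) }

// inSync mirrors the check quoted above: the two counts must match.
func (r podRunnerPairList) inSync() bool {
	return r.numPods() == r.numRunners()
}

// reconcile returns early when the counts disagree, so no scaling happens.
func reconcile(r podRunnerPairList) string {
	if !r.inSync() {
		return "Pods and runner API not in sync, returning early"
	}
	return "in sync, scaling can proceed"
}

func main() {
	// One pod with one matching registered runner.
	healthy := podRunnerPairList{
		pods:    []string{"runner-poolsandbox-pod-xvczz"},
		runners: []string{"runner-poolsandbox-pod-xvczz"},
	}
	// One pod, but the API lists no runner that is counted for it.
	drifted := podRunnerPairList{
		pods: []string{"runner-poolsandbox-pod-xvczz"},
	}
	fmt.Println(reconcile(healthy))
	fmt.Println(reconcile(drifted))
}
```

If the operator behaves like this model, any pod whose runner is missing from the API, or is not counted for the pool, would keep the reconcile loop returning early on every pass.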

@joshrichards37
Author

Hi @davidkarlsen, any update on this?

@davidkarlsen
Collaborator

@joshrichards37
Only the derivate image supports ephemeral runners.
Also make sure you run the latest version of the operator.
Does it work for you then?

@ankitjain28may

@davidkarlsen I am having the same issue: when running the myoung34 derivate image, the operator isn't recognizing the GitHub runner and isn't able to scale the runners.
