reapExecutions should not run in Kubernetes #3839

Open
godber opened this issue Nov 21, 2024 · 2 comments

godber (Member) commented Nov 21, 2024

I think the way reapExecutions is used puts it in a race with the Kubernetes pod shutdown timeout.

If I am understanding this correctly, it manages the job state on a timer, independent of the other things managing job state.

Realistically we don't see this very often (I have seen it once in the last five days) and the consequence isn't that bad. But it muddies the separation of concerns for how a job gets moved to stopped.

We should find the "normal" way an execution gets moved to stopped, to double-check my logic here.

busma13 (Contributor) commented Nov 21, 2024

The "normal" way, using the API to call stop on a job:

  • ExecutionService.stopExecution():
    • checks whether the execution is in a terminal state; if so, logs it and returns
    • if stopping already:
      1. calls waitForExecutionStatus() in background
        • checks clusterState until the ex_id is removed from active list, then sets status to stopped.
        • Active list is created from response to k8s.list pods, so a stuck pod would remain 'active'.
    • if not stopped yet:
      1. sets status to stopping
      2. calls clusterService.stopExecution()
        • calls k8s.deleteExecution
          • deletes the k8s job
      3. calls waitForExecutionStatus() in background
  • if blocking is true, the API calls waitForStop(), which checks the execution status until it is stopped before returning
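
The flow above can be sketched roughly as follows. This is a simplified, synchronous model and not the actual Teraslice code: the `Status` type, the `Cluster` interface, `pollOnce()`, and the return values are hypothetical names made up for illustration. The real code waits in the background, whereas here a single poll iteration is shown.

```typescript
// Hypothetical model of the stop flow; all names are illustrative assumptions.
type Status = "running" | "stopping" | "stopped" | "completed" | "failed";

const TERMINAL: ReadonlySet<Status> = new Set<Status>(["stopped", "completed", "failed"]);

interface Cluster {
  // Stand-in for the active list built from the k8s pod listing.
  activeExIds(): string[];
  // Stand-in for clusterService.stopExecution(), which deletes the k8s job.
  deleteJob(exId: string): void;
}

// Models ExecutionService.stopExecution(): returns "noop" for terminal
// executions, otherwise "waiting" to signal that the caller should poll.
function stopExecution(
  exId: string,
  statuses: Map<string, Status>,
  cluster: Cluster
): "noop" | "waiting" {
  const status = statuses.get(exId);
  if (status === undefined) throw new Error(`unknown execution ${exId}`);
  if (TERMINAL.has(status)) return "noop"; // already terminal: nothing to do
  if (status !== "stopping") {
    statuses.set(exId, "stopping");
    cluster.deleteJob(exId);
  }
  return "waiting";
}

// One iteration of the background wait: the status only moves to stopped
// once the execution has left the active list, so a stuck pod blocks it.
function pollOnce(
  exId: string,
  statuses: Map<string, Status>,
  cluster: Cluster
): boolean {
  if (cluster.activeExIds().indexOf(exId) !== -1) return false;
  statuses.set(exId, "stopped");
  return true;
}
```

In this model the only path from stopping to stopped is the active list emptying out, which is why a timer-driven reaper that sets the status independently is a second writer of the same state.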

busma13 (Contributor) commented Nov 21, 2024

If reapExecutions() is removed, then we might eventually see a job stuck in stopping. This would be a clue that a pod did not shut down properly and the job needs to be force stopped to clean up the resources.

I'm guessing that, the majority of the time this is used, the pod shuts down right after we change the status to stopped.
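
If we did rely on "stuck in stopping" as the signal, spotting it could look something like the sketch below. Everything here is an assumption for illustration: the `ExecutionRecord` shape, the `updatedMs` field, and the threshold are hypothetical, not existing Teraslice fields or settings. The threshold would presumably need to be longer than the pod shutdown timeout.

```typescript
// Hypothetical record shape; field names are assumptions for illustration.
interface ExecutionRecord {
  exId: string;
  status: string;
  updatedMs: number; // time of the last status change, in epoch milliseconds
}

// Returns the ids of executions that have sat in "stopping" longer than
// thresholdMs, which would hint that a pod did not shut down properly.
function findStuckStopping(
  records: ExecutionRecord[],
  nowMs: number,
  thresholdMs: number
): string[] {
  return records
    .filter((r) => r.status === "stopping" && nowMs - r.updatedMs > thresholdMs)
    .map((r) => r.exId);
}
```

This only reports candidates; it does not change any status, so it would not compete with the normal stop path.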
