reapExecutions should not run in Kubernetes #3839

godber · 2024-11-21T00:38:34Z

I think the way reapExecutions is used puts it in a race with the Kubernetes pod shutdown timeout:

teraslice/packages/teraslice/src/lib/cluster/services/execution.ts

Line 515 in f1bd147

async reapExecutions() {

If I am understanding this correctly, it manages the job state on a timer, independent of the other things managing job state.

Realistically we don't see this very often (I have seen it once in the last five days) and the consequence isn't that bad. But it confuses the separation of concerns for how a job gets moved to stopped.

We should find the "normal" way an execution gets moved to stopped to double-check my logic here.


busma13 · 2024-11-21T16:53:18Z

"Normal way": using the API to call stop on a job.

ExecutionService.stopExecution():
- checks if the execution is in a terminal state; if so, logs it and returns
- if already stopping:
  1. calls waitForExecutionStatus() in the background
     - checks clusterState until the ex_id is removed from the active list, then sets status to stopped
     - the active list is built from the response to k8s.list pods, so a stuck pod would remain 'active'
- otherwise (not yet stopping):
  1. sets status to stopping
  2. calls clusterService.stopExecution()
     - calls k8s.deleteExecution
       - deletes the k8s job
  3. calls waitForExecutionStatus() in the background

If blocking is true, the API will call waitForStop(), meaning it checks the execution status until it is stopped before returning.
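
To make the flow above easier to reason about, here is a rough sketch of it as a tiny in-memory model. This is not the real Teraslice code: FakeCluster, FakeExecutionService, run(), checkStopped() and the podStuck flag are all made up for illustration, and the background polling is replaced by a single checkStopped() call that the caller drives.

```typescript
// Simplified, hypothetical model of the stop flow; not the real Teraslice classes.
type Status = 'running' | 'stopping' | 'stopped' | 'failed' | 'completed';

const TERMINAL_STATUSES: Status[] = ['stopped', 'failed', 'completed'];

class FakeCluster {
    // ex_ids that still have pods, standing in for a k8s "list pods" response
    private active: Record<string, boolean> = {};

    start(exId: string): void {
        this.active[exId] = true;
    }

    // Stand-in for deleting the k8s job. A stuck pod stays in the active list.
    deleteExecution(exId: string, podStuck = false): void {
        if (!podStuck) delete this.active[exId];
    }

    isActive(exId: string): boolean {
        return this.active[exId] === true;
    }
}

class FakeExecutionService {
    private statuses: Record<string, Status | undefined> = {};
    private cluster: FakeCluster;

    constructor(cluster: FakeCluster) {
        this.cluster = cluster;
    }

    run(exId: string): void {
        this.cluster.start(exId);
        this.statuses[exId] = 'running';
    }

    getStatus(exId: string): Status {
        const status = this.statuses[exId];
        if (status === undefined) throw new Error(`unknown execution ${exId}`);
        return status;
    }

    // podStuck is a knob for this sketch only; it simulates a pod that never exits.
    stopExecution(exId: string, podStuck = false): void {
        const status = this.getStatus(exId);
        if (TERMINAL_STATUSES.indexOf(status) !== -1) return; // already terminal
        if (status !== 'stopping') {
            this.statuses[exId] = 'stopping';
            this.cluster.deleteExecution(exId, podStuck);
        }
        // The real flow would now poll in the background; here the caller
        // drives the polling by calling checkStopped().
    }

    // One polling iteration of the waitForExecutionStatus() idea:
    // only move to stopped once the ex_id has left the active list.
    checkStopped(exId: string): Status {
        if (this.getStatus(exId) === 'stopping' && !this.cluster.isActive(exId)) {
            this.statuses[exId] = 'stopped';
        }
        return this.getStatus(exId);
    }
}
```

In this model, an execution whose pod exits cleanly reaches stopped on the first checkStopped() call, while one stopped with podStuck set to true stays in stopping no matter how often it is polled.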

busma13 · 2024-11-21T19:04:51Z

If reapExecutions() is removed then we might eventually see a job stuck in stopping. This would be a clue that a pod did not shut down properly and the job needs to be force stopped to clean up the resources.

I'm guessing that, the majority of the time reapExecutions() is used, the pod shuts down right after we change the status to stopped.
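
A rough way to picture the race, assuming the reaper and the waitForExecutionStatus() path are the only two things that can set stopped. The function name, the Writer type and the timings are hypothetical, and this does not reflect how reapExecutions() is actually implemented.

```typescript
// Hypothetical timeline model: whichever event comes first sets the status to stopped.
type Writer = 'reaper' | 'waitForExecutionStatus';

interface RaceOutcome {
    markedStoppedBy: Writer;
    // true when the status says stopped while the pod is still shutting down
    podStillRunningWhenStopped: boolean;
}

function simulateStop(reaperFiresAtMs: number, podExitsAtMs: number): RaceOutcome {
    if (reaperFiresAtMs < podExitsAtMs) {
        // The timer wins: status flips to stopped before the pod is gone.
        return { markedStoppedBy: 'reaper', podStillRunningWhenStopped: true };
    }
    // The pod leaves the active list first, so the normal path sets stopped.
    return { markedStoppedBy: 'waitForExecutionStatus', podStillRunningWhenStopped: false };
}
```

If the guess above is right, the first branch usually resolves itself because the pod exits shortly afterwards; the stuck-pod case is the one where the status and the cluster would disagree for longer.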

godber added bug pkg/teraslice labels Nov 21, 2024

godber assigned jsnoble, busma13 and sotojn Nov 21, 2024
