Enhancement: handle infinite-requeuing of jobs with pod-stuck-terminating #599

MEllis-github · 2023-08-22T17:28:05Z

WHY

Scenario (observed in practice):

Initial state:

a Pod pod-1 is Running on node x and is 1 of the minimum number of members required, as set in the MCAD scheduling spec.
the name of all the pods in the group is determined by the pod template (within the AppWrapper) and/or the corresponding controller (i.e. the pod spec may be included directly in the genericitem list or it may be embedded in a CRD such as a PyTorchJob. The important point is that the pod name does not change between MCAD’s initial dispatch and subsequent dispatches.

Problematic observed sequence of transitions and states:

Node x becomes unschedulable*.
The pod pod-1 enters a Terminating state.
The pod pod-1 remains in a Terminating state despite attempts to force-terminate the pod (automated within MCAD or manual with …delete —force=true —grace-period=0…). We have observed pods stuck like this for over 24hrs.
When MCAD observes the number of Running pods is less than minMembers, it removes the group and requeues the AppWrapper. This is expected.
Later, MCAD re-dispatches the AppWrapper while the pod pod-1 is still in a Terminating state.
Because pod-1 can still not be transitioned out of a Terminating state and still considered a member of the group, MCAD removes the group and requeues the AppWrapper for the same reason as before (insufficiently many members are Running).
This process repeats for as long as the pod is stuck in a Terminating state (which, again, we have observed to be hours, even over a day) and the job fails to make progress.

The cost:

The job fails to make progress, and may be delayed in restarting even after the pod finally terminates due to e.g. exponential backoff
Resources may idle when in fact there is work available

Existing workaround

Currently, once the above is manually identified to be the case for a given job, manually renaming the job/pods provides a workaround to get the job going again, however this takes time and in the meantime (e.g. overnight) the job does not make progress and resources may idle.

Call for solution ideas:

This is not an MCAD “bug” per se; it is an issue for any job/pod controller which keeps pod names constant between job restarts (of which there are multiple)
It is an opportunity for improved job resilience from MCAD in a job-controller-independent manner
Note, we have observed a variety of reasons why a node might become unschedule in practice, from unexpected hardware failures to intentional cordoning. The sought-for solution here is independent of the underlying cause of failure.

WHAT

What is being asked for?

Ideas/proposals for a MCAD-level (pod/job controller independent) solution.

HOW

Suggestions for how this may be solved. [Optional]

TESTS

List of related tests
The above description could be captured in even a single pod, single node test.

DONE

Bullet point items for what should be completed

The text was updated successfully, but these errors were encountered:

MEllis-github · 2023-08-22T17:52:06Z

If a complete solution from MCAD is not possible, one partial solution supporting external automation for handling this problem would be MCAD emitting a warning or error (and/or AppWrapper state transition) when pods-thought-to-have-been-terminated as part of the requeuing process are detected.

metalcycling · 2023-08-22T17:58:04Z

Good, we agree this is not an MCAD bug (per se as you put it). The name of the pod itself is not what matters here, in fact. It is the label that attaches the pod to the AppWrapper what matters. MCAD queries all the nodes and gets all the pods matching the AppWrapper label. This IS controller independent behavior and doesn't care how the generic items create the pods (or what name they give them for that matter) as long as the pods have the proper labels. I guess I want to understand exactly what it means to "rename" a job because, as I stated, names don't matter, labels do. I suppose what you are actually doing is resubmitting the AppWrapper with a different name which causes the Terminating pod to have the old label and thus not being counted anymore.

Ok, that is how it works, now what?. I honestly doubt there is anything MCAD can do here as it is working as it was designed: a terminating pod could be terminating because of preemption or because of something else. MCAD doesn't know why, so it treats all the cases the same and re-queues the job as many times as it needs (which can be controlled with the requeuing field, by the way).

....

Now that I think of it, there might actually be a very straightforward solution to this. When the job is requeued we change the labels and append the number of times the job has been requeued to the label. This would guarantee that the new requeuing looks like a "renaming" of the AppWrapper without having to actually do that.

MEllis-github · 2023-08-22T18:01:53Z

Excellent distinction, and idea, thank you!
Yes, we are deriving the label from the job name, but this is also not changed between re-queuings (or not currently if your proposal succeeds).

metalcycling · 2023-08-22T18:04:47Z

I'll work on this today and make a PR as soon as I can.

asm582 · 2023-08-22T18:36:17Z

I don't think changing the spec of the object is a good solution. I guess we can open ADR for this to be implemented.

asm582 · 2023-08-22T18:47:25Z

If the pod is stuck in the terminating state then I suspect it is waiting on some finalizer. a real fix could be that MCAD should not re-queue until the finalizer is satisfied or delete the object without the finalizer being honored. this could be a user intent represented by a flag where we do or do not honor the finalizer.

metalcycling · 2023-08-22T18:52:29Z

They need this ASAP. The ADR process would take longer.

asm582 · 2023-08-22T18:59:46Z

We can try to move ADRs quickly. Here is the link to ADR template, let try to keep ADR short with details about final design: https://github.com/project-codeflare/adr/blob/main/PCF-ADR-0000-template.md
do we have a final design ready with us?

metalcycling · 2023-08-22T19:01:51Z

We need to agree on this, so I'll hold working on the implementation until we have an agreement.

@asm582 is right in that changing the spec of the object is not ideal, but it would get the job done. Waiting for the pod to get out of terminating state so the finalizer gets removed is effectively preventing the job from making progress since that pod is going to be there for days. The example they mentioned even involves the node going away, so I don't know how the finalizer is supposed to be removed. To support the case when users want to respect the finalizer, we can make this a field of the spec.schedulingSpec.requeuing stanza so it can be turned off. So we support a mode that requeues and changes the names of the labels, and one that doesn't and respects finalizers. Would that sound good?

metalcycling · 2023-08-22T21:12:03Z

After doing some testing locally with my own finalizers simulating what happens when the node dies, I realized something. The problem is indeed the naming of the pods that are created. Lets take for example the training operator. Whenever a PyTorchJob is created by the operator, it spans pods with names [<name>-master-0, <name>-worker-0, <name>-worker-1, <name>-worker-2, ...]. Lets say <name>-worker-1 dies and is now in terminating status forever. Relaunching the PyTorchJob with the same <name> will create the exact same group of pods as shown before, of which <name>-worker-1 already exists and thus is not recreated as two pods cannot have the same name. That pod may be patched with the new label but it doesn't change the fact that it is terminating forever, so the problem is the same. The solution would be to remove the finalizers preventing the pod from being deleted, so it can be recreated after requeuing. So the solution would look like:

Pod dies and stays terminating forever due to finilizers (don't know of any other way that would cause this)
MCAD realizes there are less than minAvailable pods triggering requeuing
MCAD deletes the generic items AND for all the pods with the appwrapper label it removes the finializers so the pods can be deleted
MCAD redispatches the AppWrapper and things continue as intended

This shouldn't need an ADR as the pods are just modified to remove the finalizer so they can be deleted. The AppWrapper stays the same. I guess if we want to disable this finalizer deletion then the user can control that in the requeuing field and so for that we would need an ADR. However, for anyone using MCAD to manage jobs, I don't see why they would put finalizers to their generic items which would mess up MCAD's behavior.

Opinions @asm582 @MEllis-github

MEllis-github · 2023-08-22T21:52:50Z

I agree that a comprehensive solution also depends on the behavior of the respective controller. The training-operator behavior is of concern for PyTorchJob instances. Jobs are also submitted to our system as just Appwrap'ed Pod and PodGroup instances via torchX+MCAD. Users can also submit batch v1 jobs.
* For the training-operator, since this 2021 PR in kubeflow/common, it appears this scenario is handled. The replacement pod will have a unique UID. It's certainly worth including in the test suite though.
As an aside, for a comprehensive solution, there is also additional tuning we can do at the cluster scope. For example, here is an old (multi-year) issue with a number of suggestions that may still be relevant for us.
About solutions at the MCAD level: has it worked in local testing @metalcycling ? Have you been able to create the scenario where a node is lost?

metalcycling · 2023-08-22T22:16:20Z

@MEllis-github not really. I tried deleting nodes in the cluster but this doesn't do what you see. I imagine, internally, minikube does the proper cleanup before deleting a node, so I need a different strategy. I just want a way to be able to see this behavior in my local cluster so I can see why the pod stays in infinite termination. It seems that it is not about a finalizer so if that is the case, then what is it? Also for the record, even though I haven't been able to reproduce the test, yet, I know it is not a finalizer because calling kubectl delete --force on a pod that has a finalizer doesn't do anything until the finalizer is removed. So, if that is how you are recovering from a faulty node (by forcing deletions) then this is a different matter to address, and my solution no longer works.

MEllis-github · 2023-08-22T22:38:29Z

Building on this idea and this concept, does MCAD internally represent and track the expectation that a pod deleted while "rewrapping" and requeuing an AppWrapper should be gone before the next dispatch attempt? If the requeuing cycle can be stopped in the scenario where expectations are not met, that may be a first step.

MEllis-github · 2023-09-28T18:25:36Z

From discussion, the consensus is to address this in MCAD by introducing support for monitoring resource deletion i.e. MCAD will not only issue deletion of resources as it does now, but also ensure the deletion is complete before the next attempt to dispatch.

@tardieu please add any comments or clarifications you think would be helpful here. I just wanted to leave a quick status note.

github-project-automation bot added this to Project CodeFlare Sprint Board Aug 22, 2023

tardieu moved this to In Progress in Project CodeFlare Sprint Board Oct 9, 2023

metalcycling mentioned this issue Oct 20, 2023

Forceful deletion of remaining pods after cleanup #670

Draft

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhancement: handle infinite-requeuing of jobs with pod-stuck-terminating #599

Enhancement: handle infinite-requeuing of jobs with pod-stuck-terminating #599

MEllis-github commented Aug 22, 2023 •

edited

Loading

MEllis-github commented Aug 22, 2023

metalcycling commented Aug 22, 2023 •

edited

Loading

MEllis-github commented Aug 22, 2023 •

edited

Loading

metalcycling commented Aug 22, 2023

asm582 commented Aug 22, 2023

asm582 commented Aug 22, 2023 •

edited

Loading

metalcycling commented Aug 22, 2023

asm582 commented Aug 22, 2023

metalcycling commented Aug 22, 2023 •

edited

Loading

metalcycling commented Aug 22, 2023

MEllis-github commented Aug 22, 2023 •

edited

Loading

metalcycling commented Aug 22, 2023

MEllis-github commented Aug 22, 2023 •

edited

Loading

MEllis-github commented Sep 28, 2023

Enhancement: handle infinite-requeuing of jobs with pod-stuck-terminating #599

Enhancement: handle infinite-requeuing of jobs with pod-stuck-terminating #599

Comments

MEllis-github commented Aug 22, 2023 • edited Loading

WHY

WHAT

HOW

TESTS

DONE

MEllis-github commented Aug 22, 2023

metalcycling commented Aug 22, 2023 • edited Loading

MEllis-github commented Aug 22, 2023 • edited Loading

metalcycling commented Aug 22, 2023

asm582 commented Aug 22, 2023

asm582 commented Aug 22, 2023 • edited Loading

metalcycling commented Aug 22, 2023

asm582 commented Aug 22, 2023

metalcycling commented Aug 22, 2023 • edited Loading

metalcycling commented Aug 22, 2023

MEllis-github commented Aug 22, 2023 • edited Loading

metalcycling commented Aug 22, 2023

MEllis-github commented Aug 22, 2023 • edited Loading

MEllis-github commented Sep 28, 2023

MEllis-github commented Aug 22, 2023 •

edited

Loading

metalcycling commented Aug 22, 2023 •

edited

Loading

MEllis-github commented Aug 22, 2023 •

edited

Loading

asm582 commented Aug 22, 2023 •

edited

Loading

metalcycling commented Aug 22, 2023 •

edited

Loading

MEllis-github commented Aug 22, 2023 •

edited

Loading

MEllis-github commented Aug 22, 2023 •

edited

Loading