Finish HyperQueue jobs instead of cancelling in load balancer #51

Open
linusseelinger opened this issue Feb 12, 2024 · 4 comments

@linusseelinger
Member

No description provided.

@jonmaddock

Hi, I also experience this, and I found it confusing as I thought my evaluations were failing! It looks like:

...
Waiting for job 21 to start.
Job 21 started.
2024-11-18T15:38:10Z INFO Job 21 canceled (1 tasks canceled, 0 tasks already finished)
Waiting for job 22 to start.

despite the evaluation being successful.

@linusseelinger
Member Author

Hi @jonmaddock, this is, as it stands, expected. We don't have a cleaner way of shutting down a UM-Bridge server inside an HQ job after model evaluation.

I do see, however, that it's not ideal, since the cancellation looks like an error. Maybe we could make servers optionally accept a termination signal from the client side. We discussed such a signal before; it should not be the default behaviour, but making it opt-in via an environment variable set in the job scripts would be acceptable, I think.
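
To make the idea concrete, here is a rough sketch using only Python's standard library, not the actual UM-Bridge server code. The environment variable name `ALLOW_CLIENT_TERMINATION` and the `/Terminate` endpoint are hypothetical placeholders chosen for illustration; neither exists in UM-Bridge today. The server refuses termination requests unless the job script has opted in.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical names: neither the variable nor the endpoint exists in UM-Bridge.
OPT_IN_ENV_VAR = "ALLOW_CLIENT_TERMINATION"
TERMINATE_PATH = "/Terminate"


def termination_enabled(environ=os.environ):
    """Return True only if the job script explicitly opted in."""
    return environ.get(OPT_IN_ENV_VAR, "0") == "1"


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path == TERMINATE_PATH and termination_enabled():
            self.send_response(200)
            self.end_headers()
            # shutdown() blocks until serve_forever() returns, so it must
            # run in a different thread from the one handling this request.
            threading.Thread(target=self.server.shutdown, daemon=True).start()
        else:
            # Default behaviour: termination requests are rejected.
            self.send_response(403 if self.path == TERMINATE_PATH else 404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Create the server; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), Handler)


def run(port):
    """Serve until a permitted termination request arrives, then return."""
    server = make_server(port)
    server.serve_forever()
    server.server_close()
```

With something like this, the job script would set the variable before starting the server, and the server process could exit normally once the client sends the request, rather than the job having to be cancelled from outside. Whether the load balancer should send that request after every evaluation is a separate design question.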

@annereinarz @chun9l Do you have any opinions on that?

@monabraeunig maybe an interesting next task for you after error handling?

@jonmaddock

@linusseelinger thanks for your response. I completely understand your reasoning; the behaviour was confusing for a newcomer, that's all. Perhaps I could add the current behaviour to the documentation for HPC if we don't proceed with this fix?

@linusseelinger
Member Author

@jonmaddock that'd be great! Let's first see if we can come up with a clean solution regarding termination, though; cancelled jobs have already surprised other users before.
