First 8 jobs run by load-balancer appear to have bus errors #88

Open
jonmaddock opened this issue Nov 18, 2024 · 3 comments

Comments

@jonmaddock

Hello, when running the load-balancer on HPC, the first 8 jobs invariably fail to perform any model evaluations and instead give "bus error"s. They also appear to run serially, and it can take a few minutes to get through them all before job 9 onwards runs and actually evaluates the model. Example of load-balancer output when not actually evaluating models:

...
Load balancer running port4242
Listening on port 4242...
Waiting for job 2 to start.
Job 2 started.
2024-11-18T15:29:53Z INFO Job 2 canceled (1 tasks canceled, 0 tasks already finished)
Waiting for job 3 to start.
Job 3 started.
...

Job stdout is only:

Waiting for model server to respond at XX.XX.XX.XX:XXXXX...
Model server responded
======== Running on http://X.X.X.X:XXXXX ========
(Press CTRL+C to quit)

stderr is:

Bus error

After job 8 has run, jobs 9 onwards run fine, with no bus errors and correct evaluations.

Any ideas if I'm doing something wrong? Thanks!

@linusseelinger
Member

It's expected that we get some initial "dummy" jobs since the load balancer queries some information from the model.

I haven't come across the bus error though; could this be related to the model? e.g. is there any logic happening in your model server when requesting input/output dimensions that could possibly cause errors?

@jonmaddock
Author

Ok, that's good to know about the initial dummy jobs. Indeed, there is some logic in my model's methods, so that could be the reason, if those arguments aren't what I expect them to be. I used models/testmodel-python/minimal-server.py as a template, but my get_input_sizes(self, config) method expects a config dictionary containing certain keys, which is what is provided by my client. If it's called by something else (i.e. not my client) or with an empty config, then that could be the problem.

I did this to have flexible input dimensions, but I'll try hard-coding them to see if it makes the "bus error" disappear. Thanks again @linusseelinger.

@linusseelinger
Member

This might be the cause... I'm not 100% sure if the load balancer really queries input dimensions, but at least it'll ask the model server for the available model names and supported features (evaluation, gradient, etc.).

In any case, it's best if the server can be called with an empty config (going with default values in that case); makes it a bit more robust.
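
As a rough illustration of the "default values" idea, here is a minimal sketch of size methods that tolerate an empty or missing config. The key names (`n_inputs`, `n_outputs`), their default values, the `resolve_config` helper, and the `FlexibleModel` class are all hypothetical. The class is a plain stand-in so the snippet runs on its own; in a real server it would presumably subclass `umbridge.Model`, as in the minimal-server.py template.

```python
# Hypothetical defaults; the real keys and values depend on the model.
DEFAULT_CONFIG = {"n_inputs": 2, "n_outputs": 1}


def resolve_config(config):
    """Return the defaults overridden by whatever keys the caller supplied."""
    merged = dict(DEFAULT_CONFIG)
    merged.update(config or {})  # tolerate both {} and None
    return merged


class FlexibleModel:
    """Stand-in for the model class (assumption: not the actual UM-Bridge base class)."""

    def get_input_sizes(self, config):
        # Falls back to the default size when the caller sends no config.
        return [resolve_config(config)["n_inputs"]]

    def get_output_sizes(self, config):
        return [resolve_config(config)["n_outputs"]]

    def supports_evaluate(self):
        return True


if __name__ == "__main__":
    model = FlexibleModel()
    print(model.get_input_sizes({}))               # empty config: defaults apply
    print(model.get_input_sizes({"n_inputs": 5}))  # client config: override applies
```

With this pattern, a query that arrives without the client's keys gets a valid answer instead of raising inside the server, while the client can still request flexible dimensions.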
