Hello, when running the load balancer on HPC, the first 8 jobs invariably seem not to perform any model evaluations and instead fail with "Bus error". They also appear to run serially, and it can take a few minutes to get through them all before job 9 onwards runs and actually evaluates the model. Example of load balancer output when not actually evaluating models:
...
Load balancer running port4242
Listening on port 4242...
Waiting for job 2 to start.
Job 2 started.
2024-11-18T15:29:53Z INFO Job 2 canceled (1 tasks canceled, 0 tasks already finished)
Waiting for job 3 to start.
Job 3 started.
...
Job stdout is only:
Waiting for model server to respond at XX.XX.XX.XX:XXXXX...
Model server responded
======== Running on http://X.X.X.X:XXXXX ========
(Press CTRL+C to quit)
stderr is:
Bus error
After job 8 has run, job 9 onwards run fine with no bus errors and correct evaluations.
Any ideas if I'm doing something wrong? Thanks!
It's expected that we get some initial "dummy" jobs since the load balancer queries some information from the model.
I haven't come across the bus error though; could this be related to the model? e.g. is there any logic happening in your model server when requesting input/output dimensions that could possibly cause errors?
Ok, that's good to know about the initial dummy jobs. Indeed, there is some logic in my model's methods, so that could be the reason, if those arguments aren't what I expect them to be. I used models/testmodel-python/minimal-server.py as a template, but my get_input_sizes(self, config) method expects a config dictionary containing certain keys, which is what is provided by my client. If it's called by something else (i.e. not my client) or with an empty config, then that could be the problem.
I did this to have flexible input dimensions, but I'll try hard-coding them to see if it makes the "bus error" disappear. Thanks again @linusseelinger.
This might be the cause... I'm not 100% sure if the load balancer really queries input dimensions, but at least it'll ask the model server for the available model names and supported features (evaluation, gradient, etc.).
In any case, it's best if the server can be called with an empty config (going with default values in that case); makes it a bit more robust.
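For illustration, here is a minimal sketch of what "going with default values" could look like. It is a standalone class that does not import the actual server library, and the config key `input_dim`, the `DEFAULT_CONFIG` dictionary, and the `resolve_config` helper are hypothetical names chosen for this example; only the `get_input_sizes(self, config)` signature comes from the discussion above. In a real server the same pattern would go into the model class derived from the template.

```python
# Hypothetical defaults; the key name "input_dim" is illustrative only.
DEFAULT_CONFIG = {"input_dim": 2}


def resolve_config(config):
    """Merge a possibly empty (or missing) config with default values."""
    merged = dict(DEFAULT_CONFIG)
    merged.update(config or {})  # an empty dict or None leaves the defaults
    return merged


class FlexibleModel:
    """Stand-in for a model class with config-dependent input dimensions."""

    def get_input_sizes(self, config):
        # Works whether the caller is the user's own client (full config)
        # or something else that passes an empty config.
        cfg = resolve_config(config)
        return [int(cfg["input_dim"])]

    def get_output_sizes(self, config):
        return [1]

    def __call__(self, parameters, config):
        cfg = resolve_config(config)
        if len(parameters[0]) != cfg["input_dim"]:
            raise ValueError("unexpected input dimension")
        # Placeholder evaluation: sum of the inputs.
        return [[sum(parameters[0])]]


model = FlexibleModel()
print(model.get_input_sizes({}))                 # empty config: defaults apply
print(model.get_input_sizes({"input_dim": 5}))   # client-supplied config
```

With this pattern, a query with an empty config returns the default dimensions instead of raising a `KeyError`, while the client can still request other dimensions through its config.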