
Enhancement: Async gbasf2 submission and/or download to avoid delay of scheduling #129

Open
meliache opened this issue Sep 23, 2021 · 2 comments
Labels
enhancement New feature or request gbasf2 Concerns the gbasf2/grid b2luigi wrapper help wanted Extra attention is needed

Comments

@meliache
Collaborator

The gbasf2 submission and dataset download operations take a long time. Even when remote workers run in parallel, scheduling happens serially by default. (Except when the parallel_scheduling config option is set to true. However, this didn't work for me; if you had success with it, please message me.) The long gbasf2 submission and the dataset download seem to block the scheduling until the operation is done. This is something that I can live with, since usually only a few gbasf2 projects are required, but it would be cool to do something about it.

The gbasf2 dataset download is currently triggered in the get_job_status method as a subroutine call when the gbasf2 project is all done. Maybe we can initiate the download as an async subprocess and only mark the job as really complete when the download is done, at least when the gbasf2_download_dataset b2luigi option is set.
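To make the idea concrete, here is a minimal sketch of a status check that starts the download in the background and keeps reporting "running" until it has finished. It is not the existing b2luigi code: the JobStatus enum, the DownloadingJob class, its download_cmd argument and the project_done flag are all hypothetical stand-ins, and only subprocess.Popen and Popen.poll() are real standard-library calls.

```python
import subprocess
from enum import Enum


class JobStatus(Enum):
    # Stand-in for the wrapper's job states (names are assumptions).
    running = "running"
    successful = "successful"
    aborted = "aborted"


class DownloadingJob:
    """Hypothetical helper that tracks a download started in the background."""

    def __init__(self, download_cmd):
        # download_cmd is whatever command list performs the download (assumption).
        self._download_cmd = download_cmd
        self._download_proc = None

    def get_job_status(self, project_done):
        if not project_done:
            return JobStatus.running
        if self._download_proc is None:
            # Start the download and return immediately instead of waiting for it.
            self._download_proc = subprocess.Popen(self._download_cmd)
            return JobStatus.running
        returncode = self._download_proc.poll()  # None while still running
        if returncode is None:
            return JobStatus.running
        # Only report success once the download process exited cleanly.
        return JobStatus.successful if returncode == 0 else JobStatus.aborted
```

Each call returns right away, so repeated status checks would not block the scheduler. Since Popen does not redirect stdin or stdout unless asked to, the child process still shares the terminal, though prompts from several downloads running at once could interleave.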

Something similar might be done for the submission.

This is not easy and I don't know if we can do both cases. The subprocess might sometimes require user input, e.g. a CA certificate or SSH key password, so this should still work. Error handling also needs to be thought about. As I don't have much experience with async subprocesses, I'd be happy about help.

If I'm just too stupid for parallel_scheduling and these blocking operations are no problem with it properly enabled, then this can be closed. (Though parallel_scheduling also only works for picklable tasks.)

@meliache meliache added enhancement New feature or request gbasf2 Concerns the gbasf2/grid b2luigi wrapper help wanted Extra attention is needed labels Sep 23, 2021
@Bilokin
Contributor

Bilokin commented Oct 26, 2021

This is a very interesting functionality for our project. I was wondering whether one could split the Basf2PathTask that runs on the grid into separate luigi tasks, like JobSubmissionTask, JobMonitoringTask and DatasetDownloadingTask, which might help to parallelize the code, if that makes sense.

@Bilokin
Contributor

Bilokin commented Oct 13, 2022

Hi @meliache,

this ticket is the closest to the topic I would like to raise.
The gbasf2 project submission algorithm does not submit all projects first and then wait for them to finish; instead, some project submissions happen after the project monitoring has started.
This is not optimal, and we need to ensure that all gbasf2 projects have been submitted at the start of the b2luigi process.
I am still not sure why this happens, but do you have an idea how to fix the issue?
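To illustrate the order I have in mind, here is a minimal sketch of "submit everything first, then monitor". It only shows the desired behaviour and is not a fix inside b2luigi: the function submit_all_then_monitor and its submit and is_done callables are hypothetical placeholders for whatever actually submits a project and checks its status.

```python
import time


def submit_all_then_monitor(projects, submit, is_done, poll_interval=1.0):
    """Submit every project before the first status check (illustrative only)."""
    projects = list(projects)

    # Phase 1: submit all projects up front, without polling in between.
    for project in projects:
        submit(project)

    # Phase 2: poll only the projects that have not finished yet.
    pending = set(projects)
    while pending:
        pending = {p for p in pending if not is_done(p)}
        if pending:
            time.sleep(poll_interval)
```

With this ordering, no submission can be delayed by the monitoring of an earlier project, which is the behaviour I would like to get.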
