Remember already submitted htcondor jobs to avoid re-submitting #167

Draft: wants to merge 2 commits into main

@meliache (Collaborator) commented Apr 5, 2022

TODOs

  • avoid the "Too many open files" error
  • add unit tests?
  • use this for real-life job submission for a while to check that it does what's expected
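
For the unit-test item, the idea in the PR title (remembering a submitted job so it is not submitted again) could be exercised with a small, file-based sketch like the one below. This is not the code in this PR: `load_submitted_job_id`, `store_submitted_job_id`, `submit_once`, and the JSON record file are hypothetical names chosen for illustration, and `submit` stands in for whatever actually calls `condor_submit`.

```python
import json
import os
import tempfile


def load_submitted_job_id(record_path):
    """Return the stored batch job ID, or None if no usable record exists."""
    try:
        with open(record_path) as record_file:
            return json.load(record_file)["job_id"]
    except (FileNotFoundError, KeyError, TypeError, json.JSONDecodeError):
        return None


def store_submitted_job_id(record_path, job_id):
    """Write the job ID via a temporary file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(record_path))
    # The with-block closes the temporary file before it is renamed.
    with tempfile.NamedTemporaryFile("w", dir=directory, delete=False) as tmp_file:
        json.dump({"job_id": job_id}, tmp_file)
    os.replace(tmp_file.name, record_path)


def submit_once(record_path, submit):
    """Call submit() only if no job ID has been recorded yet (hypothetical helper)."""
    job_id = load_submitted_job_id(record_path)
    if job_id is None:
        job_id = submit()
        store_submitted_job_id(record_path, job_id)
    return job_id
```

Because `submit` is passed in as a callable, a test can replace it with a fake that counts calls and check that a second `submit_once` on the same record does not submit again. Every file is opened in a `with` block, so the sketch itself does not hold file descriptors open between calls.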

@meliache meliache added enhancement New feature or request htcondor concerns the htcondor batch system labels Apr 5, 2022
@meliache meliache self-assigned this Apr 5, 2022
@codecov-commenter

Codecov Report

Merging #167 (caba809) into main (bd14265) will decrease coverage by 0.52%.
The diff coverage is 11.76%.

@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
- Coverage   59.73%   59.21%   -0.53%     
==========================================
  Files          23       23              
  Lines        1530     1547      +17     
==========================================
+ Hits          914      916       +2     
- Misses        616      631      +15     
| Impacted Files | Coverage Δ |
|---|---|
| b2luigi/batch/processes/htcondor.py | 55.88% <11.76%> (-6.31%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bd14265...caba809.

@meliache (Collaborator, Author) commented Apr 5, 2022

While testing, I got the following error after some time, and I'm trying to find out how it relates to this change:

INFO: Worker Worker(salt=572125441, workers=800, host=naf-belle11.desy.de, username=meliache, pid=28250) was stopped. Shutting down Keep-Alive thread
Traceback (most recent call last):
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/interface.py", line 173, in _schedule_and_run
    success &= worker.run()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/worker.py", line 1208, in run
    self._run_task(get_work_response.task_id)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/worker.py", line 1012, in _run_task
    task_process.run()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/__init__.py", line 126, in run
    self.start_job()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/htcondor.py", line 201, in start_job
    output = subprocess.check_output(["condor_submit", submit_file], cwd=submit_file_dir)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 808, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 1484, in _get_handles
    c2pread, c2pwrite = os.pipe()
OSError: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_naf_reconstruction.py", line 81, in <module>
    b2luigi.process(
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/cli/process.py", line 113, in process
    runner.run_local(task_list, cli_args, kwargs)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/cli/runner.py", line 46, in run_local
    run_luigi(task_list, cli_args, kwargs)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/cli/runner.py", line 62, in run_luigi
    luigi.build(task_list, **kwargs)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/interface.py", line 237, in build
    luigi_run_result = _schedule_and_run(tasks, worker_scheduler_factory, override_defaults=env_params)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/interface.py", line 173, in _schedule_and_run
    success &= worker.run()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/luigi/worker.py", line 607, in __exit__
    if task.is_alive():
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/__init__.py", line 135, in is_alive
    job_status = self.get_job_status()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/htcondor.py", line 166, in get_job_status
    job_status = _batch_job_status_cache[self._batch_job_id]
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/site-packages/cachetools/__init__.py", line 371, in __getitem__
    return self.__missing__(key)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/cache.py", line 27, in __missing__
    self._ask_for_job_status(job_id=None)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/retry/api.py", line 80, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/retry/api.py", line 32, in __retry_internal
    return f()
  File "/afs/desy.de/user/m/meliache/.local/lib/python3.8/site-packages/b2luigi/batch/processes/htcondor.py", line 51, in _ask_for_job_status
    output = subprocess.check_output(q_cmd)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 808, in __init__
    errread, errwrite) = self._get_handles(stdin, stdout, stderr)
  File "/cvmfs/belle.cern.ch/el7/externals/v01-11-00/Linux_x86_64/common/lib/python3.8/subprocess.py", line 1484, in _get_handles
    c2pread, c2pwrite = os.pipe()
OSError: [Errno 24] Too many open files
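
Both tracebacks end in `os.pipe()` failing with Errno 24, which means the process had reached its limit on open file descriptors when `subprocess.check_output` tried to create its pipes. One way to investigate, assuming a Linux host where `/proc/self/fd` is available, is to log the descriptor count and the current limit from inside the submitting process. The helper names below are hypothetical, and this is only a diagnostic sketch: it does not identify which code leaves descriptors open.

```python
import os
import resource


def open_fd_count():
    """Count this process's open file descriptors (Linux only, via /proc)."""
    # os.listdir closes its own directory handle before returning,
    # but that handle is still listed while the directory is being read.
    return len(os.listdir("/proc/self/fd")) - 1


def raise_soft_fd_limit():
    """Try to raise the soft RLIMIT_NOFILE to the hard limit; return the soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # keep the existing limit if the system refuses the change
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    soft_limit = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
    print(f"open descriptors: {open_fd_count()} of soft limit {soft_limit}")
```

If the count keeps growing as more jobs are submitted, raising the limit would only delay the error, and the better fix is to find the handles that are never closed. If the count is stable but close to the limit because of the large number of workers, raising the soft limit may be enough.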

Labels
enhancement New feature or request help wanted Extra attention is needed htcondor concerns the htcondor batch system

2 participants