
fork() might not be OK #201

Open
inducer opened this issue Jan 21, 2021 · 7 comments

@inducer
Contributor

inducer commented Jan 21, 2021

I'm seeing new failures after a software upgrade on the CI machines: https://gitlab.tiker.net/inducer/grudge/-/jobs/220795

Note: there is no IB hardware there, but libibverbs (I think?) is linked in and still complains, warning of performance degradation.
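
A quick way to check whether libibverbs really is mapped into the Python process on that machine is to look at /proc/self/maps (a rough, Linux-only sketch; that mpi4py/Open MPI is what pulls the library in is an assumption):

# Rough, Linux-only sketch: initialize MPI (assumed to be what pulls in
# UCX/libibverbs here), then report which RDMA libraries got mapped in.
from mpi4py import MPI  # noqa: F401


def loaded_rdma_libraries():
    hits = set()
    with open("/proc/self/maps") as f:
        for line in f:
            if "libibverbs" in line or "librdmacm" in line:
                hits.add(line.split()[-1])
    return sorted(hits)


print(loaded_rdma_libraries() or "no libibverbs/librdmacm mapped")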

We currently fork in a few places (a minimal reproducer for how this interacts with MPI is sketched after the list):

  • Running version.sh
  • JIT OpenCL compilation might; I think at least the POCL CPU linker is currently invoked as an external binary
  • Mesh generation
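
A minimal reproducer for the interaction referenced above (a sketch only; assumes mpi4py and a UCX/verbs-backed Open MPI, with the subprocess call standing in for any of the fork sites listed):

# Run as, e.g.:  mpiexec -np 2 python fork_after_mpi_init.py
import subprocess

from mpi4py import MPI

comm = MPI.COMM_WORLD

# fork()+exec() after MPI_Init -- this is what running version.sh, an
# external JIT linker, or a mesh generator amounts to.
subprocess.check_call(["true"])

print(f"rank {comm.rank}: forked a child without aborting")
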
inducer changed the title from "Fork might not be OK" to "fork() might not be OK" on Jan 21, 2021
@matthiasdiener
Member

matthiasdiener commented Jan 21, 2021

This looks very strange. Why would that show up as an error now? Has there been a new pip release?

edit:

+++ conda install --quiet --yes pip<20.2
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
## Package Plan ##
  environment location: /var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/.miniforge3/envs/testing
  added / updated specs:
    - pip[version='<20.2']
The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    islpy-2020.2               |   py38hade25e6_2         3.0 MB  conda-forge
    markupsafe-1.1.1           |   py38h497a2fe_3          27 KB  conda-forge
    numpy-1.19.5               |   py38h18fd61f_1         5.4 MB  conda-forge
    pip-20.0.2                 |           py38_1         1.9 MB  conda-forge
    pyopencl-2020.3.1          |   py38hc10631b_0         706 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        11.0 MB
The following packages will be SUPERSEDED by a higher-priority channel:
  pip                conda-forge/noarch::pip-20.3.3-pyhd8e~ --> conda-forge/linux-64::pip-20.0.2-py38_1
The following packages will be DOWNGRADED:

👀

@inducer
Contributor Author

inducer commented Jan 21, 2021

@matthiasdiener Wrong issue? Did you intend this for #199?

@matthiasdiener
Member

> @matthiasdiener Wrong issue? Did you intend this for #199?

No, I think that showed up in the log you linked above, and I found it curious.

@inducer
Contributor Author

inducer commented Jan 21, 2021

I see! I agree that it's weird, though this failure is what I was talking about:

_________________________________ test_mpi[2] __________________________________
[gw0] linux -- Python 3.8.6 /var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/.miniforge3/envs/testing/bin/python3
Traceback (most recent call last):
  File "/var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/test/test_mpi_communication.py", line 263, in test_mpi
    check_call([
  File "/var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/.miniforge3/envs/testing/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpiexec', '-np', '2', '-x', 'RUN_WITHIN_MPI=1', '-x', 'TEST_MPI_COMMUNICATION=1', '/var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/.miniforge3/envs/testing/bin/python3', '/var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/test/test_mpi_communication.py']' returned non-zero exit status 134.
----------------------------- Captured stdout call -----------------------------
[1611206036.054458] [porter:1982262:0]      rdmacm_cm.c:638  UCX  ERROR rdma_create_event_channel failed: No such device
[1611206036.054491] [porter:1982262:0]     ucp_worker.c:1432 UCX  ERROR failed to open CM on component rdmacm with status Input/output error
[1611206036.168055] [porter:1982263:0]      rdmacm_cm.c:638  UCX  ERROR rdma_create_event_channel failed: No such device
[1611206036.168083] [porter:1982263:0]     ucp_worker.c:1432 UCX  ERROR failed to open CM on component rdmacm with status Input/output error
----------------------------- Captured stderr call -----------------------------
[porter:1982262] ../../../../../../ompi/mca/pml/ucx/pml_ucx.c:273  Error: Failed to create UCP worker
[porter:1982263] ../../../../../../ompi/mca/pml/ucx/pml_ucx.c:273  Error: Failed to create UCP worker
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node porter exited on signal 6 (Aborted).
--------------------------------------------------------------------------
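
If the fork warning does turn out to matter, the mitigation it suggests could be applied in the test harness itself (a sketch; paths are shortened relative to the command in the traceback, and the warning's caveat about more expensive memory registration still applies):

import os
import subprocess

env = dict(os.environ)
# Ask libibverbs/libfabric for fork-safe memory registration, as suggested
# by the warning above.
env["RDMAV_FORK_SAFE"] = "1"

subprocess.check_call([
    "mpiexec", "-np", "2",
    "-x", "RDMAV_FORK_SAFE",           # forward the variable to the ranks
    "-x", "RUN_WITHIN_MPI=1",
    "-x", "TEST_MPI_COMMUNICATION=1",
    "python3", "test/test_mpi_communication.py",
], env=env)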

@matthiasdiener
Member

matthiasdiener commented Jan 22, 2021

Have you recently updated the machine? My guess is that:

UCX  ERROR rdma_create_event_channel failed: No such device

is the main error; the fork() error is just a side effect that happens afterwards.

See e.g. here for some discussion:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=980033
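
For the rdmacm error itself, one workaround along the lines discussed there would be to keep Open MPI off the UCX PML on a machine without RDMA hardware (a sketch under the assumption that Open MPI honors MCA settings passed as OMPI_MCA_* environment variables; not verified on this CI runner):

import os
import subprocess

env = dict(os.environ)
# Prefer the ob1 PML with shared-memory/TCP BTLs instead of UCX.
env["OMPI_MCA_pml"] = "ob1"
env["OMPI_MCA_btl"] = "self,vader,tcp"

subprocess.check_call(
    ["mpiexec", "-np", "2", "python3", "test/test_mpi_communication.py"],
    env=env)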

@inducer
Contributor Author

inducer commented Jan 22, 2021

Interestingly, the issue (in the CI run I quoted above) occurred only for the build using a Conda environment, not for the (otherwise equivalent) build in a virtualenv.
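
One way to compare what the two builds actually link against (a sketch; assumes mpi4py is importable in both environments) would be to print the MPI library version from each:

# Print the MPI implementation/version string, e.g. in both CI jobs.
from mpi4py import MPI

print(MPI.Get_library_version())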

@inducer
Contributor Author

inducer commented Jan 27, 2021

More evidence that fork might not be OK: inducer/loopy#204.
