
fork() might not be OK #201

Open
inducer opened this issue Jan 21, 2021 · 7 comments

@inducer
Contributor

inducer commented Jan 21, 2021

I'm seeing new failures after a software upgrade on the CI machines: https://gitlab.tiker.net/inducer/grudge/-/jobs/220795

Note: there is no IB hardware there, but libibverbs (I think?) is linked in and still complains, warning of performance degradation.
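
A quick way to check whether libibverbs really is mapped into the Python process on that machine is to look at /proc/self/maps (a rough, Linux-only sketch; that mpi4py/Open MPI is what pulls the library in is an assumption):

# Rough, Linux-only sketch: initialize MPI (assumed to be what pulls in
# UCX/libibverbs here), then report which RDMA libraries got mapped in.
from mpi4py import MPI  # noqa: F401


def loaded_rdma_libraries():
    hits = set()
    with open("/proc/self/maps") as f:
        for line in f:
            if "libibverbs" in line or "librdmacm" in line:
                hits.add(line.split()[-1])
    return sorted(hits)


print(loaded_rdma_libraries() or "no libibverbs/librdmacm mapped")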

We currently fork in a few places (a minimal reproducer for how this interacts with MPI is sketched after the list):

  • Running version.sh
  • JIT OpenCL compilation might; I think at least the POCL CPU linker is currently invoked as an external binary
  • Mesh generation
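
A minimal reproducer for the interaction referenced above (a sketch only; assumes mpi4py and a UCX/verbs-backed Open MPI, with the subprocess call standing in for any of the fork sites listed):

# Run as, e.g.:  mpiexec -np 2 python fork_after_mpi_init.py
import subprocess

from mpi4py import MPI

comm = MPI.COMM_WORLD

# fork()+exec() after MPI_Init -- this is what running version.sh, an
# external JIT linker, or a mesh generator amounts to.
subprocess.check_call(["true"])

print(f"rank {comm.rank}: forked a child without aborting")
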
inducer changed the title from "Fork might not be OK" to "fork() might not be OK" on Jan 21, 2021
@matthiasdiener
Member

matthiasdiener commented Jan 21, 2021

This looks very strange. Why would that show up as an error now? Has there been a new pip release?

edit:

+++ conda install --quiet --yes pip<20.2
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
## Package Plan ##
  environment location: /var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/.miniforge3/envs/testing
  added / updated specs:
    - pip[version='<20.2']
The following packages will be downloaded:
    package                    |            build
    ---------------------------|-----------------
    islpy-2020.2               |   py38hade25e6_2         3.0 MB  conda-forge
    markupsafe-1.1.1           |   py38h497a2fe_3          27 KB  conda-forge
    numpy-1.19.5               |   py38h18fd61f_1         5.4 MB  conda-forge
    pip-20.0.2                 |           py38_1         1.9 MB  conda-forge
    pyopencl-2020.3.1          |   py38hc10631b_0         706 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        11.0 MB
The following packages will be SUPERSEDED by a higher-priority channel:
  pip                conda-forge/noarch::pip-20.3.3-pyhd8e~ --> conda-forge/linux-64::pip-20.0.2-py38_1
The following packages will be DOWNGRADED:

👀

@inducer
Contributor Author

inducer commented Jan 21, 2021

@matthiasdiener Wrong issue? Did you intend this for #199?

@matthiasdiener
Member

> @matthiasdiener Wrong issue? Did you intend this for #199?

No, I think that showed up in the log you linked above, and I found it curious.

@inducer
Contributor Author

inducer commented Jan 21, 2021

I see! I agree that it's weird, though this failure is what I was talking about:

_________________________________ test_mpi[2] __________________________________
[gw0] linux -- Python 3.8.6 /var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/.miniforge3/envs/testing/bin/python3
Traceback (most recent call last):
  File "/var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/test/test_mpi_communication.py", line 263, in test_mpi
    check_call([
  File "/var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/.miniforge3/envs/testing/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['mpiexec', '-np', '2', '-x', 'RUN_WITHIN_MPI=1', '-x', 'TEST_MPI_COMMUNICATION=1', '/var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/.miniforge3/envs/testing/bin/python3', '/var/lib/gitlab-runner/builds/ed0e33ab/2/inducer/grudge/test/test_mpi_communication.py']' returned non-zero exit status 134.
----------------------------- Captured stdout call -----------------------------
[1611206036.054458] [porter:1982262:0]      rdmacm_cm.c:638  UCX  ERROR rdma_create_event_channel failed: No such device
[1611206036.054491] [porter:1982262:0]     ucp_worker.c:1432 UCX  ERROR failed to open CM on component rdmacm with status Input/output error
[1611206036.168055] [porter:1982263:0]      rdmacm_cm.c:638  UCX  ERROR rdma_create_event_channel failed: No such device
[1611206036.168083] [porter:1982263:0]     ucp_worker.c:1432 UCX  ERROR failed to open CM on component rdmacm with status Input/output error
----------------------------- Captured stderr call -----------------------------
[porter:1982262] ../../../../../../ompi/mca/pml/ucx/pml_ucx.c:273  Error: Failed to create UCP worker
[porter:1982263] ../../../../../../ompi/mca/pml/ucx/pml_ucx.c:273  Error: Failed to create UCP worker
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
A process has executed an operation involving a call
to the fork() system call to create a child process.
As a result, the libfabric EFA provider is operating in
a condition that could result in memory corruption or
other system errors.
For the libfabric EFA provider to work safely when fork()
is called, you will need to set the following environment
variable:
          RDMAV_FORK_SAFE
However, setting this environment variable can result in
signficant performance impact to your application due to
increased cost of memory registration.
You may want to check with your application vendor to see
if an application-level alternative (of not using fork)
exists.
Your job will now abort.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node porter exited on signal 6 (Aborted).
--------------------------------------------------------------------------
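
If the fork warning does turn out to matter, the mitigation it suggests could be applied in the test harness itself (a sketch; paths are shortened relative to the command in the traceback, and the warning's caveat about more expensive memory registration still applies):

import os
import subprocess

env = dict(os.environ)
# Ask libibverbs/libfabric for fork-safe memory registration, as suggested
# by the warning above.
env["RDMAV_FORK_SAFE"] = "1"

subprocess.check_call([
    "mpiexec", "-np", "2",
    "-x", "RDMAV_FORK_SAFE",           # forward the variable to the ranks
    "-x", "RUN_WITHIN_MPI=1",
    "-x", "TEST_MPI_COMMUNICATION=1",
    "python3", "test/test_mpi_communication.py",
], env=env)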

@matthiasdiener
Member

matthiasdiener commented Jan 22, 2021

Have you recently updated the machine? My guess is that:

UCX  ERROR rdma_create_event_channel failed: No such device

is the main error; the fork() error is just a side effect that happens afterwards.

See e.g. here for some discussion:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=980033
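
For the rdmacm error itself, one workaround along the lines discussed there would be to keep Open MPI off the UCX PML on a machine without RDMA hardware (a sketch under the assumption that Open MPI honors MCA settings passed as OMPI_MCA_* environment variables; not verified on this CI runner):

import os
import subprocess

env = dict(os.environ)
# Prefer the ob1 PML with shared-memory/TCP BTLs instead of UCX.
env["OMPI_MCA_pml"] = "ob1"
env["OMPI_MCA_btl"] = "self,vader,tcp"

subprocess.check_call(
    ["mpiexec", "-np", "2", "python3", "test/test_mpi_communication.py"],
    env=env)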

@inducer
Contributor Author

inducer commented Jan 22, 2021

Interestingly, the issue (in the CI run I quoted above) occurred only for the build using a Conda environment, not for the (otherwise equivalent) build in a virtualenv.
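
One way to compare what the two builds actually link against (a sketch; assumes mpi4py is importable in both environments) would be to print the MPI library version from each:

# Print the MPI implementation/version string, e.g. in both CI jobs.
from mpi4py import MPI

print(MPI.Get_library_version())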

@inducer
Contributor Author

inducer commented Jan 27, 2021

More evidence that fork might not be OK: inducer/loopy#204.
