we need pirate RMA for multithreaded use cases #23
Comments
Locks, fence, and PSCW from multiple threads don't work. However, I believe it is legal to use lockall right after window creation and then … This is not ideal, and I would like to see a way to do thread-scope synchronization (instead of today's process-scope). I had opened #13, but I'm sure we can have a more elegant solution than info keys and window duplication...
I like the idea of OSHMEM contexts. It could be an add-on to the existing APIs. But what is the meaning of "local" and "thread" with e.g. Argobots? What is the meaning of a thread when an Argobots thread migrates to a different OS thread?
@devreal If that's the case, then Open-MPI is broken for multithreaded RMA, and it needs to stop issuing an error about incorrect synchronization usage.
@jeffhammond is there an open issue for it in OMPI? What version of OMPI? A reproducer? I tried the following code and it works with both the generic and the UCX backend:

/* win is an existing window; size is the communicator size and
   NUM_REPS a repetition count, both defined elsewhere. */
MPI_Win_lock_all(0, win);
#pragma omp parallel
{
  #pragma omp for
  for (int k = 0; k < NUM_REPS; ++k) {
    uint64_t res;
    int target = k % size;
    uint64_t val = 1;
    /* concurrent atomic update, then a per-thread flush of that target */
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
} // omp parallel
MPI_Win_unlock_all(win);  /* end the passive-target epoch */
I never understood why request-based operations don't provide remote completion. I guess there were reasons over a decade ago... I think this would be a good addition to RMA. I'm not sure it can entirely solve the problem of thread-scope flushes, though, because tracking requests for large numbers of operations is potentially costly.
@tschuett The duplicated window handles I proposed can be generalized to single-threaded contexts, without the need for binding their resources to any particular thread and without blowing up the API like shmem contexts did.
@devreal The links above both report user problems. I guess you can reproduce with Kokkos Remote Spaces, but I haven't had time to do that.
@devreal Request-based remote completion was rejected because the hardware people didn't like it, and the use cases weren't strong.
But e.g. MPI_Alloc_Contexts(win, &contexts, 5); should be easy to add, and then MPI_Put_with_Context(win, contexts[3], ...);
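A minimal sketch of how that proposed API might be used, assuming a per-context flush; MPI_Context, MPI_Alloc_Contexts, MPI_Put_with_Context, and MPI_Context_flush are proposal names from this comment, not existing MPI functions, and buf/target are placeholders:

/* Hypothetical: all MPI_Context* calls below are proposal names only. */
MPI_Context *contexts;
MPI_Alloc_Contexts(win, &contexts, 5);          /* 5 contexts bound to win */
#pragma omp parallel num_threads(5)
{
    int me = omp_get_thread_num();
    /* Each thread drives its own context, so completion is thread-scoped. */
    MPI_Put_with_Context(win, contexts[me], buf, 1, MPI_DOUBLE,
                         target, 0, 1, MPI_DOUBLE);
    MPI_Context_flush(win, contexts[me]);       /* flush only this context */
}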
Another attempt: MPI 4.0 learned Partitioned Communication, i.e., multi-threaded message passing. There is already research on Partitioned Collectives, i.e., multi-threaded collectives. Maybe there is space for Partitioned RMA? The universe of threads goes through phases/epochs, and within an epoch there are no overlapping operations.
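For reference, this is roughly what MPI 4.0 partitioned point-to-point looks like; a Partitioned RMA would presumably mirror the same phase/epoch pattern. NUM_PARTS, COUNT_PER_PART, dest, tag, and fill_partition are placeholders:

MPI_Request req;
MPI_Psend_init(buf, NUM_PARTS, COUNT_PER_PART, MPI_DOUBLE, dest, tag,
               MPI_COMM_WORLD, MPI_INFO_NULL, &req);
MPI_Start(&req);                       /* begin the partitioned transfer */
#pragma omp parallel for
for (int p = 0; p < NUM_PARTS; ++p) {
    fill_partition(buf, p);            /* placeholder: produce partition p */
    MPI_Pready(p, req);                /* thread-safe per-partition completion */
}
MPI_Wait(&req, MPI_STATUS_IGNORE);
MPI_Request_free(&req);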
For the record: kokkos/kokkos-remote-spaces#51 is a non-issue; OMPI was correct in complaining when window locks are used in concurrent threads. Window locks are not thread-safe when concurrently accessing the same target. Still, the issue of multi-threaded RMA is real, and we should talk about whether we want explicit contexts and/or remote-completing request-based ops 👍
This isn't impossible. The solution today is to create a window for each thread, with the same window buffer. An advantage of this approach is that it allows threads to use the flush operations, which could be more efficient for some usage models than the fine-grained remote completion proposed here.
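A rough sketch of that workaround, assuming the thread count NWIN is known by all processes before window creation (base, size_bytes, origin, and target are placeholders):

/* NWIN windows over the same buffer; window creation is collective,
 * so every process must create the same number of windows. */
MPI_Win wins[NWIN];
for (int t = 0; t < NWIN; ++t) {
    MPI_Win_create(base, size_bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &wins[t]);
    MPI_Win_lock_all(0, wins[t]);
}
#pragma omp parallel num_threads(NWIN)
{
    MPI_Win my_win = wins[omp_get_thread_num()];
    MPI_Put(origin, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, my_win);
    MPI_Win_flush(target, my_win);     /* completes only this thread's ops */
}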
That only works with a fixed number of threads known a priori; dynamically adding threads is impossible due to the collective nature of window creation. My squabble with flush (and why I like remote-completing rput/raccumulate) is that flush is blocking, potentially depending on remote progress. The problem with rput/raccumulate, as I mentioned above (and maybe what @jdinan refers to), is that each individual operation requires remote completion, which may be costly for large numbers of operations. To consolidate the two, a nonblocking flush would provide both: a request-based handle with coarse remote-completion semantics. I'm sure this has been discussed before, and I'm curious why it was rejected.
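Sketched as an API, that nonblocking flush might look like this; MPI_Win_iflush is hypothetical and does not exist in MPI today:

MPI_Request flush_req;
MPI_Win_iflush(target, win, &flush_req);    /* hypothetical nonblocking flush */
/* ... overlap computation; no blocking on remote progress ... */
MPI_Wait(&flush_req, MPI_STATUS_IGNORE);    /* coarse remote completion */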
It is only efficient with a fixed number of threads. Dynamically adding threads may require some threads to share a given window from a pool of windows created ahead of time. This could have performance consequences, but it would not affect correctness. What you really want is something like an OpenSHMEM context or maybe an aggregate handle. MPI made the unfortunate choice of tying memory registration/exposure together with the synchronization/memory models. We could introduce something like an …
If windows are long-lived, you could at any time do, idk: …
@tschuett In the RMA memory model, each window context/dup would have overlapping-window semantics, and the object you get back from …
Problem
How does one do remote completion in a multi-threaded application? It's impossible, because one cannot do a flush on one thread at the same time as an RMA op on another thread. This is not a theoretical problem, as it has been seen by users:
If we assert one can do remote completion in a multi-threaded application with the current features, then we need to add text to this effect, so that it's clear that Open-MPI is incorrectly blaming user programs. @hjelmn
Solution
Request-based remote completion, which I proposed a decade ago. This means we add the following functions, which take two request arguments: one for local completion and one for remote completion. For completeness, we should make it legal to pass MPI_REQUEST_NULL when these are not needed. The new functions would be (a usage sketch follows the list):
MPI_Rrput(..,MPI_REQUEST_NULL,MPI_REQUEST_NULL) behaves like MPI_Put.
MPI_Rrput(..,&request,MPI_REQUEST_NULL) behaves like MPI_Rput.
MPI_Rrput(..,MPI_REQUEST_NULL,&request) tracks remote completion through the request and can still be locally completed with a local flush.
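A usage sketch of the proposal, assuming MPI_Rrput takes MPI_Rput's arguments plus the two requests (the exact signature is not specified here, so names and argument order are illustrative):

/* Hypothetical: MPI_Rrput is the proposed function, not part of MPI. */
MPI_Request local_req, remote_req;
MPI_Rrput(origin, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win,
          &local_req, &remote_req);
MPI_Wait(&local_req, MPI_STATUS_IGNORE);    /* origin buffer may be reused */
/* ... */
MPI_Wait(&remote_req, MPI_STATUS_IGNORE);   /* data is visible at the target */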