we need pirate RMA for multithreaded use cases #23
Comments
Locks, fence, and PSCW from multiple threads don't work. However, I believe it is legal to use lockall right after window creation and then … This is not ideal, and I would like to see a way to do thread-scope synchronization (instead of today's process-scope). I had opened #13, but I'm sure we can have a more elegant solution than info keys and window duplication...
I like the idea of OSHMEM contexts. It could be an add-on to the existing APIs. But what is the meaning of "local" and "thread" with e.g. Argobots? What is the meaning of a thread when an Argobots thread migrates to a different OS thread?
@devreal If that's the case, then Open-MPI is broken for multithreaded RMA, and it needs to stop issuing an error about incorrect synchronization usage.
@jeffhammond is there an open issue for it in OMPI? What version of OMPI? A reproducer? I tried the following code and it works with both the generic and the UCX backend:

/* win is an existing window; size is the communicator size and
   NUM_REPS a repetition count, both defined elsewhere. */
MPI_Win_lock_all(0, win);
#pragma omp parallel
{
  #pragma omp for
  for (int k = 0; k < NUM_REPS; ++k) {
    uint64_t res;
    int target = k % size;
    uint64_t val = 1;
    /* concurrent atomic update, then a per-thread flush of that target */
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
} // omp parallel
MPI_Win_unlock_all(win);  /* end the passive-target epoch */
I never understood why request-based operations don't provide remote completion. I guess there were reasons over a decade ago... I think this would be a good addition to RMA. I'm not sure it can entirely solve the problem of thread-scope flushes, though, because tracking requests for large numbers of operations is potentially costly.
@tschuett The duplicated window handles I proposed can be generalized to single-threaded contexts, without the need for binding their resources to any particular thread and without blowing up the API like shmem contexts did.
@devreal The links above both report user problems. I guess you can reproduce with Kokkos Remote Spaces, but I haven't had time to do that.
@devreal Request-based remote completion was rejected because the hardware people didn't like it, and the use cases weren't strong.
But e.g. MPI_Alloc_Contexts(win, &contexts, 5); should be easy to add, and then MPI_Put_with_Context(win, contexts[3], ...);
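A minimal sketch of how that proposed API might be used, assuming a per-context flush; MPI_Context, MPI_Alloc_Contexts, MPI_Put_with_Context, and MPI_Context_flush are proposal names from this comment, not existing MPI functions, and buf/target are placeholders:

/* Hypothetical: all MPI_Context* calls below are proposal names only. */
MPI_Context *contexts;
MPI_Alloc_Contexts(win, &contexts, 5);          /* 5 contexts bound to win */
#pragma omp parallel num_threads(5)
{
    int me = omp_get_thread_num();
    /* Each thread drives its own context, so completion is thread-scoped. */
    MPI_Put_with_Context(win, contexts[me], buf, 1, MPI_DOUBLE,
                         target, 0, 1, MPI_DOUBLE);
    MPI_Context_flush(win, contexts[me]);       /* flush only this context */
}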
Another attempt: MPI 4.0 learned Partitioned Communication, i.e., multi-threaded message passing. There is already research on Partitioned Collectives, i.e., multi-threaded collectives. Maybe there is space for Partitioned RMA? The universe of threads goes through phases/epochs, and within an epoch there are no overlapping operations.
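For reference, this is roughly what MPI 4.0 partitioned point-to-point looks like; a Partitioned RMA would presumably mirror the same phase/epoch pattern. NUM_PARTS, COUNT_PER_PART, dest, tag, and fill_partition are placeholders:

MPI_Request req;
MPI_Psend_init(buf, NUM_PARTS, COUNT_PER_PART, MPI_DOUBLE, dest, tag,
               MPI_COMM_WORLD, MPI_INFO_NULL, &req);
MPI_Start(&req);                       /* begin the partitioned transfer */
#pragma omp parallel for
for (int p = 0; p < NUM_PARTS; ++p) {
    fill_partition(buf, p);            /* placeholder: produce partition p */
    MPI_Pready(p, req);                /* thread-safe per-partition completion */
}
MPI_Wait(&req, MPI_STATUS_IGNORE);
MPI_Request_free(&req);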
For the record: kokkos/kokkos-remote-spaces#51 is a non-issue; OMPI was correct in complaining when window locks are used in concurrent threads. Window locks are not thread-safe when concurrently accessing the same target. Still, the issue of multi-threaded RMA is real, and we should talk about whether we want explicit contexts and/or remote-completing request-based ops 👍
This isn't impossible. The solution today is to create a window for each thread, with the same window buffer. An advantage of this approach is that it allows threads to use the flush operations, which could be more efficient for some usage models than the fine-grained remote completion proposed here.
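A rough sketch of that workaround, assuming the thread count NWIN is known by all processes before window creation (base, size_bytes, origin, and target are placeholders):

/* NWIN windows over the same buffer; window creation is collective,
 * so every process must create the same number of windows. */
MPI_Win wins[NWIN];
for (int t = 0; t < NWIN; ++t) {
    MPI_Win_create(base, size_bytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &wins[t]);
    MPI_Win_lock_all(0, wins[t]);
}
#pragma omp parallel num_threads(NWIN)
{
    MPI_Win my_win = wins[omp_get_thread_num()];
    MPI_Put(origin, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, my_win);
    MPI_Win_flush(target, my_win);     /* completes only this thread's ops */
}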
That only works with a fixed number of threads known a priori; dynamically adding threads is impossible due to the collective nature of window creation. My squabble with flush (and why I like remote-completing rput/raccumulate) is that flush is blocking, potentially depending on remote progress. The problem with rput/raccumulate, as I mentioned above (and maybe what @jdinan refers to), is that each individual operation requires remote completion, which may be costly for large numbers of operations. To consolidate the two, a nonblocking flush would provide both: a request-based handle with coarse remote-completion semantics. I'm sure this has been discussed before, and I'm curious why it was rejected.
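Sketched as an API, that nonblocking flush might look like this; MPI_Win_iflush is hypothetical and does not exist in MPI today:

MPI_Request flush_req;
MPI_Win_iflush(target, win, &flush_req);    /* hypothetical nonblocking flush */
/* ... overlap computation; no blocking on remote progress ... */
MPI_Wait(&flush_req, MPI_STATUS_IGNORE);    /* coarse remote completion */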
It is only efficient with a fixed number of threads. Dynamically adding threads may require some threads to share a given window from a pool of windows created ahead of time. This could have performance consequences, but it would not affect correctness. What you really want is something like an OpenSHMEM context or maybe an aggregate handle. MPI made the unfortunate choice of tying memory registration/exposure together with the synchronization/memory models. We could introduce something like an …
If windows are long-lived, you could at any time do, idk: …
@tschuett In the RMA memory model, each window context/dup would have overlapping-window semantics, and the object you get back from …
Problem
How does one do remote completion in a multi-threaded application? It's impossible, because one cannot do a flush on one thread at the same time as an RMA op on another thread. This is not a theoretical problem, as it has been seen by users:
If we assert one can do remote completion in a multi-threaded application with the current features, then we need to add text to this effect, so that it's clear that Open-MPI is incorrectly blaming user programs. @hjelmn
Solution
Request-based remote completion, which I proposed a decade ago. This means we add the following functions, which take two request arguments: one for local completion and one for remote completion. For completeness, we should make it legal to pass MPI_REQUEST_NULL when these are not needed. The new functions would be (a usage sketch follows the list):
MPI_Rrput(..,MPI_REQUEST_NULL,MPI_REQUEST_NULL) behaves like MPI_Put.
MPI_Rrput(..,&request,MPI_REQUEST_NULL) behaves like MPI_Rput.
MPI_Rrput(..,MPI_REQUEST_NULL,&request) tracks remote completion through the request and can still be locally completed with a local flush.
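A usage sketch of the proposal, assuming MPI_Rrput takes MPI_Rput's arguments plus the two requests (the exact signature is not specified here, so names and argument order are illustrative):

/* Hypothetical: MPI_Rrput is the proposed function, not part of MPI. */
MPI_Request local_req, remote_req;
MPI_Rrput(origin, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win,
          &local_req, &remote_req);
MPI_Wait(&local_req, MPI_STATUS_IGNORE);    /* origin buffer may be reused */
/* ... */
MPI_Wait(&remote_req, MPI_STATUS_IGNORE);   /* data is visible at the target */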