[WIP] Uncomment the use of MPI_Win_lock/MPI_Win_unlock #51
Conversation
@jeffhammond - Jeff, is this the correct usage pattern for the sequence of lock/unlock and flush? The code is multi-threaded. I am seeing the error described above.
This is definitely wrong:

```c
MPI_Win_lock(MPI_LOCK_SHARED, pe, 0, win);                                 \
MPI_Put(&val, 1, mpi_type, pe,                                            \
        sizeof(SharedAllocationHeader) + offset * _typesize, 1, mpi_type, \
        win);                                                             \
MPI_Win_unlock(pe, win);                                                  \
MPI_Win_flush(pe, win);
```

Unlock already does a flush, and a flush outside of a lock epoch is wrong too. The correct pattern for RMA is to do lock_all immediately after allocate and unlock_all immediately prior to free, and to use only flush or flush_local to complete RMA operations.
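For reference, a minimal sketch of that pattern (the window size, datatype, and variable names are illustrative assumptions, not code from this PR):

```c
#include <mpi.h>

void example(MPI_Comm comm, int pe)
{
    double *base;
    MPI_Win win;

    /* Allocate the window, then immediately open a passive-target epoch
     * covering all ranks for the lifetime of the window. */
    MPI_Win_allocate(1024 * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, comm, &base, &win);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    double val = 42.0;
    MPI_Put(&val, 1, MPI_DOUBLE, pe, /* target_disp = */ 0, 1, MPI_DOUBLE, win);

    /* Complete the Put at the target without closing the epoch. */
    MPI_Win_flush(pe, win);

    /* Close the epoch only immediately before freeing the window. */
    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
}
```

Keeping the epoch open for the window's lifetime means a flush is legal at any point, with no per-operation lock/unlock.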
cf9eb3b should be correct, and if an MPI implementation errors with that, I think the implementation is wrong.
The problem is due to threads. See https://www.mail-archive.com/[email protected]/msg30676.html. Using request-based synchronization is the correct change here, and shouldn't be too hard.
You need to figure out a way to make this thread-safe:
I will send you a pull-request for flush->wait, but it isn't exactly equivalent: you had put+flush, which does remote completion, whereas rput+wait only does local completion. There is no request-based remote completion (I proposed it but it didn't go anywhere), so you'll have to make sure that all KRS code uses KRS::fence to do remote completion of rput ops.
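A rough sketch of that distinction (the function names and parameters here are assumptions for illustration, not code from the PR):

```c
#include <mpi.h>

/* rput + wait: MPI_Wait completes the request LOCALLY, so the origin
 * buffer may be reused, but the data is not guaranteed to have arrived
 * at the target yet. */
void put_with_local_completion(double *val, int pe, MPI_Aint disp, MPI_Win win)
{
    MPI_Request req;
    MPI_Rput(val, 1, MPI_DOUBLE, pe, disp, 1, MPI_DOUBLE, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}

/* A KRS::fence-style operation would still need a flush to get REMOTE
 * completion of all outstanding rput ops on the window. */
void fence(MPI_Win win)
{
    MPI_Win_flush_all(win);
}
```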
correct fix for kokkos#51, related to https://www.mail-archive.com/[email protected]/msg30676.html
Signed-off-by: Jeff Hammond <[email protected]>
@janciesko I cannot reproduce this issue with OMPI 5.0.x. Which version of OMPI are you using?
OMPI 4.0.5
Force-pushed from 5bd4123 to 1419359
Force-pushed from 1419359 to 2f7c60b
Updated the PR to remove the wrong use. @jeffhammond, I had mistakenly reintroduced the wrong code after your previous PR. Sorry about the confusion. The code as in this PR now works. Does the issue (mpiwg-rma/rma-issues#23) still apply? In this case, we're still calling RMA ops and flush concurrently.
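To make the question concrete, a minimal illustration of the concurrency pattern in question — one thread issuing an RMA op while another flushes the same window under MPI_THREAD_MULTIPLE. All names and the window setup are assumptions for the sake of the example, not code from this PR:

```c
#include <mpi.h>
#include <pthread.h>

static MPI_Win win;
static double *base;
static double val = 1.0;

static void *writer(void *arg)   /* thread 1: issues an RMA op */
{
    MPI_Put(&val, 1, MPI_DOUBLE, 0, 0, 1, MPI_DOUBLE, win);
    return NULL;
}

static void *flusher(void *arg)  /* thread 2: flushes concurrently */
{
    MPI_Win_flush(0, win);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    /* Both threads operate on the same window at the same time. */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, flusher, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```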
Retest this please
Superseded by #53
Restores the use of MPI_Win_lock/MPI_Win_unlock.
This currently fails the unit test as follows:
Reproducer: