-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failing or hanging tests #26
Comments
FWIW, I consistently see hangs of
But not when run on its own
|
I actually see the same, but in my case it's mostly |
Here's a backtrace (not tried with #30 yet)
|
I had a try building with |
It looks as if we're not progressing the worker, but it seems like we're on an endless loop of that. So we're not exactly deadlocked, but we're waiting for a message to complete. |
I'm stupid. #32 is supposed to resolve this, I forgot it wasn't merged yet. Can you check if it works with that PR? |
Yes, with #32 I don't get a hang, I just have (running the whole test suite) on failure:
|
Yes, that's why I couldn't merge #32 yet. 🙁 |
But not all the time :( |
Same for the PR, it happens just enough to make the PR less stable than what is currently running. |
Hmm, if I do some not very scientific printf-debugging (basically this diff):
I see that sometimes the |
Sometimes address santizer complains like this: (compiled with
|
Which kind of aligns with the hypothesis that there's no synchronisation between the test main thread and the one running the callbacks. That use-after-free looks like the main thread has got to the end of the test, deallocated the Note I have this diff in the test: diff --git a/cpp/tests/request.cpp b/cpp/tests/request.cpp
index 723dda7..81cfce4 100644
--- a/cpp/tests/request.cpp
+++ b/cpp/tests/request.cpp
@@ -221,6 +221,8 @@ TEST_P(RequestTest, TagUserCallback)
auto checkStatus = [&requests, &requestStatus](std::shared_ptr<void> data) {
auto idx = *std::static_pointer_cast<size_t>(data);
+ std::cout << "Running callback" << std::endl;
+ std::cout.flush();
if (requests[idx] == nullptr) {
/**
* Unfortunately, we can't check the status this way if the request completes
@@ -231,6 +233,7 @@ TEST_P(RequestTest, TagUserCallback)
*/
requestStatus[idx] = UCS_OK;
} else {
+ if (requests[idx]->getStatus() != UCS_OK) abort();
requestStatus[idx] = requests[idx]->getStatus();
}
};
@@ -242,12 +245,15 @@ TEST_P(RequestTest, TagUserCallback)
requests[0] = _ep->tagSend(_sendPtr[0], _messageSize, 0, false, checkStatus, sendIndex);
requests[1] = _ep->tagRecv(_recvPtr[0], _messageSize, 0, false, checkStatus, recvIndex);
waitRequests(_worker, requests, _progressWorker);
+ std::cout << "Waited" << std::endl;
+ std::cout.flush();
copyResults();
// Assert status was set before user callback is executed
- for (const auto status : requestStatus)
- ASSERT_THAT(status, UCS_OK);
+ // for (const auto status : requestStatus)
+ // if (status != UCS_OK) abort();
+ // ASSERT_THAT(status, UCS_OK);
for (const auto request : requests)
ASSERT_THAT(request->getStatus(), UCS_OK);
|
Based on your debug output it seems that you're right, interestingly this seems to occur only for |
I think what we need is to actually split the status the callback sees from what the user sees (request's result). By setting the request's status before calling the callback it may really be the case that the user will check the status before the callback runs. I'm thinking the better way is to switch back setting the status after the callback runs, and insert the status as part of the callback, similar to what UCP does ucp_send_nbx_callback_t that it receives the status and the user argument. Essentially we would change ucxx::Request::_callback by |
I think it makes sense to mimic |
Some of the tests may non-deterministically fail. They seem to be more prevalent in non-InfiniBand setups, such as CI.
C++ tests (some are possibly related to #15):
ProgressModes/WorkerProgressTest.ProgressTag/2
Python (unable to reproduce locally so far):
test_send_recv{,_multi}[{cupy,numba}]
The text was updated successfully, but these errors were encountered: