Helidon concurrency Limit - Semaphore not released during Broken Pipe exception from Jersey layer leading to all request failing with 503 #9442
Comments
Based on what we observed, the RuntimeException thrown when the timeout happens eventually triggers permit.dropped() higher up the stack (for example, in Http1Connection).
Resolving this through a timeout on the count down latch does not seem to be the right solution.
The details of the Jersey exception, and how the response writer is skipped, are described in eclipse-ee4j/jersey#5783.
I will count down the latch when
@logovind the linked PR should fix the issue. I saw on the linked Jersey issue that you are able to test this under load (I cannot easily reproduce the problem unless I start killing connections), so your testing would be a great help. Thanks
Hi @tomas-langer, we have separately tested the Helidon fix from PR #9460, without any changes to the Jersey code. It resolves the issue for Helidon use cases. Here is the new exception stack for the broken pipe: As we can see, JaxRsService$ReleaseLatchStream.close is called as part of the stack below and decrements the latch in a finally block. This ensures that await no longer blocks.
Thanks for checking this.
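For readers following along, here is a minimal sketch of the pattern described above: a stream wrapper whose close() counts the latch down in a finally block, so the waiting thread is released even when the underlying close fails with a broken pipe. The class name LatchReleasingStream and the demo helper are hypothetical and only illustrate the idea; this is not the actual code from the PR.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.CountDownLatch;

// Hypothetical illustration only; not Helidon's ReleaseLatchStream.
class LatchReleasingStream extends FilterOutputStream {
    private final CountDownLatch latch;

    LatchReleasingStream(OutputStream delegate, CountDownLatch latch) {
        super(delegate);
        this.latch = latch;
    }

    @Override
    public void close() throws IOException {
        try {
            super.close(); // may throw, e.g. "Broken pipe"
        } finally {
            latch.countDown(); // always release whoever is blocked in await()
        }
    }

    // Demo helper: closes over a delegate that fails on close and
    // returns the latch count afterwards.
    static long closeOverBrokenPipe() {
        CountDownLatch latch = new CountDownLatch(1);
        OutputStream broken = new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                throw new IOException("Broken pipe");
            }

            @Override
            public void close() throws IOException {
                throw new IOException("Broken pipe");
            }
        };
        try {
            new LatchReleasingStream(broken, latch).close();
        } catch (IOException expected) {
            // the I/O failure still propagates; only the latch handling matters here
        }
        return latch.getCount();
    }

    public static void main(String[] args) {
        System.out.println("latch count after failed close: " + closeOverBrokenPipe());
    }
}
```

The point of the finally block is that the latch no longer depends on commit or failure being reached on the response writer.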
Environment Details
We took the latest helidon-4.1.x code, which includes the fix for #9420 (AimdImpl not releasing the semaphore), to test the AIMD feature further.
We ran a test with the config below:
When running the test with 10 users, we started seeing 503 errors, and at times we see "Broken Pipe" exceptions.
When this happens, the server never recovers, and all requests continue to fail with 503 errors even after the load is stopped. When we checked further, we saw that the number of available permits is always 0 or -1 and stays at that value.
Below is the stack for broken pipe exception:
During this time, the thread dumps show a number of threads hung in the "io.helidon.microprofile.server.JaxRsService$JaxRsResponseWriter.await()" method. They do not time out and remain in the same state.
Call stack below for the thread on await:
Because JaxRsResponseWriter.await() waits indefinitely, permit.success() / permit.dropped() is never called, so the semaphore is never released. This causes the current issue.
We went through the Helidon code and found that Helidon uses a custom implementation of the container response writer (JaxRsResponseWriter), which relies on a CountDownLatch that is initialized to 1 and counted down to 0 in the commit and failure methods. If neither of these methods is called, await hangs.
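To make the failure mode concrete, here is a minimal sketch of a latch-based writer, assuming only the behavior described above. The class LatchWriterSketch and its awaitFor method are hypothetical stand-ins, not Helidon's JaxRsResponseWriter; a timed await is used purely so the demo terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the response writer described above.
class LatchWriterSketch {
    // Initialized to 1; only commit() or failure() bring it to 0.
    private final CountDownLatch latch = new CountDownLatch(1);

    void commit() {
        latch.countDown();
    }

    void failure(Throwable error) {
        latch.countDown();
    }

    // Returns true if the latch reached 0 within the given time.
    // An untimed await() would block forever in the "skipped" case below.
    boolean awaitFor(long millis) {
        try {
            return latch.await(millis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        LatchWriterSketch committed = new LatchWriterSketch();
        committed.commit();
        System.out.println("commit called, released: " + committed.awaitFor(100)); // true

        LatchWriterSketch skipped = new LatchWriterSketch();
        // Neither commit() nor failure() is called, as in the broken pipe case.
        System.out.println("writer skipped, released: " + skipped.awaitFor(100)); // false
    }
}
```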
The threads wait indefinitely in JaxRsResponseWriter.await(), which causes this issue. So instead of await(), we used the timed variant, await(long timeout, TimeUnit unit). Our KPI (the timeout in the configuration) is 2 seconds, and we tried a timeout of 5 seconds in await(long timeout, TimeUnit unit). After the timeout, an exception was thrown and propagated, and as a result permit.dropped() was invoked. This released the semaphore properly and we no longer see the issue.
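A rough sketch of the workaround we tried, under stated assumptions: a plain java.util.concurrent.Semaphore stands in for the concurrency limit, the class TimedAwaitSketch and its handle method are hypothetical, and the status codes are illustrative only. It is not the Helidon code; it just shows how an exception raised after a timed await lets the caller give the permit back.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the workaround; names and status codes are illustrative.
class TimedAwaitSketch {
    private final Semaphore permits;

    TimedAwaitSketch(int limit) {
        this.permits = new Semaphore(limit);
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    // Returns 503 when no permit is available, 500 when the response
    // latch is not released in time, and 200 otherwise.
    int handle(CountDownLatch responseLatch, long timeoutMillis) {
        if (!permits.tryAcquire()) {
            return 503;
        }
        try {
            awaitResponse(responseLatch, timeoutMillis);
            return 200;
        } catch (RuntimeException e) {
            return 500; // plays the role of permit.dropped() being reached
        } finally {
            permits.release(); // the permit is returned on both paths
        }
    }

    private static void awaitResponse(CountDownLatch latch, long timeoutMillis) {
        try {
            if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                throw new IllegalStateException("Timed out waiting for the response");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        TimedAwaitSketch limit = new TimedAwaitSketch(1);
        // The latch is never counted down, as when the response writer is skipped.
        System.out.println("status: " + limit.handle(new CountDownLatch(1), 50));
        System.out.println("permits afterwards: " + limit.availablePermits());
    }
}
```

With an untimed await the first call would never return and the single permit would stay taken, which matches the permanent 503 behavior described above.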
https://github.com/helidon-io/helidon/blob/helidon-4.1.x/microprofile/server/src/main/java/io/helidon/microprofile/server/JaxRsService.java
JaxRsResponseWriter.await() in JaxRsService.java:
Await Caller method – JaxRsService.doHandle():