Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Helidon concurrency Limit - Semaphore not released during Broken Pipe exception from Jersey layer leading to all request failing with 503 #9442

Closed
logovind opened this issue Oct 29, 2024 · 8 comments · Fixed by #9460
Assignees
Labels
4.x Version 4.x bug Something isn't working P2 webserver

Comments

@logovind
Copy link

logovind commented Oct 29, 2024

Environment Details

  • Helidon Version: Helidon-4.1.x
  • Helidon SE or Helidon MP: SE & MP
  • JDK version: Java-21
  • OS: Mac OS
  • Docker version (if applicable):

We got the latest helidon-4.1.x code which has fix for Jira #9420( AimdImpl not releasing the semaphore) to test the AIMD feature further.

We tried out test with below config-

concurrency-limit:
  aimd:
    min-limit: 5
    max-limit: 10
    initial-limit: 5
    timeout: "PT2S"
    backoff-ratio: 0.8

When running test with 10 users, started seeing 503 errors and, we are seeing at time “Broken Pipe” exceptions.
When this happens, the server never recovers, and all requests continue to fail with 503 errors even after the load is stopped. When we checked further, we see the available permits is always 0 or -1 and remains at that value.

Below is the stack for broken pipe exception:

java.io.UncheckedIOException: java.net.SocketException: Broken pipe
at io.helidon.common.buffers.GrowingBufferData.writeTo(GrowingBufferData.java:69)
at io.helidon.common.socket.PlainSocket.write(PlainSocket.java:136)
at io.helidon.common.socket.SocketWriter.writeNow(SocketWriter.java:81)
at io.helidon.common.socket.SocketWriterDirect.write(SocketWriterDirect.java:43)
at io.helidon.webserver.http1.Http1ServerResponse$BlockingOutputStream.write(Http1ServerResponse.java:631)
at io.helidon.webserver.http1.Http1ServerResponse$BlockingOutputStream.write(Http1ServerResponse.java:505)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:125)
at java.base/java.io.BufferedOutputStream.implFlush(BufferedOutputStream.java:252)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:240)
at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:184)
at io.helidon.webserver.http1.Http1ServerResponse$ClosingBufferedOutputStream.close(Http1ServerResponse.java:801)
at org.glassfish.jersey.message.internal.CommittingOutputStream.close(CommittingOutputStream.java:251)
at org.glassfish.jersey.message.internal.OutboundMessageContext.close(OutboundMessageContext.java:568)
at org.glassfish.jersey.server.ContainerResponse.close(ContainerResponse.java:403)
at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:763)
at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:398)
at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:388)
at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:277)
-----
-----
at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266)
at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:253)
at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:696)
at io.helidon.microprofile.server.JaxRsService.doHandle(JaxRsService.java:234)
at io.helidon.microprofile.server.JaxRsService.lambda$handle$2(JaxRsService.java:185)
at io.helidon.common.context.Contexts.runInContext(Contexts.java:117)
at io.helidon.microprofile.server.JaxRsService.handle(JaxRsService.java:185)
at io.helidon.webserver.http.HttpRoutingImpl$RoutingExecutor.doRoute(HttpRoutingImpl.java:169)
at io.helidon.webserver.http.HttpRoutingImpl$RoutingExecutor.call(HttpRoutingImpl.java:124)
at io.helidon.webserver.http.HttpRoutingImpl$RoutingExecutor.call(HttpRoutingImpl.java:102)
-----
-----
at io.helidon.webserver.http.Filters.filter(Filters.java:83)
at io.helidon.webserver.http.HttpRoutingImpl.route(HttpRoutingImpl.java:73)
at io.helidon.webserver.http1.Http1Connection.route(Http1Connection.java:533)
at io.helidon.webserver.http1.Http1Connection.handle(Http1Connection.java:223)
at io.helidon.webserver.ConnectionHandler.run(ConnectionHandler.java:169)
at io.helidon.common.task.InterruptableTask.call(InterruptableTask.java:47)
at io.helidon.webserver.ThreadPerTaskExecutor$ThreadBoundFuture.run(ThreadPerTaskExecutor.java:239)
at java.base/java.lang.VirtualThread.run(VirtualThread.java:329)

During this time, In the thread dumps, we do see that bunch of threads are hung on “io.helidon.microprofile.server.JaxRsService$JaxRsResponseWriter.await()” method. They do not time out and remined in the same state.

Call stack below for the thread on await:

java.base/java.lang.VirtualThread.park(VirtualThread.java:596)
      java.base/java.lang.System$2.parkVirtualThread(System.java:2643)
      java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
      java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
      java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1099)
      java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
      io.helidon.microprofile.server.JaxRsService$JaxRsResponseWriter.await(JaxRsService.java:421)
      io.helidon.microprofile.server.JaxRsService.doHandle(JaxRsService.java:235)
      io.helidon.microprofile.server.JaxRsService.lambda$handle$2(JaxRsService.java:185)
      io.helidon.common.context.Contexts.runInContext(Contexts.java:117)
      io.helidon.microprofile.server.JaxRsService.handle(JaxRsService.java:185)
      io.helidon.webserver.http.HttpRoutingImpl$RoutingExecutor.doRoute(HttpRoutingImpl.java:169)
      io.helidon.webserver.http.HttpRoutingImpl$RoutingExecutor.call(HttpRoutingImpl.java:124)
      io.helidon.webserver.http.HttpRoutingImpl$RoutingExecutor.call(HttpRoutingImpl.java:102)
      ----
      ------
      io.helidon.webserver.http.HttpRoutingImpl.route(HttpRoutingImpl.java:73)
      io.helidon.webserver.http1.Http1Connection.route(Http1Connection.java:533)
      io.helidon.webserver.http1.Http1Connection.handle(Http1Connection.java:223)
      io.helidon.webserver.ConnectionHandler.run(ConnectionHandler.java:169)
      io.helidon.common.task.InterruptableTask.call(InterruptableTask.java:47)
      io.helidon.webserver.ThreadPerTaskExecutor$ThreadBoundFuture.run(ThreadPerTaskExecutor.java:239)
      java.base/java.lang.VirtualThread.run(VirtualThread.java:329)

As the JaxRsResponseWriter.await() method waiting indefinitely, the permit.success() / permit.dropped() is never get called, so the semaphore never get released. This causes the current issue.

We went through the helidon code, found that Helidon using custom implementation of the container response writer(JAXRSResponseWriter) which is using a CountdownLatch that is initialized with the value of 1 and decremented to 0 on commit and failure methods. If these methods do not get called then await will hang.

The threads are indefinitely waiting in the JaxRsResponseWriter.await() causing this issue. So instead of using await(), we used timeout implementation of await(long timeout, TimeUnit unit). In this case our KPI(timeout in configuration-2secs) and we tried timeout of 5secs in await(long timeout, TimeUnit unit). After the timeout, the permit.dropped() was invoked as a result of the exception thrown and propagated. This in turn helped to release the semaphore properly and we no longer see the issue.

https://github.com/helidon-io/helidon/blob/helidon-4.1.x/microprofile/server/src/main/java/io/helidon/microprofile/server/JaxRsService.java

JaxRsResponseWriter.await() in JaxRsService.java:

public void await(Long timeout,TimeUnit timeunit) { // modified the method to take a timeout values
    boolean await = false;
    try {
        await = cdl.await(timeout,timeunit);
    } catch (InterruptedException e) {
        throw new RuntimeException("Failed to wait for Jersey to write response");
    }
    if(!await){
        throw new RuntimeException("Failed to wait for Jersey to write response within timeout-->"+timeout);
    }
   
}

Await Caller method – JaxRsService.doHandle():

private void doHandle(Context ctx, ServerRequest req, ServerResponse res) { 
      
        JaxRsResponseWriter writer = new JaxRsResponseWriter(res);
        requestContext.setWriter(writer);
        ------
  -------
        try {
            Long timeout = Long.valueOf("5"); 
            kpiMetricsContext.ifPresent(KeyPerformanceIndicatorSupport.DeferrableRequestContext::requestProcessingStarted);
            writer.await(timeout, TimeUnit.SECONDS); // modified the caller to pass the timeout value
                  -----
                  }
              -----
            }
@vasanth-bhat
Copy link

Make sure to call permit.dropped() in the exception handling logic to release the semaphore.

Based on what we observed that the RuntimeException thrown when timeout happens, eventually calls the permit.dropped() in the upper stack ( For ex : in Http1Connection )

@barchetta barchetta added webserver 4.x Version 4.x bug Something isn't working P2 labels Oct 31, 2024
@barchetta barchetta moved this from Triage to High priority in Backlog Oct 31, 2024
@barchetta barchetta added this to the 4.1.4 milestone Oct 31, 2024
@tomas-langer tomas-langer moved this from High priority to Sprint Scope in Backlog Oct 31, 2024
@tomas-langer
Copy link
Member

Resolving this through timeout on the count down latch does not seem to be the right solution.
We should be notified by Jersey if there is a problem - and then we count down. Maybe some exceptions are not handled the way they should, and the correct method on the writer is not invoked. I will try to write a reproducer, if possible, so I can discover the reason the latch is not counted down

@vasanth-bhat
Copy link

The details of jersey exception , and how Response writer is skipped is mentioned it eclipse-ee4j/jersey#5783.
As a resiliency , it is good to have a configurable timeout for await() on Helidon implementation of Response writer

@tomas-langer
Copy link
Member

I will count down the latch when close is called on the output stream - this will resolve the problem described above.
Adding a timeout on operations on the "critical path" always has a performance impact. The fix is working around that problem.
As closing a stream is required, the close will always be called, and the latch will always count down.

@tomas-langer
Copy link
Member

@logovind the linked PR should fix the issue. I have seen on the linked Jersey issue that you should be capable of testing the issue under load (I cannot easily reproduce the problem, unless I start killing connections), so this would be a great help.

Thanks

@logovind
Copy link
Author

logovind commented Nov 4, 2024

Hi @tomas-langer ,

We have separately tested the Helidon fix from PR - #9460. This is without any changes to Jersey code.

This is resolving the issue for Helidon use cases.

Here is the new exception stack for broken pipe:

As we can see the JaxRsService$ReleaseLatchStream.close getting called as part of below stack and decrements the latch in finally block. This ensures that await no longer blocks.

Nov 04, 2024 5:41:38 PM org.glassfish.jersey.server.ServerRuntime$Responder writeResponse
SEVERE: Error while closing the output stream in order to commit response.
io.helidon.webserver.ServerConnectionException: Failed to close server output stream
      at io.helidon.webserver.http1.Http1ServerResponse$ClosingBufferedOutputStream.close(Http1ServerResponse.java:848)
      at io.helidon.microprofile.server.JaxRsService$ReleaseLatchStream.close(JaxRsService.java:431)
      at org.glassfish.jersey.message.internal.CommittingOutputStream.close(CommittingOutputStream.java:251)
      at org.glassfish.jersey.message.internal.OutboundMessageContext.close(OutboundMessageContext.java:568)
      at org.glassfish.jersey.server.ContainerResponse.close(ContainerResponse.java:403)
      at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:763)
      at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:398)
      at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:388)
      at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:277)
      at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
      at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
      at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
      at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
      at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
      at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266)
      at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:253)
      at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:696)
      at io.helidon.microprofile.server.JaxRsService.doHandle(JaxRsService.java:228)
      at io.helidon.microprofile.server.JaxRsService.lambda$handle$2(JaxRsService.java:185)
      at io.helidon.common.context.Contexts.runInContext(Contexts.java:117)
      at io.helidon.microprofile.server.JaxRsService.handle(JaxRsService.java:185)
      at io.helidon.webserver.http.HttpRoutingImpl$RoutingExecutor.doRoute(HttpRoutingImpl.java:169)
      at io.helidon.webserver.http.HttpRoutingImpl$RoutingExecutor.call(HttpRoutingImpl.java:124)
      at io.helidon.webserver.http.HttpRoutingImpl$RoutingExecutor.call(HttpRoutingImpl.java:102)
      at io.helidon.webserver.http.ErrorHandlers.runWithErrorHandling(ErrorHandlers.java:76)
      at io.helidon.webserver.http.Filters$FilterChainImpl.proceed(Filters.java:121)
      at io.helidon.webserver.observe.metrics.MetricsFeature.lambda$configureVendorMetrics$2(MetricsFeature.java:90)
      at io.helidon.webserver.http.Filters$FilterChainImpl.proceed(Filters.java:119)
      at io.helidon.webserver.security.SecurityContextFilter.filter(SecurityContextFilter.java:88)
      at io.helidon.webserver.http.Filters$FilterChainImpl.proceed(Filters.java:119)
      at io.helidon.webserver.accesslog.AccessLogHttpFeature.filter(AccessLogHttpFeature.java:86)
      at io.helidon.webserver.http.Filters$FilterChainImpl.proceed(Filters.java:119)
      at io.helidon.common.context.Contexts.runInContext(Contexts.java:117)
      at io.helidon.webserver.context.ContextRoutingFeature.filter(ContextRoutingFeature.java:50)
      at io.helidon.webserver.http.Filters$FilterChainImpl.proceed(Filters.java:119)
      at io.helidon.webserver.http.Filters.executeFilters(Filters.java:87)
      at io.helidon.webserver.http.Filters.lambda$filter$0(Filters.java:83)
      at io.helidon.webserver.http.ErrorHandlers.runWithErrorHandling(ErrorHandlers.java:76)
      at io.helidon.webserver.http.Filters.filter(Filters.java:83)
      at io.helidon.webserver.http.HttpRoutingImpl.route(HttpRoutingImpl.java:73)
      at io.helidon.webserver.http1.Http1Connection.route(Http1Connection.java:533)
      at io.helidon.webserver.http1.Http1Connection.handle(Http1Connection.java:223)
      at io.helidon.webserver.ConnectionHandler.run(ConnectionHandler.java:169)
      at io.helidon.common.task.InterruptableTask.call(InterruptableTask.java:47)
      at io.helidon.webserver.ThreadPerTaskExecutor$ThreadBoundFuture.run(ThreadPerTaskExecutor.java:239)
      at java.base/java.lang.VirtualThread.run(VirtualThread.java:329)
Caused by: java.io.UncheckedIOException: java.net.SocketException: Broken pipe
      at io.helidon.common.buffers.GrowingBufferData.writeTo(GrowingBufferData.java:69)
      at io.helidon.common.socket.PlainSocket.write(PlainSocket.java:136)
      at io.helidon.common.socket.SocketWriter.writeNow(SocketWriter.java:81)
      at io.helidon.common.socket.SocketWriterDirect.write(SocketWriterDirect.java:43)
      at io.helidon.webserver.http1.Http1ServerResponse$BlockingOutputStream.write(Http1ServerResponse.java:676)
      at io.helidon.webserver.http1.Http1ServerResponse$BlockingOutputStream.write(Http1ServerResponse.java:547)
      at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:125)
      at java.base/java.io.BufferedOutputStream.implFlush(BufferedOutputStream.java:252)
      at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:240)
      at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:184)
      at io.helidon.webserver.http1.Http1ServerResponse$ClosingBufferedOutputStream.close(Http1ServerResponse.java:846)
      ... 45 more
Caused by: java.net.SocketException: Broken pipe
      at java.base/sun.nio.ch.SocketDispatcher.write0(Native Method)
      at java.base/sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:62)
      at java.base/sun.nio.ch.NioSocketImpl.tryWrite(NioSocketImpl.java:394)
      at java.base/sun.nio.ch.NioSocketImpl.implWrite(NioSocketImpl.java:410)
      at java.base/sun.nio.ch.NioSocketImpl.write(NioSocketImpl.java:440)
      at java.base/sun.nio.ch.NioSocketImpl$2.write(NioSocketImpl.java:819)
      at java.base/java.net.Socket$SocketOutputStream.write(Socket.java:1195)
      at io.helidon.common.buffers.GrowingBufferData.writeTo(GrowingBufferData.java:66)
      ... 55 more

@tomas-langer
Copy link
Member

Thanks for checking this.
I am still not thrilled with the SEVER log statement in case of broken pipe, but that is a Jersey thing.

@github-project-automation github-project-automation bot moved this from Sprint Scope to Closed in Backlog Nov 4, 2024
@barchetta barchetta removed this from the 4.1.4 milestone Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4.x Version 4.x bug Something isn't working P2 webserver
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants
@barchetta @tomas-langer @vasanth-bhat @logovind and others