There's a critical section in the trace-agent receiver, beginning here, in which the trace-agent reads the body of the HTTP request, parses it, sends a response, and writes the extracted spans to the r.out channel.
Unless overridden with the apm_config.decoders parameter, the semaphore guarding the critical section, recvsem, is created with max(1, GOMAXPROCS / 2) permits. When running in Kubernetes, GOMAXPROCS is set to the container's CPU limit. Typically we run with a CPU limit of 2, because it's very rare for the trace agent to require more CPU than this in our environment. As a result, recvsem is created with a single permit.
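For illustration, the permit count works out like this. This is a minimal sketch of the sizing behavior as I understand it, not the actual datadog-agent source:

```go
// Sketch of the sizing behavior described above -- not the real code.
// With GOMAXPROCS = 2 (a CPU limit of 2), the receiver ends up with a
// single permit unless apm_config.decoders is set explicitly.
package sketch

import "runtime"

func receiverPermits(decodersOverride int) int {
	if decodersOverride > 0 {
		return decodersOverride // explicit apm_config.decoders value wins
	}
	n := runtime.GOMAXPROCS(0) / 2
	if n < 1 {
		n = 1
	}
	return n // max(1, GOMAXPROCS / 2)
}
```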
Because network I/O is unpredictable and happens inside the critical section, a single request can hold a permit for up to 5 seconds. This can happen if, e.g., the client application hits GC pauses or CPU throttling while sending the request. While a slow request is holding the only permit, no other requests can be served.
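For readers unfamiliar with the receiver, the shape of the problem is roughly the following. This is a hedged sketch, not the trace-agent code: the channel-as-semaphore, the wait mechanism, and the handler name are assumptions; only the 5-second wait, the 429, and the read/parse/respond/out sequence mirror what's described above.

```go
package sketch

import (
	"io"
	"net/http"
	"time"
)

type receiver struct {
	recvsem chan struct{} // buffered channel used as a counting semaphore
	out     chan []byte   // stand-in for the span output channel
}

func (r *receiver) handleTraces(w http.ResponseWriter, req *http.Request) {
	select {
	case r.recvsem <- struct{}{}: // got a permit
	case <-time.After(5 * time.Second):
		// Every permit is held, possibly by a single slow client.
		http.Error(w, "too many requests", http.StatusTooManyRequests) // 429
		return
	}
	defer func() { <-r.recvsem }() // release the permit when done

	// Critical section: the body read is network I/O, so a client that
	// stalls mid-request holds the permit for the duration of the stall.
	body, err := io.ReadAll(req.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK) // respond before handing spans downstream
	r.out <- body                // stand-in for "write extracted spans to r.out"
}
```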
As a result, client applications regularly have payloads rejected with 429 responses when their requests time out waiting for a semaphore permit. The fraction of payloads rejected this way is negligible, and tiny compared to what's intentionally discarded by sampling, but it is a source of confusion for our internal users, who see errors about rejected payloads show up in their application logs.
I believe this problem could be mitigated substantially simply by increasing the default number of permits to, e.g., max(2, GOMAXPROCS / 2), or perhaps more.
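In other words, something along these lines. This is just a sketch of the suggested default, not a patch against the actual code (uses Go 1.21's built-in max):

```go
package sketch

import "runtime"

// currentDefault mirrors the existing sizing: a 2-CPU container gets one permit.
func currentDefault() int { return max(1, runtime.GOMAXPROCS(0)/2) }

// suggestedDefault guarantees at least two permits, so a single stalled
// client can no longer block the receiver entirely.
func suggestedDefault() int { return max(2, runtime.GOMAXPROCS(0)/2) }
```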
Agent Environment
7.57.2, running in Kubernetes deployed with the Helm chart
Describe what happened:
The trace agent occasionally rejects valid payloads with 429 errors despite not being throttled or otherwise resource-constrained.
Describe what you expected:
The trace agent should not reject valid payloads unless starved for resources.
Steps to reproduce the issue:
Code inspection is probably sufficient to understand the issue. Otherwise, I suppose this would work:
Run the trace agent with GOMAXPROCS = 2 (probably anything less than 4 would work)
Have a client application begin submitting a payload, then hold the connection open without sending the rest of the body until it times out, while other applications (or other threads in the same application) submit trace payloads concurrently. Those concurrent submissions should receive 429 responses. A rough sketch of the stalling client follows below.
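For step 2, a rough sketch of the stalling client. Assumptions: the default trace-agent port 8126 and the v0.4 traces endpoint; while this runs, submit traces normally from another process and watch for 429s:

```go
package main

import (
	"io"
	"net/http"
	"time"
)

// stallingBody blocks on Read until the deadline, then returns EOF. This
// simulates a client that opens the request but never finishes the body
// (e.g. stuck in a GC pause or CPU-throttled mid-send).
type stallingBody struct{ deadline time.Time }

func (b *stallingBody) Read(p []byte) (int, error) {
	if remaining := time.Until(b.deadline); remaining > 0 {
		time.Sleep(remaining)
	}
	return 0, io.EOF
}

func main() {
	// Default trace-agent address and v0.4 traces endpoint; adjust as needed.
	req, err := http.NewRequest(http.MethodPost, "http://localhost:8126/v0.4/traces",
		&stallingBody{deadline: time.Now().Add(10 * time.Second)})
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/msgpack")

	// The request headers go out immediately; the handler then blocks reading
	// the body, holding the receiver's only permit for ~10 seconds.
	resp, err := http.DefaultClient.Do(req)
	if err == nil {
		resp.Body.Close()
	}
}
```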
Additional environment details (Operating System, Cloud provider, etc):