[BUG] trace-agent: HTTPReceiver.recvsem has too few permits by default #31517

Open
bberg-indeed opened this issue Nov 27, 2024 · 0 comments

There's a critical section in the trace-agent receiver, beginning here, in which the trace-agent reads the body of the HTTP request, parses it, sends a response, and writes the extracted spans to the r.out channel.

Unless overridden with the apm_config.decoders parameter, the semaphore guarding the critical section, recvsem, is created with max(1, GOMAXPROCS / 2) permits. When running in Kubernetes, GOMAXPROCS is set to the container's CPU limit. Typically we run with a CPU limit of 2, because it's very rare for the trace agent to require more CPU than this in our environment. As a result, recvsem is created with a single permit.
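The sizing described above can be sketched as follows (a minimal illustration of the max(1, GOMAXPROCS / 2) formula; the function name is mine, not the agent's):

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultDecoders sketches how recvsem is sized when apm_config.decoders
// is not set: max(1, GOMAXPROCS / 2). Hypothetical helper for illustration.
func defaultDecoders(gomaxprocs int) int {
	n := gomaxprocs / 2
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(defaultDecoders(runtime.GOMAXPROCS(0)))
	// With a Kubernetes CPU limit of 2, GOMAXPROCS is 2, so:
	fmt.Println(defaultDecoders(2)) // 1 permit
}
```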

Because network I/O is unpredictable and happens inside the critical section, a single request can hold a permit for up to 5 seconds. This can happen if, e.g., the client application encounters GC pauses or throttling while sending the request. While a slow request holds the only permit, no other requests can be served.

Consequently, client applications often have payloads rejected with 429 responses when their requests time out waiting for a semaphore permit. While the fraction of payloads rejected this way is negligible, and a tiny fraction of what's intentionally discarded due to sampling, it is a source of confusion for our internal users, who may see errors about rejected payloads show up in their application logs.

I believe this problem could be substantially mitigated simply by increasing the default number of permits to, e.g., max(2, GOMAXPROCS / 2), or perhaps more.

Agent Environment
7.57.2, running in Kubernetes deployed with the Helm chart

Describe what happened:
The trace agent occasionally rejects valid payloads with 429 errors despite not being throttled or otherwise resource-constrained.

Describe what you expected:
The trace agent should not reject valid payloads unless starved for resources.

Steps to reproduce the issue:
Code inspection is probably sufficient to understand the issue. Otherwise, I suppose this would work:

  1. Run the trace agent with GOMAXPROCS = 2 (probably anything less than 4 would work)
  2. Have a client application begin submitting a payload, then hold the connection open without sending the rest of the body until the request times out. Meanwhile, submit trace payloads from other applications, or from other threads in the same application. Those requests should receive 429 responses.

Additional environment details (Operating System, Cloud provider, etc):
