[BUG] trace-agent: HTTPReceiver.recvsem has too few permits by default #31517

Open
bberg-indeed opened this issue Nov 27, 2024 · 0 comments

There's a critical section in the trace-agent receiver, beginning here, in which the trace-agent reads the body of the HTTP request, parses it, sends a response, and writes the extracted spans to the r.out channel.

Unless overridden with the apm_config.decoders parameter, the semaphore guarding the critical section, recvsem, is created with max(1, GOMAXPROCS / 2) permits. When running in Kubernetes, GOMAXPROCS is set to the container's CPU limit. Typically we run with a CPU limit of 2, because it's very rare for the trace agent to require more CPU than this in our environment. As a result, recvsem is created with a single permit.
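The sizing described above can be sketched as follows (a minimal illustration of the max(1, GOMAXPROCS / 2) formula; the function name is mine, not the agent's):

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultDecoders sketches how recvsem is sized when apm_config.decoders
// is not set: max(1, GOMAXPROCS / 2). Hypothetical helper for illustration.
func defaultDecoders(gomaxprocs int) int {
	n := gomaxprocs / 2
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(defaultDecoders(runtime.GOMAXPROCS(0)))
	// With a Kubernetes CPU limit of 2, GOMAXPROCS is 2, so:
	fmt.Println(defaultDecoders(2)) // 1 permit
}
```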

Because network I/O is unpredictable and happens inside the critical section, a single request can hold a permit for up to 5 seconds. This can happen if, e.g., the client application encounters GC pauses or throttling while sending the request. While a slow request holds the only permit, no other requests can be served.

Consequently, client applications often have payloads rejected with 429 responses when their requests time out waiting for a semaphore permit. While the fraction of payloads rejected this way is negligible, and a tiny fraction of what's intentionally discarded due to sampling, it is a source of confusion for our internal users, who may see errors about rejected payloads show up in their application logs.

I believe this problem could be substantially mitigated simply by increasing the default number of permits to, e.g., max(2, GOMAXPROCS / 2), or perhaps more.

Agent Environment
7.57.2, running in Kubernetes deployed with the Helm chart

Describe what happened:
The trace agent occasionally rejects valid payloads with 429 errors despite not being throttled or otherwise resource-constrained.

Describe what you expected:
The trace agent should not reject valid payloads unless starved for resources.

Steps to reproduce the issue:
Code inspection is probably sufficient to understand the issue. Otherwise, I suppose this would work:

  1. Run the trace agent with GOMAXPROCS = 2 (probably anything less than 4 would work)
  2. Have a client application begin submitting a payload, then hold the connection open without sending the rest of the body until the request times out. Meanwhile, submit trace payloads from other applications, or from other threads in the same application. Those requests should receive 429 responses.

Additional environment details (Operating System, Cloud provider, etc):
