REQMOD stuck when adapted request (body) is not forwarded #1966

Conversation

@rousskov (Contributor) commented Dec 19, 2024

Squid forwards request bodies using BodyPipe objects. A BodyPipe object
has two associated agents: producer and consumer. Those agents are set
independently, at different processing stages. If BodyPipe consumer is
not set, the producer may get stuck waiting for BodyPipe buffer space.
When producer creates a BodyPipe, it effectively relies on some code
somewhere to register a consumer (or declare a registration failure),
but automatically tracking that expectation fulfillment is impractical.
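
To make that contract concrete, here is a minimal, self-contained C++ sketch of a bounded pipe whose two agents are registered independently. All names in it (ToyBodyPipe, setConsumer(), expectNoConsumption()) are hypothetical illustrations of the pattern, not Squid's actual BodyPipe API:

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Illustrative stand-in for the producer/consumer contract described
// above; names and methods are hypothetical, not Squid's BodyPipe API.
class ToyBodyPipe {
public:
    explicit ToyBodyPipe(std::size_t capacity): capacity_(capacity) {}

    // producer side: returns false when the buffer is full; the real
    // producer would then wait for the consumer to free space
    bool produce(const std::string &data) {
        if (buffer_.size() + data.size() > capacity_)
            return false;
        buffer_ += data;
        return true;
    }

    // consumer side, registered independently (and possibly never)
    void setConsumer() { hasConsumer_ = true; }

    // the explicit "no consumer will ever come" signal that lets a
    // waiting producer give up instead of waiting forever
    void expectNoConsumption() { consumptionAborted_ = true; }

    // consumer frees buffer space, unblocking the producer
    void consume(std::size_t n) { buffer_.erase(0, n); }

    // the stuck state: nobody consumes, and nobody has declared that
    // consumption will never start
    bool stuckWaitingForConsumer() const {
        return !hasConsumer_ && !consumptionAborted_;
    }

private:
    std::string buffer_;
    std::size_t capacity_;
    bool hasConsumer_ = false;
    bool consumptionAborted_ = false;
};

int main() {
    ToyBodyPipe pipe(4);
    while (pipe.produce("x")) {} // fill the pipe; the producer now blocks
    if (pipe.stuckWaitingForConsumer())
        std::cout << "producer waits, but nobody registered a consumer\n";
    pipe.expectNoConsumption(); // declare the registration failure
    if (!pipe.stuckWaitingForConsumer())
        std::cout << "producer can now abort instead of waiting forever\n";
}
```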

For REQMOD transactions involving adapted request bodies, including ICAP
204 transactions, Client::startRequestBodyFlow() sets body consumer. If
that method is not called, there will be no consumer, and REQMOD may get
stuck. Many `if` statements can block request forwarding altogether or
block a being-forwarded request from reaching that method. For example,
adapted_http_access and miss_access denials block request forwarding.
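
Continuing the hypothetical ToyBodyPipe sketch above, the failure mode is roughly this: when forwarding is denied, the registration that Client::startRequestBodyFlow() would have performed never happens, so some code on the denial path must explicitly declare that no consumer is coming. This is an illustration of the control flow, not Squid's actual forwarding code:

```cpp
// Reuses ToyBodyPipe from the sketch above; illustration only.
void forwardAdaptedRequest(ToyBodyPipe &adaptedBody, bool denied)
{
    if (denied) {
        // adapted_http_access or miss_access denial: forwarding stops
        // before the Client::startRequestBodyFlow() analogue runs, so
        // without the next line the REQMOD producer waits forever
        adaptedBody.expectNoConsumption();
        return;
    }

    adaptedBody.setConsumer(); // Client::startRequestBodyFlow() analogue
}
```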

Without REQMOD, request processing can probably get stuck for similar
lack-of-consumer reasons, but regular requests ought to be killed by
various I/O or forwarding timeouts. There are no such timeouts for those
REQMOD transactions that are only waiting for request body consumer to
clear adapted BodyPipe space (e.g., after all ICAP 204 I/Os are over).
Relying on timeouts is also very inefficient.

For a `mgr:mem` observer, stuck REQMOD transactions look like a ModXact
memory leak. A `mgr:jobs` report shows ModXact jobs with RBS(1) status.

----

Cherry-picked SQUID-1030-ModXact-stuck-without-consumer-bag65 changes as
of commit 37cde62014bcc985e05eb6e985365417838564ba (effectively).
@rousskov (Contributor, Author) commented Dec 19, 2024

The bug was discovered in v6-based code. It can be reproduced in a lab using a 204-returning ICAP service by filling request BodyPipe buffer while denying that request forwarding using adapted_http_access or miss_access. The fix this PR is based on was tested using the affected v6-based deployment.
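
For concreteness, a lab configuration along those lines might look like the squid.conf sketch below. The service name and ICAP URL are placeholders, the ICAP service must be one that returns 204 responses, and a real repro also needs a request with a body large enough to fill the BodyPipe buffer:

```
# hypothetical squid.conf fragment for reproducing the stuck REQMOD
icap_enable on
icap_service svcReq reqmod_precache icap://icap.example.net:1344/reqmod
adaptation_access svcReq allow all

# deny forwarding of the adapted request so that
# Client::startRequestBodyFlow() never registers a body consumer
adapted_http_access deny all
# (a miss_access denial triggers the same condition)
```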

Implementing the added TODO requires significant and risky changes outside this bug fix's scope.

@rousskov rousskov added M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels S-could-use-an-approval An approval may speed this PR merger (but is not required) labels Dec 19, 2024
@yadij yadij removed the S-could-use-an-approval An approval may speed this PR merger (but is not required) label Dec 28, 2024
squid-anubis pushed a commit that referenced this pull request Dec 30, 2024
@squid-anubis squid-anubis added M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels M-failed-staging-other https://github.com/measurement-factory/anubis#pull-request-labels and removed M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels labels Dec 30, 2024
@kinkie (Contributor) commented Dec 30, 2024

@rousskov Anubis' reason for M-failed-staging-other is

```
PR1966 (head: df83b6 staged: 387bcd history: load,merge,update), staged: 387bcd) process() failure: HttpError: Changes must be made through a pull request.
Required status checks "build-tests(ubuntu-22.04,clang,default)", "build-tests(ubuntu-22.04,clang,maximus)", "build-tests(ubuntu-22.04,clang,minimal)", "build-tests(ubuntu-22.04,gcc,default)", "build-tests(ubuntu-22.04,gcc,maximus)", "build-tests(ubuntu-22.04,gcc,minimal)", "functionality-tests", and "source-maintenance-tests" were not set by the expected GitHub apps.
  at fetch.then (/srv/anubis/anubis/node_modules/@octokit/request/dist-node/index.js:86:21)
  at process._tickCallback (internal/process/next_tick.js:68:7)
```

Any idea what it could be? Nothing has changed from past practice here.

@kinkie (Contributor) commented Dec 30, 2024

Trying a rebase to remove the requirement for the CodeQL check and clear the slate on pull request checks.

@rousskov (Contributor, Author) commented Dec 30, 2024

> Any idea what it could be?

@eduard-bagdasaryan, please assist @kinkie in fixing CI. We have a bunch of PRs that are ready for merging, but Anubis has started reporting failures for their staged commits. I have not checked why. I do not know whether the current Anubis configuration and the currently running Anubis version are correct.

> Nothing has changed from past practice here

@kinkie, I agree that relevant past practices remain unchanged, but let's focus on something we can fix. I bet that recent CI problems were triggered by changes in GitHub repository configuration and/or the Anubis environment, or by their side effects. According to your recent comments, GitHub configuration and Anubis configuration have been changed (at least). I do not know what the (side) effects of those changes were, but I hope that, with Eduard's help, CI functionality will be restored.

@kinkie (Contributor) commented Dec 30, 2024

> @rousskov Anubis' reason for M-failed-staging-other is
>
> PR1966 (head: df83b6 staged: 387bcd history: load,merge,update), staged: 387bcd) process() failure: HttpError: Changes must be made through a pull request.
> Required status checks "build-tests(ubuntu-22.04,clang,default)", "build-tests(ubuntu-22.04,clang,maximus)", "build-tests(ubuntu-22.04,clang,minimal)", "build-tests(ubuntu-22.04,gcc,default)", "build-tests(ubuntu-22.04,gcc,maximus)", "build-tests(ubuntu-22.04,gcc,minimal)", "functionality-tests", and "source-maintenance-tests" were not set by the expected GitHub apps.
>   at fetch.then (/srv/anubis/anubis/node_modules/@octokit/request/dist-node/index.js:86:21)
>   at process._tickCallback (internal/process/next_tick.js:68:7)
>
> Any idea what it could be? Nothing has changed from past practice here.

These errors are documented here; however, these statuses ARE set by GitHub Actions, which is a GitHub app.
Hopefully the rebase will clear things up.

@rousskov (Contributor, Author) commented Dec 30, 2024

> These errors are documented here,

Nice find, thank you!

> however, these statuses ARE set by GitHub Actions, which is a GitHub app.

I am not sure this assertion is true: IIRC, Anubis copies some of the PR status checks to the staged commit. While nothing has changed in Anubis recently, it is possible that GitHub started to validate status check sources, either because of GitHub internal improvements or because of our GitHub configuration changes (both are unknown to me). For copied statuses, the source app might be Anubis!

I do not know whether Anubis copies the statuses in question (e.g., build-tests(ubuntu-22.04,clang,default)). I am guessing that it should not copy them (today) because our GitHub configuration repeats/performs quick.yaml tests for staged commits. @eduard-bagdasaryan, are these statuses copied by Anubis?

@eduard-bagdasaryan (Contributor) commented

> are these statuses copied by Anubis

Here is the list of 387bcd6 statuses copied by Anubis:

    "description": "build-tests(ubuntu-22.04,gcc,minimal) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,gcc,maximus) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,gcc,default) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,clang,default) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,clang,minimal) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,clang,maximus) (copied from PR by Anubis)",
    "description": "source-maintenance-tests (copied from PR by Anubis)",
    "description": "functionality-tests (copied from PR by Anubis)",

The error message generated by Anubis happens during merge and lists the same statuses:

[2024-12-30T15:35:53.805Z]  INFO: anubis/2416968 on bm2: PR1966 (head: df83b6 staged: 387bcd history: load,merge,update) merging to base...
[2024-12-30T15:35:53.805Z]  INFO: anubis/2416968 on bm2: PR1966 (head: df83b6 staged: 387bcd history: load,merge,update) allow merge staged due to M-cleared-for-merge overruling guarded_run option
[2024-12-30T15:35:54.115Z]  INFO: anubis/2416968 on bm2:
PR1966 (head: df83b6 staged: 387bcd history: load,merge,update), staged: 387bcd) process() failure: HttpError: Changes must be made through a pull request. Required status checks "build-tests(ubuntu-22.04,clang,default)", "build-tests(ubuntu-22.04,clang,maximus)", "build-tests(ubuntu-22.04,clang,minimal)", "build-tests(ubuntu-22.04,gcc,default)", "build-tests(ubuntu-22.04,gcc,maximus)", "build-tests(ubuntu-22.04,gcc,minimal)", "functionality-tests", and "source-maintenance-tests" were not set by the expected GitHub apps.

> GitHub started to validate status check sources

I have a feeling that since we have changed nothing, something like that happened.

@kinkie (Contributor) commented Dec 31, 2024

> GitHub started to validate status check sources
>
> I have a feeling that since we have changed nothing, something like that happened.

It is possible to instruct GitHub to disregard the origin of the checks; if the PR currently being assessed shows this behaviour, I'll try that.

@eduard-bagdasaryan (Contributor) commented

> were not set by the expected GitHub apps.

Perhaps we need to adjust the protected branch (master) settings, e.g., add the "Anubis" source (if it is there among other checkboxes) or set it to "any source". I cannot do it myself since I do not have permissions.

@kinkie (Contributor) commented Dec 31, 2024

PR #1972 landed nicely. I suppose there is some stale state somewhere. Let's try to force things through. Worst case, we can redo the PRs.

@kinkie kinkie removed the M-failed-staging-other https://github.com/measurement-factory/anubis#pull-request-labels label Dec 31, 2024
squid-anubis pushed a commit that referenced this pull request Dec 31, 2024
@squid-anubis squid-anubis added the M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels label Dec 31, 2024
@squid-anubis squid-anubis removed M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels labels Dec 31, 2024
@squid-anubis squid-anubis added the M-merged https://github.com/measurement-factory/anubis#pull-request-labels label Dec 31, 2024