REQMOD stuck when adapted request (body) is not forwarded #1966

Conversation

@rousskov (Contributor) commented Dec 19, 2024

Squid forwards request bodies using BodyPipe objects. A BodyPipe object
has two associated agents: producer and consumer. Those agents are set
independently, at different processing stages. If BodyPipe consumer is
not set, the producer may get stuck waiting for BodyPipe buffer space.
When producer creates a BodyPipe, it effectively relies on some code
somewhere to register a consumer (or declare a registration failure),
but automatically tracking that expectation fulfillment is impractical.
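
To make that contract concrete, here is a minimal, self-contained C++ sketch of a bounded pipe whose two agents are registered independently. All names in it (ToyBodyPipe, setConsumer(), expectNoConsumption()) are hypothetical illustrations of the pattern, not Squid's actual BodyPipe API:

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Illustrative stand-in for the producer/consumer contract described
// above; names and methods are hypothetical, not Squid's BodyPipe API.
class ToyBodyPipe {
public:
    explicit ToyBodyPipe(std::size_t capacity): capacity_(capacity) {}

    // producer side: returns false when the buffer is full; the real
    // producer would then wait for the consumer to free space
    bool produce(const std::string &data) {
        if (buffer_.size() + data.size() > capacity_)
            return false;
        buffer_ += data;
        return true;
    }

    // consumer side, registered independently (and possibly never)
    void setConsumer() { hasConsumer_ = true; }

    // the explicit "no consumer will ever come" signal that lets a
    // waiting producer give up instead of waiting forever
    void expectNoConsumption() { consumptionAborted_ = true; }

    // consumer frees buffer space, unblocking the producer
    void consume(std::size_t n) { buffer_.erase(0, n); }

    // the stuck state: nobody consumes, and nobody has declared that
    // consumption will never start
    bool stuckWaitingForConsumer() const {
        return !hasConsumer_ && !consumptionAborted_;
    }

private:
    std::string buffer_;
    std::size_t capacity_;
    bool hasConsumer_ = false;
    bool consumptionAborted_ = false;
};

int main() {
    ToyBodyPipe pipe(4);
    while (pipe.produce("x")) {} // fill the pipe; the producer now blocks
    if (pipe.stuckWaitingForConsumer())
        std::cout << "producer waits, but nobody registered a consumer\n";
    pipe.expectNoConsumption(); // declare the registration failure
    if (!pipe.stuckWaitingForConsumer())
        std::cout << "producer can now abort instead of waiting forever\n";
}
```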

For REQMOD transactions involving adapted request bodies, including ICAP
204 transactions, Client::startRequestBodyFlow() sets body consumer. If
that method is not called, there will be no consumer, and REQMOD may get
stuck. Many `if` statements can block request forwarding altogether or
block a being-forwarded request from reaching that method. For example,
adapted_http_access and miss_access denials block request forwarding.
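
Continuing the hypothetical ToyBodyPipe sketch above, the failure mode is roughly this: when forwarding is denied, the registration that Client::startRequestBodyFlow() would have performed never happens, so some code on the denial path must explicitly declare that no consumer is coming. This is an illustration of the control flow, not Squid's actual forwarding code:

```cpp
// Reuses ToyBodyPipe from the sketch above; illustration only.
void forwardAdaptedRequest(ToyBodyPipe &adaptedBody, bool denied)
{
    if (denied) {
        // adapted_http_access or miss_access denial: forwarding stops
        // before the Client::startRequestBodyFlow() analogue runs, so
        // without the next line the REQMOD producer waits forever
        adaptedBody.expectNoConsumption();
        return;
    }

    adaptedBody.setConsumer(); // Client::startRequestBodyFlow() analogue
}
```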

Without REQMOD, request processing can probably get stuck for similar
lack-of-consumer reasons, but regular requests ought to be killed by
various I/O or forwarding timeouts. There are no such timeouts for those
REQMOD transactions that are only waiting for request body consumer to
clear adapted BodyPipe space (e.g., after all ICAP 204 I/Os are over).
Relying on timeouts is also very inefficient.

For a `mgr:mem` observer, stuck REQMOD transactions look like a ModXact
memory leak. A `mgr:jobs` report shows ModXact jobs with RBS(1) status.

----

Cherry-picked SQUID-1030-ModXact-stuck-without-consumer-bag65 changes as
of commit 37cde62014bcc985e05eb6e985365417838564ba (effectively).
@rousskov (Contributor, Author) commented Dec 19, 2024

The bug was discovered in v6-based code. It can be reproduced in a lab using a 204-returning ICAP service by filling request BodyPipe buffer while denying that request forwarding using adapted_http_access or miss_access. The fix this PR is based on was tested using the affected v6-based deployment.
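
For concreteness, a lab configuration along those lines might look like the squid.conf sketch below. The service name and ICAP URL are placeholders, the ICAP service must be one that returns 204 responses, and a real repro also needs a request with a body large enough to fill the BodyPipe buffer:

```
# hypothetical squid.conf fragment for reproducing the stuck REQMOD
icap_enable on
icap_service svcReq reqmod_precache icap://icap.example.net:1344/reqmod
adaptation_access svcReq allow all

# deny forwarding of the adapted request so that
# Client::startRequestBodyFlow() never registers a body consumer
adapted_http_access deny all
# (a miss_access denial triggers the same condition)
```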

Implementing the added TODO requires significant and risky changes outside this bug fix's scope.

@rousskov rousskov added M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels S-could-use-an-approval An approval may speed this PR merger (but is not required) labels Dec 19, 2024
@yadij yadij removed the S-could-use-an-approval An approval may speed this PR merger (but is not required) label Dec 28, 2024
squid-anubis pushed a commit that referenced this pull request Dec 30, 2024
@squid-anubis squid-anubis added M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels M-failed-staging-other https://github.com/measurement-factory/anubis#pull-request-labels and removed M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels labels Dec 30, 2024
@kinkie (Contributor) commented Dec 30, 2024

@rousskov Anubis' reason for M-failed-staging-other is

```
PR1966 (head: df83b6 staged: 387bcd history: load,merge,update), staged: 387bcd) process() failure: HttpError: Changes must be made through a pull request.
Required status checks "build-tests(ubuntu-22.04,clang,default)", "build-tests(ubuntu-22.04,clang,maximus)", "build-tests(ubuntu-22.04,clang,minimal)", "build-tests(ubuntu-22.04,gcc,default)", "build-tests(ubuntu-22.04,gcc,maximus)", "build-tests(ubuntu-22.04,gcc,minimal)", "functionality-tests", and "source-maintenance-tests" were not set by the expected GitHub apps.
  at fetch.then (/srv/anubis/anubis/node_modules/@octokit/request/dist-node/index.js:86:21)
  at process._tickCallback (internal/process/next_tick.js:68:7)
```

Any idea what it could be? Nothing has changed from past practice here.

@kinkie (Contributor) commented Dec 30, 2024

Trying a rebase to remove the requirement for the CodeQL check and clear the slate on pull request checks.

@rousskov (Contributor, Author) commented Dec 30, 2024

> Any idea what it could be?

@eduard-bagdasaryan, please assist @kinkie in fixing CI. We have a bunch of PRs that are ready for merging, but Anubis has started reporting failures for their staged commits. I have not checked why. I do not know whether the current Anubis configuration and the currently running Anubis version are correct.

> Nothing has changed from past practice here

@kinkie, I agree that relevant past practices remain unchanged, but let's focus on something we can fix. I bet that recent CI problems were triggered by changes in GitHub repository configuration and/or the Anubis environment, or by their side effects. According to your recent comments, GitHub configuration and Anubis configuration have been changed (at least). I do not know what the (side) effects of those changes were, but I hope that, with Eduard's help, CI functionality will be restored.

@kinkie (Contributor) commented Dec 30, 2024

> @rousskov Anubis' reason for M-failed-staging-other is
>
> PR1966 (head: df83b6 staged: 387bcd history: load,merge,update), staged: 387bcd) process() failure: HttpError: Changes must be made through a pull request.
> Required status checks "build-tests(ubuntu-22.04,clang,default)", "build-tests(ubuntu-22.04,clang,maximus)", "build-tests(ubuntu-22.04,clang,minimal)", "build-tests(ubuntu-22.04,gcc,default)", "build-tests(ubuntu-22.04,gcc,maximus)", "build-tests(ubuntu-22.04,gcc,minimal)", "functionality-tests", and "source-maintenance-tests" were not set by the expected GitHub apps.
>   at fetch.then (/srv/anubis/anubis/node_modules/@octokit/request/dist-node/index.js:86:21)
>   at process._tickCallback (internal/process/next_tick.js:68:7)
>
> Any idea what it could be? Nothing has changed from past practice here.

These errors are documented here; however, these statuses ARE set by GitHub Actions, which is a GitHub app.
Hopefully the rebase will clear things up.

@rousskov (Contributor, Author) commented Dec 30, 2024

> These errors are documented here,

Nice find, thank you!

> however, these statuses ARE set by GitHub Actions, which is a GitHub app.

I am not sure this assertion is true: IIRC, Anubis copies some of the PR status checks to the staged commit. While nothing has changed in Anubis recently, it is possible that GitHub started to validate status check sources, either because of GitHub internal improvements or because of our GitHub configuration changes (both are unknown to me). For copied statuses, the source app might be Anubis!

I do not know whether Anubis copies the statuses in question (e.g., build-tests(ubuntu-22.04,clang,default)). I am guessing that it should not copy them (today) because our GitHub configuration repeats/performs quick.yaml tests for staged commits. @eduard-bagdasaryan, are these statuses copied by Anubis?

@eduard-bagdasaryan (Contributor) commented

> are these statuses copied by Anubis

Here is the list of 387bcd6 statuses copied by Anubis:

    "description": "build-tests(ubuntu-22.04,gcc,minimal) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,gcc,maximus) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,gcc,default) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,clang,default) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,clang,minimal) (copied from PR by Anubis)",
    "description": "build-tests(ubuntu-22.04,clang,maximus) (copied from PR by Anubis)",
    "description": "source-maintenance-tests (copied from PR by Anubis)",
    "description": "functionality-tests (copied from PR by Anubis)",

The error message generated by Anubis happens during merge and lists the same statuses:

[2024-12-30T15:35:53.805Z]  INFO: anubis/2416968 on bm2: PR1966 (head: df83b6 staged: 387bcd history: load,merge,update) merging to base...
[2024-12-30T15:35:53.805Z]  INFO: anubis/2416968 on bm2: PR1966 (head: df83b6 staged: 387bcd history: load,merge,update) allow merge staged due to M-cleared-for-merge overruling guarded_run option
[2024-12-30T15:35:54.115Z]  INFO: anubis/2416968 on bm2:
PR1966 (head: df83b6 staged: 387bcd history: load,merge,update), staged: 387bcd) process() failure: HttpError: Changes must be made through a pull request. Required status checks "build-tests(ubuntu-22.04,clang,default)", "build-tests(ubuntu-22.04,clang,maximus)", "build-tests(ubuntu-22.04,clang,minimal)", "build-tests(ubuntu-22.04,gcc,default)", "build-tests(ubuntu-22.04,gcc,maximus)", "build-tests(ubuntu-22.04,gcc,minimal)", "functionality-tests", and "source-maintenance-tests" were not set by the expected GitHub apps.

> GitHub started to validate status check sources

I have a feeling that since we have changed nothing, something like that happened.

@kinkie (Contributor) commented Dec 31, 2024

> GitHub started to validate status check sources
>
> I have a feeling that since we have changed nothing, something like that happened.

It is possible to instruct GitHub to disregard the origin of the checks; if the PR currently being assessed shows this behaviour, I'll try that.

@eduard-bagdasaryan (Contributor) commented

> were not set by the expected GitHub apps.

Perhaps we need to adjust the protected branch (master) settings, e.g., add the "Anubis" source (if it is there among other checkboxes) or set it to "any source". I cannot do it myself since I do not have permissions.

@kinkie (Contributor) commented Dec 31, 2024

PR #1972 landed nicely. I suppose there is some stale state somewhere. Let's try to force things through. Worst case, we can redo the PRs.

@kinkie kinkie removed the M-failed-staging-other https://github.com/measurement-factory/anubis#pull-request-labels label Dec 31, 2024
squid-anubis pushed a commit that referenced this pull request Dec 31, 2024
@squid-anubis squid-anubis added the M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels label Dec 31, 2024
@squid-anubis squid-anubis removed M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels labels Dec 31, 2024
@squid-anubis squid-anubis added the M-merged https://github.com/measurement-factory/anubis#pull-request-labels label Dec 31, 2024