
Add more exit codes to detect interruption reason #764

Merged
9 commits merged into webrecorder:main on Feb 10, 2025

Conversation

benoit74
Contributor

@benoit74 benoit74 commented Feb 10, 2025

Fix #584

Following the recent changes to exit-code handling, and as suggested, I propose refining the exit codes a bit further, especially so that we are better informed of situations where we would like to indicate something to the end user in Kiwix usage.

I replaced interrupted and browserCrashed properties of the crawler with a single interrupt_reason property, whose value is an enum indicating (when possible) the interruption reason.
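To illustrate the shape of this change, here is a minimal sketch (the property and enum names follow the PR description, but the enum members and the setter are invented for illustration, not the actual crawler source):

```typescript
// Illustrative sketch only: a single interruptReason property replaces
// the old `interrupted` and `browserCrashed` boolean flags.
// Enum member names are invented, not the crawler's real values.
enum InterruptReason {
  Cancelled = "cancelled",
  SignalInterrupted = "signalInterrupted",
  BrowserCrashed = "browserCrashed",
}

class CrawlerStateSketch {
  interruptReason?: InterruptReason;

  // Record the first interruption reason; later calls keep the original,
  // so the exit code reflects what happened first.
  setInterruptReason(reason: InterruptReason): void {
    if (!this.interruptReason) {
      this.interruptReason = reason;
    }
  }
}
```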

I slightly changed the cancellation message, since it now repeated "gracefully finishing current pages".

These changes made me realize there may be a flaw in the serializeAndExit function: according to my code analysis, the interrupted property is never set when this function is called, unless I am missing something.

Tested locally and it works as expected; some cases (the ones reasonably easy to reproduce) are covered by automated tests.

Finally, there is a small documentation fix.

@benoit74 benoit74 marked this pull request as draft February 10, 2025 11:47
Store interrupt reason directly instead of interrupted + browser crashed
flags, use it to infer the proper exit code, and add more exit codes, one
per interruption reason.

Other small changes:
- Rename the `InterruptedGraceful` exit code to `Cancelled` for clarity /
  consistency with the redis operation
- Rename the `InterruptedImmediate` exit code to `Interrupted`, since there
  is no longer any possible confusion with `Cancelled`, and other graceful
  interruptions have their own exit codes
- Fix handling of SIGINT in serializeAndExit, for which no interrupted
  value was set
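The reason-to-exit-code mapping described in the commit message could be sketched like this (the numeric codes below are invented placeholders, not the crawler's real values):

```typescript
// Hypothetical mapping from interruption reason to process exit code.
// Numeric values are placeholders for illustration only.
const EXIT_CODE_BY_REASON: Record<string, number> = {
  cancelled: 11,      // graceful stop requested (e.g. via redis)
  interrupted: 13,    // immediate interrupt (e.g. second SIGINT)
  browserCrashed: 14, // browser process died
};

function exitCodeFor(reason?: string): number {
  if (reason === undefined) {
    return 0; // no interruption: normal successful exit
  }
  return EXIT_CODE_BY_REASON[reason] ?? 1; // unknown reason: generic failure
}
```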
@benoit74 benoit74 marked this pull request as ready for review February 10, 2025 12:38
@benoit74
Contributor Author

@tw4l @ikreymer this is ready for review

@ikreymer
Member

Thanks, overall looks good! I actually have another PR that conflicts with this, but I will clean it up.

These changes made me realize that there was maybe a flaw in the serializeAndExit function since according to my code analysis the interrupted property is never set when this function is called, unless I miss something.

It's possible for interrupted to be true if an interrupt request was made but the crawler has not yet exited.
The first interrupt attempt waits for the crawler to exit gracefully (finish current pages); the second shuts down the browser immediately.

ikreymer and others added 4 commits February 10, 2025 10:16
- only attempt to close browser if not browser crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
- keep browser.crashed flag in Browser, skip close if crashed
- healthchecker checks browser.crash
- keep markBrowserCrashed()
@tw4l
Member

tw4l commented Feb 10, 2025

Nice to see this improvement! It would be great to document these exit codes in the crawler documentation so that folks don't have to dig into the code to understand what the codes mean.

@benoit74
Contributor Author

It's possible for interrupted to be true, if an interrupt request was made, but the crawler has not yet exited.
The first interrupt attempt will wait for crawler to exit gracefully (finish current pages), the second will shut down the browser immediately.

Indeed, my bad.

It would be great to document these exit codes in the crawler documentation so that folks don't have to dig into the code to understand what the codes mean.

Definitely, I will tackle this as well, probably in a distinct PR so this code change can reach a release asap should the documentation take more discussion than expected ... Where should it be placed? Should we create a new Exit Codes page in the User Guide, right after the Outputs page, or add it as a section of the Outputs page?

@tw4l
Member

tw4l commented Feb 10, 2025

Definitely, I will tackle this as well, probably in a distinct PR so this code change can reach a release asap should the documentation take more discussion than expected ... Where should it be placed? Should we create a new Exit Codes page in the User Guide, right after the Outputs page, or add it as a section of the Outputs page?

Thank you! Much appreciated :) I think a new Exit Codes section would be the simplest/most easily discovered way to document this.

Member

@tw4l tw4l left a comment


Looks good, thanks for the contribution!

@ikreymer
Copy link
Member

Yep, looks good! I merged some changes that re-added browser.crashed, which is also used by the healthchecker.
Tested with a crawler that was getting stuck per #763, and it seems to be good now.

Can add docs in a follow-up, this should be good to go!

src/main.ts (outdated)

-    crawler.gracefulFinishOnInterrupt();
+    if (!crawler.interruptReason) {
+      logger.info("SIGNAL: interrupt request received, finishing current pages before exit...");
+      crawler.gracefulFinishOnInterrupt(InterruptReason.Cancelled);
Member


Actually, this isn't quite correct: this isn't necessarily a cancellation.
When we run in k8s, an interrupt does not mean the crawl is cancelled; it means the crawler should stop and possibly restart.
The signal could come from the user or from the k8s system (memory pressure, etc.).

@ikreymer
Member

ikreymer commented Feb 10, 2025

Ah, but there is some confusion between cancellation and interruption.
The way it works is: the first interrupt signal results in waiting for the crawler to finish current pages; the second interrupt signal results in immediate termination.

However, this does not mean the crawl is cancelled, especially in k8s; it means the crawler needs to be restarted (for whatever reason). It could mean that it is being moved to a different node, etc. It is very important that deletion of the pod does not actually cancel the crawl automatically. The 'cancel' state is only entered through a special message sent via redis.
The exit codes should differentiate between the graceful and the immediate interrupt (though on k8s, it may only ever be the second one anyway). When a crawl is cancelled by request of the user, the exit code is actually success (so the pod may shut down, nothing more to do!).

This is slightly more confusing because we need to support different environments, both k8s and plain docker/podman.

Note: it might make sense to revisit cancellation to make it clearer, but that can be done as a follow-up.
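As a sketch of the semantics described above (the names and the non-zero code are invented for illustration): a redis-driven cancel is a successful exit, while a signal interrupt means the crawl should resume later.

```typescript
// Cancellation is an explicit redis-driven request and exits successfully;
// a signal interrupt (e.g. k8s pod eviction) means the crawler should be
// restarted later. Illustrative only, not the crawler's actual logic.
type StopCause = "redisCancel" | "signalInterrupt";

function podExitCode(cause: StopCause): number {
  if (cause === "redisCancel") {
    return 0; // user cancelled: success, pod may shut down for good
  }
  return 11; // invented non-zero code: crawl should resume elsewhere
}
```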

@ikreymer ikreymer merged commit fc56c2c into webrecorder:main Feb 10, 2025
2 checks passed
Successfully merging this pull request may close these issues:

Better indicate the interruption reason