-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: massStoreRunAsynchronous()
#4326
base: master
Are you sure you want to change the base?
feat: massStoreRunAsynchronous()
#4326
Conversation
This patch implements the whole support ecosystem for server-side background tasks, in order to help lessen the load (and blocking) of API handlers in the web-server for long-running operations. A **Task** is represented by two things in strict co-existence: a lightweight, `pickle`-able implementation in the server's code (a subclass of `AbstractTask`) and a corresponding `BackgroundTask` database entity, which resides in the "configuration" database (shared across all products). A Task is created by API request handlers and then the user is instructed to retain the `TaskToken`: the task's unique identifier. Following, the server will dispatch execution of the object into a background worker process, and keep status synchronisation via the database. Even in a service cluster deployment, load balancing will not interfere with users' ability to query a task's status. While normal users can only query the status of a single task (which is usually automatically done by client code, and not the user manually executing something); product administrators, and especially server administrators have the ability to query an arbitrary set of tasks using the potential filters, with a dedicated API function (`getTasks()`) for this purpose. Tasks can be cancelled only by `SUPERUSER`s, at which point a special binary flag is set in the status record. However, to prevent complicating inter-process communication, cancellation is supposed to be implemented by `AbstractTask` subclasses in a co-operative way. The execution of tasks in a process and a `Task`'s ability to "communicate" with its execution environment is achieved through the new `TaskManager` instance, which is created for every process of a server's deployment. Unfortunately, tasks can die gracelessly if the server is terminated (either internally, or even externally). For this reason, the `DROPPED` status will indicate that the server has terminated prior to, or during a task's execution, and it was unable to produce results. The server was refactored significantly around the handling of subprocesses in order to support various server shutdown scenarios. Servers will start `background_worker_processes` number of task handling subprocesses, which are distinct from the already existing "API handling" subprocesses. By default, if unconfigured, `background_worker_processes` is equal to `worker_processes` (the number of API processes to spawn), which is equal to `$(nproc)` (CPU count in the system). This patch includes a `TestingDummyTask` demonstrative subclass of `AbstractTask` which counts up to an input number of seconds, and each second it gracefully checks whether it is being killed. The corresponding testing API endpoint, `createDummyTask()` can specify whether the task should simulate a failing status. This endpoint can only be used from, but is used extensively, the unit testing of the project. This patch does not include "nice" or "ergonomic" facilities for admins to manage the tasks, and so far, only the server-side of the corresponding API calls are supported.
This patch extends `CodeChecker cmd` with a new sub-command, `serverside-tasks`, which lets users and administrators deal with querying the status of running server-side tasks. By default, the CLI queries the information of the task(s) specified by their token(s) in the `--token` argument from the server using `getTaskInfo(token)`, and shows this information in either verbose "plain text" (available if precisely **one** task was specified), "table" or JSON formats. In addition to `--token`, it also supports 19 more parameters, each of which correspond to a filter option in the `TaskFilter` API type. If any filters in addition to `--token` is specified, it will exercise `getTasks(filter)` instead. This mode is only available to administrators. The resulting, more detailed information structs are printed in "table" or JSON formats. Apart from querying the current status, two additional flags are available, irrespective of which query method is used to obtain a list of "matching tasks": * `--kill` will call `cancelTask(token)` for each task. * `--await` will block execution until the specified task(s) terminate (in one way or another). `--await` is implemented by calling the new **`await_task_termination`** library function, which is implemented with the goal of being reusable by other clients later.
627f70b
to
1d1aacc
Compare
Separate the previously blocking execution of `massStoreRun()`, which was done in the context of the "API handler process", into a "foreground" and a "background" part, exploiting the previously implemented background task library support. The foreground part remains executed in the context of the API handler process, and deals with receiving and unpacking the to-be-stored data, saving configuration and checking constraints that are cheap to check. The foreground part can report issues synchronously to the client. Everything else that was part of the previous `massStoreRun()` pipeline, as implemented by the `mass_store_run.MassStoreRun` class becomes a background task, such as the parsing of the uploaded reports and the storing of data to the database. This background task, implemented using the new library, executes in a separate background worker process, and can not communicate directly with the user. Errors are logged to the `comments` fields. The `massStoreRun()` API handler will continue to work as previously, and block while waiting for the background task to terminate. In case of an error, it synchronously reports a `RequestFailed` exception, passing the `comments` field (into which the background process had written the exception details) to the client. Due to the inability for most of the exceptions previously caused in `MassStoreRun` to "escape" as `RequestFailed`s, some parts of the API had been deprecated and removed. Namely, `ErrorCode.SOURCE_FILE` and `ErrorCode.REPORT_FORMAT` are no longer sent over the API. This does not break existing behaviour and does not cause an incompatibility with clients: in cases where the request exceptions were raised earlier, now a different type of exception is raised, but the error message still precisely explains the problem as it did previously.
cbe234b
to
ffc6bc1
Compare
Even though commit d915473 introduced a socket-level TCP keepalive support into the server's implementation, this was observed multiple times to not be enough to **deterministically** fix the issues with the `CodeChecker store` client hanging indefinitely when the server takes a long time processing the to-be-stored data. The underlying reasons are not entirely clear and the issue only pops up sporadically, but we did observe a few similar scenarios (such as multi-million report storage from analysing LLVM and then storing between datacentres) where it almost reliably reproduces. The symptoms (even with a configure `kepalive`) generally include the server not becoming notified about the client's disconnect, while the client process is hung on a low-level system call `read(4, ...)`, trying to get the Thrift response of `massStoreRun()` from the HTTP socket. Even if the server finishes the storage processing "in time" and sent the Thrift reply, it never reaches the client, which means it never exits from the waiting, which means it keeps either the terminal or, worse, a CI script occupied, blocking execution. This is the "more proper solution" foreshadowed in commit 15af7d8. Implemented the server-side logic to spawn a `MassStoreRun` task and return its token, giving the `massStoreRunAsynchronous()` API call full force. Implemented the client-side logic to use the new `task_client` module and the same logic as `CodeChecker cmd serverside-tasks --await --token TOKEN...` to poll the server for the task's completion and status.
ffc6bc1
to
e30fcc2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this patch and it seems to me that the storage performance is lower than with the 6.24.1 release.
On single storage thread storing the same set of xerces reports with 6.24.1 server the storage time is ~9s (on the client) and with this version is around ~19s. See the details below.
Did you observe similar behaviour? What could be the reason?
Otherwise the queue handling and the server seemed stable.
Reproduction:
- 6.24.1 client 6.24.1 server:
time CodeChecker store /workspace/test-projects/xerces-c/reports --url http://localhost:8002/Default -f -n testing
----======== Summary ========----
-------------------------------------------------------
Number of processed analyzer result files | 700
Number of analyzer reports | 1322
Number of source files | 0
Number of source files with source code comments | 0
Number of blame information files | 0
-------------------------------------------------------
----=================----
[INFO 2024-09-23 18:27] - Compressing report zip file...
[INFO 2024-09-23 18:27] - Compressing report zip file done (8.3MiB / 238.5KiB).
[INFO 2024-09-23 18:27] - Storing results to the server...
[INFO 2024-09-23 18:28] - Storage finished successfully.
real 0m9.046s
user 0m9.662s
sys 0m1.812s
- 6.24.1 client this patch version server
ime CodeChecker store /workspace/test-projects/xerces-c/reports --url http://localhost:8001/Default -f -n testing
[INFO 2024-09-23 18:29] - Storing analysis results for run 'testing
----======== Summary ========----
-------------------------------------------------------
Number of processed analyzer result files | 700
Number of analyzer reports | 1322
Number of source files | 0
Number of source files with source code comments | 0
Number of blame information files | 0
-------------------------------------------------------
----=================----
[INFO 2024-09-23 18:29] - Compressing report zip file...
[INFO 2024-09-23 18:29] - Compressing report zip file done (8.3MiB / 238.8KiB).
[INFO 2024-09-23 18:29] - Storing results to the server...
[INFO 2024-09-23 18:29] - Storage finished successfully.
real 0m19.856s
user 0m9.905s
sys 0m1.585s
- this version client - this version server
time CodeChecker store /workspace/test-projects/xerces-c/reports --url http://localhost:8001/Default -n testing
----======== Summary ========----
-------------------------------------------------------
Number of processed analyzer result files | 700
Number of analyzer reports | 1322
Number of source files | 0
Number of source files with source code comments | 0
Number of blame information files | 0
-------------------------------------------------------
----=================----
[INFO 2024-09-23 18:30] - Compressing report zip file...
[INFO 2024-09-23 18:30] - Compressing report zip file done (8.3 MiB / 238.8 KiB).
[INFO 2024-09-23 18:30] - Storing results to the server ...
[INFO 2024-09-23 18:30] - Reports submitted to the server for processing.
[INFO 2024-09-23 18:30] - Checking local passwords or tokens in /home/ednikru/.codechecker.passwords.json
[INFO 2024-09-23 18:30] - Checking for local valid sessions.
[INFO 2024-09-23 18:30] - Storing the reports finished successfully.
real 0m15.813s
user 0m9.852s
sys 0m1.668s
------------------------
I also tested this with the performance tester tool
`python3 /workspace/test/codechecker/scripts/test/run_server_performance_test.py /workspace/test-projects/xerces-c/reports --url http://localhost:8001/Default -o /tmp/codechecker_performance_625.log -u 10 -r 1`
iff /tmp/codechecker_performance_6241.log /tmp/codechecker_performance_625.log
2,11c2,11
< 2,15.245582,6.167334,3.281972,0.927049,0.845769
< 5,15.377179,5.008921,3.431427,1.165954,0.953085
< 9,15.298618,6.457655,3.51786,1.46451,3.273755
< 4,18.013311,5.154086,4.70034,0.84808,2.343666
< 8,16.559763,5.145486,3.394403,1.334375,0.92755
< 3,19.000909,5.082276,3.441946,1.04042,0.845659
< 7,20.805081,4.975545,3.223493,0.853344,1.445693
< 10,24.562988,5.192637,3.201599,0.960122,0.814222
< 1,10.985903,5.026051,3.286415,1.072666,0.704679
< 6,11.313735,4.791362,3.351084,0.939348,0.935826
7,29.80368,5.451236,4.618323,2.434601,2.163928
8,40.068657,6.56116,4.520448,2.319387,1.735899
10,40.240583,5.458642,3.72705,1.445456,1.494371
4,41.033482,5.581793,3.885928,1.479313,1.115965
2,45.766335,5.268967,3.681706,2.609846,1.123228
5,55.890535,5.370175,3.754132,1.426652,1.461334
1,45.691527,5.271719,3.534699,1.287154,1.023857
3,60.307092,5.332355,3.763976,1.282365,0.998435
6,55.516671,5.226814,3.734352,1.628265,1.058211
The client is sleeping, as expected (as evident from
Not to a measurable extent, but I was not doing extremely strict measurements either.
It's a scheduling artefact. Consider that the original storage is about 10 seconds long. Now, let's suppose the unzipping takes virtually 0 time. What will happen, is that the client sends the mSRA() request, and gets the response, then sends a poll. This poll will naturally say What could also happen (but likely did not happen here; also, please try with The good news is that the minimum and maximum threshold for the polling sleep value can be tinkered with, both at the library level, and at the call site in This explains both the observations of 15-sec wall clock in the "both server and client patched" scenario, and the "basically 20-sec" wall clock in the "server patched, client not" scenario. N.B., that using a "legacy" client, and calling mSR(), the sleeping is done in the serverside API handler, and all of this fancy "variable resistor" behaviour of the sleep time calculation is skipped: there is a 5-sec sleep rhythm, but that's (as usual on non-realtime kernels) not a hard accurate thing. So the server is, once again, sleeping in 5 sec increments, basically mirroring what the client is close to doing. while True:
time.sleep(5)
t = self._task_manager.get_task_record(token)
Asymptotically, min=1sec could also be viable, but in large deployments, that would result in a lot of flooding early on. The optimisation question is, do we expect small or large workloads? Because for large workloads (where the server-side processing takes minutes, maybe tens of minutes), these changes will not be so impactful. It's only this impactful at, let's say, academically trivial workloads. |
I tested this implementation further with larger storages: ~156000 reports. The storage takes ~9 minutes on the 6.24.1 server and 10 minutes 46 seconds on the new server. So I approve the PR. See the detailed results below.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the concept of executing these long lasting storages as background jobs is very good as it makes it very clear to administrators what storage jobs are executed in parallel, gives a good overview of the storage times and it makes fixes the long resetting TCP connections problem too.
I think it would be a valuable addition in the future to be able to query the latest queuing time, to be able to detect (for operators) if the storage queu is very long and maybe increasing.
LGTM, nice work!
Important
⛔ Blocked by #4318.
This is patch 3 of the Asynchronous Store Protocol (#3672).
Closes #1306. Closes #2417. Closes #3671. Closes #3672.
Motivation
Even though #3167 introduced a socket-level TCP keepalive support into the server's implementation, this was observed multiple times to not be enough to deterministically fix the issues with the
CodeChecker store
client hanging indefinitely (#3672) when the server takes a long time processing the to-be-stored data. The underlying reasons are not entirely clear and the issue only pops up sporadically, but we did observe a few similar scenarios (such as multi-million report storage from analysing LLVM and then storing between data centres) where it almost reliably reproduces.The symptoms (even with a configured
keepalive
) generally include the server not becoming notified about the client's disconnect, while the client process is hung on a low-level system callread(4, ...)
, trying to get the Thrift response ofmassStoreRun()
from the HTTP socket. Even if the server finishes the storage processing "in time" and sends the Thrift reply, it never reaches the client, which means it never exits from the waiting, which means it keeps either the terminal or, worse, a CI script occupied, blocking execution.Solution, part 1. Backgrounding
massStoreRun()
To solve this, I separated the previously blocking execution of
massStoreRun()
, which was done in the context of the "API handler process", into a foreground and a background part, exploiting the background task library support implemented in #4317.The foreground part remains executed in the context of the API handler process, and deals with receiving and unpacking the to-be-stored data, saving configuration and checking constraints that are cheap to check.
The foreground part can report issues synchronously to the client.
Everything else that was part of the previous
massStoreRun()
pipeline, as implemented by themass_store_run.MassStoreRun
class becomes a background task, such as the parsing of the uploaded reports and the storing of data to the database.This background task, implemented using the new library, executes in a separate background worker process, and can not communicate directly with the user.
Errors are logged to the
comments
fields.Compatibility with older clients
The
massStoreRun()
API handler will continue to work as previously, and block while waiting for the background task to terminate, but even on the server side, it will use the new facilities to do so. In case of an error, it synchronously reports aRequestFailed
exception, passing thecomments
field (into which the background process had written the exception details) to the client.While this ensures compatibility with older clients, it is also completely "compatible" with and does not solve the previously seen bug!
Removal of special error tags from the transmission
Due to the inability of most of the exceptions previously caused in
MassStoreRun
to "escape" asRequestFailed
s, some parts of the API had been deprecated and removed.Namely,
ErrorCode.SOURCE_FILE
andErrorCode.REPORT_FORMAT
are no longer sent over the API.This does not break existing behaviour and does not cause incompatibility with clients: in cases where the request exceptions were raised earlier, now a different type of exception is raised, but the error message still precisely explains the problem as it did previously.
Solution, part 2. Exposing
massStoreRunAsynchronous()
This is the "more proper solution" foreshadowed in #4039.
Based on Part 1.'s (and earlier patches') achievements, I added the server-side implementation for
massStoreRunAsynchronous()
, which is almost trivial at this point: it performs the same spawning of the now-backgrounded task instance for the store processing and returns the task token via the API.At this point, the API handler process can fully return and start waiting for the next request.
As an added benefit, the memory associated with the potentially huge store ZIP (under the
b64zip
parameter ofmassStoreRun()
and thezipfile_blob
parameter ofmassStoreRunAsynchronous()
) is also freed at this point, because the contents of the ZIP is already extracted to storage, freeing not just "available workers" as a resource, but "memory" as well!The
CodeChecker store
client now uses the new API, receives the task token, and starts to poll the server (withCodeChecker cmd serverside-tasks --await --token TOKEN...
's infrastructure, as already established in #4318) for the status of the task.CodeChecker store
still blocks the terminal (or the executing script) until the store reaches a termination status.CodeChecker store --detach
Another benefit of putting the entire report processing logic put into the background (and removed from the API control flow) on the server is that the client can now safely close the TCP socket and disengage from the API once the server reports that it is ready to process the data. Notably, the API communication that initiates the store process,
massStoreRunAsynchronous()
, terminates when the task token is sent back to the client. Everything afterwards happens under new communication requests established by the client.This immediately allowed me to implement a small but powerful new feature into the client,
--detach
: if the user so wishes to NOT await the status of the server-side processing automatically, they can instruct theCodeChecker store
client to exit and return control to the terminal/script, once the server-side operation had begun.This is extremely useful in situations where running the
store
operation is the last action to be taken and no further commands (such ascmd results
orcmd diff
) are executed which might rely on the "success" or "completion" or "fulfilment" of the act. Consider a CI system which only updates the reports from themaster
branch in a scheduled way. These jobs can be sped up (by virtue of returning the entire allocated CI entity to a resting state) by providing--detach
.