From 52c10624754c6eaa3707eea2f5f18bafe0e97eca Mon Sep 17 00:00:00 2001
From: bruntib
Date: Wed, 8 Apr 2026 15:11:45 +0200
Subject: [PATCH] [doc] Extend task management docs

Some further information was written in the commit message of the task
management implementation. This is now saved in the official documentation.
---
 docs/web/background_tasks.md | 37 +++++++++++++++++++++++++----
 docs/web/server_config.md    | 46 +++++++++++++++++++++++++++++++++++-
 2 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/docs/web/background_tasks.md b/docs/web/background_tasks.md
index b53fedf176..2d0d167b20 100644
--- a/docs/web/background_tasks.md
+++ b/docs/web/background_tasks.md
@@ -32,10 +32,13 @@ Tasks are generally spawned by API handlers, executed in the control flow of a T
 
 1. An **API** request arrives (later, this might be extended with a _`cron`_ -like scheduler) which exercises an endpoint that results in the need for a task.
 2. _(Optionally)_ some conformance checks are executed on the input, in order to not even create the task if the input is ill-formed.
-3. A task **`token`** is _`ALLOCATED`_: the record is written into the database, and now we have a unique identifier for the task.
-4. The task is **pushed** to the _task queue_ of the CodeChecker server, resulting in the _`ENQUEUED`_ status.
-5. The task's identifier **`token`** is returned to the user.
+3. A task **`token`** is _`ALLOCATED`_: the **`BackgroundTask`** record is written into the database, and now we have a unique identifier for the task.
+4. The task is **pushed** to a shared, synchronised _task queue_ of the CodeChecker server, resulting in the _`ENQUEUED`_ status.
+    * `AbstractTask` subclasses **MUST** be `pickle`-able and reasonably small.
+    * The library offers means to store additional large data on the file system, in a temporary directory specific to the task.
+5. 
The **`task token`** is returned to the user via the RPC API call, and the API worker is free to respond to other requests.
 6. The API hander exits and the Thrift RPC connection is terminated.
+7. If the user wishes to know whether the task has completed, they poll the `getTaskInfo()` API in a loop at some frequency (each call is executed in the context of any _API worker_ process, synchronised over the database).
 
 The API request dispatching of the CodeChecker server has a **`TaskManager`** instance which should be passed to the API handler implementation, if not already available. Then, you can use this _`TaskManager`_ object to perform the necessary actions to enqueue the execution of a task:
 
@@ -118,7 +121,7 @@ The business logic of tasks are implemented by subclassing the _`AbstractTask`_
 4. The implementation does its thing, periodically calling _`task_manager.heartbeat()`_ to update the progress timestamp of the task, and, if appropriate, checking with _`task_manager.should_cancel()`_ whether the admins requested the task to cancel or the server is shutting down.
 5. If _`should_cancel()`_ returned `True`, the task does some appropriate clean-up, and exits by raising the special _`TaskCancelHonoured`_ exception, indicating that it responded to the request. (At this point, the status becomes either _`CANCELLED`_ or _`DROPPED`_, depending on the circumstances of the service.)
 6. Otherwise, or if the task is for some reason not cancellable without causing damage, the task executes its logic.
-7. If the task's _`_implementation()`_ method exits cleanly, it reaches the _`COMPLETED`_ status; otherwise, if any exception escapes from the _`_implementation()`_ method, the task becomes _`FAILED`_.
+7. 
If the task's _`_implementation()`_ method exits cleanly, it reaches the _`COMPLETED`_ status; otherwise, if any exception escapes from the _`_implementation()`_ method, the task becomes _`FAILED`_, and exception information is logged into the `BackgroundTask.comments` column of the database.
 
 **Caution!** Tasks, executing in a separate background process part of the many processes spawned by a CodeChecker server, no longer have the ability to synchronously communicate with the user! This also includes the lack of ability to "return" a value: tasks **only exercise side-effects**, but do not calculate a "result".
 
@@ -170,6 +173,32 @@ class MyTask(AbstractTask):
             foo(element)
 ```
 
+### Abnormal path 1: admin cancellation
+
+At any point following the _`ALLOCATED`_ status, but most likely in the _`ENQUEUED`_ and _`RUNNING`_ statuses, a **`SUPERUSER`** may issue a _`cancelTask()`_ order.
+This will set `BackgroundTask.cancel_flag`, and the task is expected (although not required!) to poll its own _`should_cancel()`_ status internally at checkpoints, and terminate gracefully in response to this request. This is done by **`_implementation()`** exiting by raising a **`TaskCancelHonoured`** exception.
+(If the task does not raise one, it will be allowed to conclude normally, or fail in some other manner.)
+Tasks cancelled gracefully will have the _`CANCELLED`_ status.
+
+For example, a background task that performs an action over a set of input files generally should be implemented like this:
+
+```py3
+def _implementation(self, tm: TaskManager):
+    for file in INPUTS:
+        if tm.should_cancel(self):
+            ROLLBACK()
+            raise TaskCancelHonoured(self)
+
+        DO_LOGIC(file)
+```
+
+### Abnormal path 2: server shutdown
+
+Alternatively, at any point in this life cycle, the server might receive the command to terminate itself (kill signals `SIGINT`, `SIGTERM`; alternatively caused by `CodeChecker server --stop`). Following the termination of _API workers_, the _background workers_ will also shut down one by one.
+At this point, the default behaviour is to cause a special _cancel event_ which tasks currently _`RUNNING`_ may still gracefully honour, as if it were a `SUPERUSER`'s single-task cancel request. All other tasks that have not started executing yet and are in the _`ALLOCATED`_ or _`ENQUEUED`_ status will never start.
+
+All tasks not in a _normal termination state_ will be set to the _`DROPPED`_ status, with the `comments` field recording the state in which the task was dropped, and why. (Together, _`CANCELLED`_ and _`DROPPED`_ are the _"abnormal termination states"_, indicating that the task terminated due to some external influence.)
+
 Client-side handling
 --------------------
 
diff --git a/docs/web/server_config.md b/docs/web/server_config.md
index 23f6e98ffd..1f1bf8e217 100644
--- a/docs/web/server_config.md
+++ b/docs/web/server_config.md
@@ -9,15 +9,28 @@ using the package's installed `config/server_config.json` as a template.
 
 Table of Contents
 =================
+* [Task handling](#task-handling)
+  * [Number of API worker processes](#number-of-api-worker-processes)
+  * [Number of task worker processes](#number-of-task-worker-processes)
+  * [Run limitation](#run-limitation)
 * [Storage](#storage)
   * [Directory of analysis statistics](#directory-of-analysis-statistics)
   * [Limits](#Limits)
   * [Maximum size of failure zips](#maximum-size-of-failure-zips)
   * [Size of the compilation database](#size-of-the-compilation-database)
+  * [Keepalive](#keepalive)
+  * [Idle time](#idle-time)
+  * [Interval time](#interval-time)
+  * [Probes](#probes)
 * [Authentication](#authentication)
+* [Secrets](#secrets)
+  * [server_secrets.json](#server_secretsjson)
+  * [Environmental variables](#environmental-variables)
+
+## Task handling
 
-## Number of API worker processes
+### Number of API worker processes
 The `worker_processes` section of the config file controls how many
 processes will be started on the server to process API requests.
 
@@ -33,6 +46,37 @@ processes will be started on the server to process background jobs.
 
 The server needs to be restarted if the value is changed in the config file.
 
+### `--machine-id`
+Unfortunately, servers don't always terminate gracefully (e.g., when killed by
+`SIGKILL`, or when the container, VM, or host machine simply dies during
+execution, in ways the server is not able to handle). Because tasks are not
+shared across server processes, and there are crucial bits of information in
+the now dead process's memory which would have been needed to execute the task,
+a server later restarting in place of a previously dead one should be able to
+identify which tasks its "predecessor" left behind without clean-up.
+
+This is achieved by storing the running computer's identifier, configurable via
+`CodeChecker server --machine-id`, as an additional piece of information for
+each task. By default, the machine ID is constructed from
+`gethostname():portnumber`, e.g., `cc-server:80`.
+
+In containerised environments, relying on `gethostname()` may not be entirely
+stable! For example, Docker exposes the first 12 digits of the container's
+unique hash as the _"hostname"_ seen inside the container. If the
+container is started with `--restart always` or `--restart unless-stopped`, then
+this is fine; however, more advanced systems, such as _Docker swarm_, will
+**create a new container** in case the old one died (!), resulting in a new
+value of `gethostname()`.
+
+In such environments, service administrators must exercise additional caution and
+configure their instances by setting `--machine-id` for subsequent executions of
+the "same" server accordingly. 
If a server with machine ID **`M`** starts up
+(usually after a container or "system" restart), it will set every task not in
+a "termination state" and associated with machine ID **`M`** to the
+_`DROPPED`_ status (with an appropriately formatted accompanying comment),
+signifying that the _previous instance_ "dropped" these tasks, but had no chance
+of recording this fact.
+
 ## Run limitation
 The `max_run_count` section of the config file controls how many runs can be
 stored on the server for a product.