From 5816f318743961f3e0eecaa2c5250aea4b98a4c7 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Thu, 26 May 2016 22:47:52 -0700
Subject: [PATCH 1/5] runtime: Replace "process is stopped" with "process exits"

proc(5) describes the following state entries in proc/[pid]/stat [1]
(for modern kernels):

* R Running
* S Sleeping in an interruptible wait
* D Waiting in uninterruptible disk sleep
* Z Zombie
* T Stopped (on a signal)
* t Tracing stop
* X Dead

and ps(1) has a bit more context [2] (for modern kernels):

* D uninterruptible sleep (usually IO)
* R running or runnable (on run queue)
* S interruptible sleep (waiting for an event to complete)
* T stopped by job control signal
* t stopped by debugger during the tracing
* X dead (should never be seen)
* Z defunct ("zombie") process, terminated but not reaped by its parent

So I expect "stopped" to mean "process still exists but is paused,
e.g. by SIGSTOP". And I expect "exited" to mean "process has finished
and is either a zombie or dead".

After this commit, 'git grep -i stop' only turns up poststop-hook
stuff, a reference in principles.md, a "stoppage" in LICENSE, and some
ChangeLog entries.

Also replace "container's process" with "container process" to match
usage in the rest of the repository. After this commit:

  $ git grep -i "container process" | wc -l
  16
  $ git grep -i "container's process" | wc -l
  1

Also reword status entries to avoid "running", which is less precise
in our spec (e.g. it also includes "sleeping", "waiting", ...).

Also removes a "them" leftover from a partial plural -> singular
reroll of be594153 (Split create and start, 2016-04-01, #384).

[1]: http://man7.org/linux/man-pages/man5/proc.5.html
[2]: http://man7.org/linux/man-pages/man1/ps.1.html

Signed-off-by: W. Trevor King
---
 runtime.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/runtime.md b/runtime.md
index 2c99f501d..4bcf7bfbe 100644
--- a/runtime.md
+++ b/runtime.md
@@ -15,9 +15,9 @@ This MUST be unique across all containers on this host.
 There is no requirement that it be unique across hosts.
 * **`status`**: (string) is the runtime state of the container.
 The value MAY be one of:
-    * `created` : the container has been created but the user-specified code has not yet been executed
-    * `running` : the container has been created and the user-specified code is running
-    * `stopped` : the container has been created and the user-specified code has been executed but is no longer running
+    * `created` : the container process has neither exited nor executed the user-specified code
+    * `running` : the container process has executed the user-specified code but has not exited
+    * `stopped` : the container process has exited
 
 Additional values MAY be defined by the runtime, however, they MUST be used to represent new runtime states not defined above.
 * **`pid`**: (int) is the ID of the main process within the container, as seen by the host.
@@ -55,8 +55,8 @@ The lifecycle describes the timeline of events that happen from when a container
    However, some actions might only be available based on the current state of the container (e.g. only available while it is started).
 4. Runtime's `start` command is invoked with the unique identifier of the container.
    The runtime MUST run the user-specified code, as specified by [`process`](config.md#process-configuration).
-5. The container's process is stopped.
-   This MAY happen due to them erroring out, exiting, crashing or the runtime's `kill` operation being invoked.
+5. The container process exits.
+   This MAY happen due to erroring out, exiting, crashing or the runtime's `kill` operation being invoked.
 6. Runtime's `delete` command is invoked with the unique identifier of the container.
    The container MUST be destroyed by undoing the steps performed during create phase (step 2).
 

From f0bbefbffef9ecfc9b6b182bf319b0779f296ebe Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Thu, 23 Jun 2016 14:25:00 -0700
Subject: [PATCH 2/5] runtime: Add 'creating' to state status

To distinguish between "we're still setting this container up" and
"we're finished setting up; you can call 'start' if you like".

Also reference the lifecycle steps, because you can't be too explicit.

Signed-off-by: W. Trevor King
---
 runtime.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/runtime.md b/runtime.md
index 4bcf7bfbe..5098eb3d2 100644
--- a/runtime.md
+++ b/runtime.md
@@ -15,9 +15,10 @@ This MUST be unique across all containers on this host.
 There is no requirement that it be unique across hosts.
 * **`status`**: (string) is the runtime state of the container.
 The value MAY be one of:
-    * `created` : the container process has neither exited nor executed the user-specified code
-    * `running` : the container process has executed the user-specified code but has not exited
-    * `stopped` : the container process has exited
+    * `creating` : the container is being created (step 2 in the [lifecycle](#lifecycle))
+    * `created` : the runtime has finished the [create operation](#create) (after step 2 in the [lifecycle](#lifecycle)), and the container process has neither exited nor executed the user-specified code
+    * `running` : the container process has executed the user-specified code but has not exited (after step 4 in the [lifecycle](#lifecycle))
+    * `stopped` : the container process has exited (step 5 in the [lifecycle](#lifecycle))
 
 Additional values MAY be defined by the runtime, however, they MUST be used to represent new runtime states not defined above.
 * **`pid`**: (int) is the ID of the main process within the container, as seen by the host.

From 1a962c0439cdde98c24208691d4708ad3ab8b7e4 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Thu, 23 Jun 2016 14:29:30 -0700
Subject: [PATCH 3/5] runtime: Only require 'pid' in the state for created/running statuses

Because during the 'creating' phase we may not have a container
process yet (e.g. if we're still reading the configuration or setting
up cgroups), and in the 'stopped' phase the PID is no longer
meaningful.

Signed-off-by: W. Trevor King
---
 runtime.md | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/runtime.md b/runtime.md
index 5098eb3d2..febb74e9d 100644
--- a/runtime.md
+++ b/runtime.md
@@ -7,13 +7,13 @@ Whether other entities using the same, or other, instance of the runtime can see
 
 ## State
 
-The state of a container MUST include, at least, the following properties:
+The state of a container includes the following properties:
 
-* **`ociVersion`**: (string) is the OCI specification version used when creating the container.
-* **`id`**: (string) is the container's ID.
+* **`ociVersion`** (string, required) is the OCI specification version used when creating the container.
+* **`id`** (string, required) is the container's ID.
 This MUST be unique across all containers on this host.
 There is no requirement that it be unique across hosts.
-* **`status`**: (string) is the runtime state of the container.
+* **`status`** (string, required) is the runtime state of the container.
 The value MAY be one of:
     * `creating` : the container is being created (step 2 in the [lifecycle](#lifecycle))
     * `created` : the runtime has finished the [create operation](#create) (after step 2 in the [lifecycle](#lifecycle)), and the container process has neither exited nor executed the user-specified code
@@ -21,12 +21,14 @@ The value MAY be one of:
     * `stopped` : the container process has exited (step 5 in the [lifecycle](#lifecycle))
 
 Additional values MAY be defined by the runtime, however, they MUST be used to represent new runtime states not defined above.
-* **`pid`**: (int) is the ID of the main process within the container, as seen by the host.
-* **`bundlePath`**: (string) is the absolute path to the container's bundle directory.
+* **`pid`** (int, required when `status` is `created` or `running`) is the ID of the main process within the container, as seen by the host.
+* **`bundlePath`**: (string, required) is the absolute path to the container's bundle directory.
 This is provided so that consumers can find the container's configuration and root filesystem on the host.
-* **`annotations`**: (map) contains the list of annotations associated with the container.
+* **`annotations`**: (map, required) contains the list of annotations associated with the container.
 If no annotations were provided then this property MAY either be absent or an empty map.
 
+The state MAY include additional properties.
+
 When serialized in JSON, the format MUST adhere to the following pattern:
 
 ```json

From 54ae256f10282c8f19551225ea65d0bd16598b7d Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Thu, 23 Jun 2016 14:37:00 -0700
Subject: [PATCH 4/5] runtime: Add an 'event' operation for subscribing to pushes

The current 'state' operation allows callers to poll for the state,
but for some workflows polling is inefficient (how frequently do you
poll to balance the cost of polling against the timeliness of the
response?) and push notifications make more sense.

The runtime's 'create' process is in a unique position to detect these
status transitions.

* As the actor carrying out container creation, it should have a clear
  idea of when that creation completes (for the 'created' event).
* It may set up a communication channel with the container process to
  orchestrate creation, and that channel may be used to report the
  start event.
* It knows (or is) the parent of the container process, and POSIX's
  wait(3), waitpid(3), and waitid(3) only work for child processes
  [1,2].
  From [1]:

    Nothing in this volume of POSIX.1-2008 prevents an implementation
    from providing extensions that permit a process to get status from
    a grandchild or any other process, but a process that does not use
    such extensions must be guaranteed to see status from only its
    direct children.

  So the runtime can set up (or be) the parent waiting on the container
  process, and arrange for the 'stopped' event to be published on
  container exit.

I've tried to phrase the requirements conservatively to allow for
runtimes that have to poll their kernel or some such to notice these
changes. I see the following runtime-support cases:

a. The runtime can easily supply a push-based event operation. In
   this case, exposing that operation to callers facilitates
   push-based workflows without much cost.

b. The runtime cannot supply a push-based event operation, and has to
   emulate it by polling. In this case, the runtime can pick a
   polling strategy that makes sense to its maintainers, and callers
   who aren't satisfied with that strategy can roll their own state
   poller without a big efficiency hit (in a lesser of two evils
   way). The requirement is currently worded so weakly that a
   runtime would be compliant with:

   1. Container process dies at noon.
   2. User calls 'state' at 5pm.
   3. Runtime checks kernel, and sees that the container process is
      dead.
   4. Runtime publishes 'stopped' event with a 5pm (and some
      microseconds) timestamp.
   5. Runtime returns state to the user with 'stopped' in 'status'.

   which is a pretty low bar.

c. The runtime could supply a push-based event operation, but it
   would be a lot of work. These runtimes can use polling (b) as a
   quick-and-dirty solution until someone has time to implement a
   push-based solution (a). The improvement is transparent to users,
   who can use the same event operation throughout and passively reap
   the benefits of the implementation improvements.
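
As a rough illustration of the polling case (b), here is a minimal
sketch of one polling step. It is not taken from any runtime: the
names event_type, poll_once, and read_status are hypothetical, and
read_status stands in for whatever the runtime uses to ask the kernel
for the current status. The event types are the ones this series
requires for status transitions, and the timestamp is the time the
poller noticed the change, not the time of the kernel transition.

```python
from datetime import datetime, timezone

# Event types for the status transitions this series requires.
TRANSITIONS = {
    ("creating", "created"): "created",
    ("created", "running"): "started",
}


def event_type(old, new):
    """Return the event type for a status change, or None if there is none."""
    if old == new:
        return None
    if new == "stopped":
        # Any transition into 'stopped' publishes a 'stopped' event.
        return "stopped"
    # Transitions outside the required set publish nothing in this sketch.
    return TRANSITIONS.get((old, new))


def poll_once(container_id, last_status, read_status, now=None):
    """Run one polling step and return (current_status, event_or_None).

    read_status is a hypothetical callable that maps a container ID to
    its current status string.
    """
    status = read_status(container_id)
    kind = event_type(last_status, status)
    if kind is None:
        return status, None
    # The timestamp matches the state transition seen by the poller.
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return status, {"type": kind, "id": container_id, "timestamp": stamp}
```

A runtime taking this route would call poll_once on whatever schedule
its maintainers prefer, feeding the returned status back in as
last_status and publishing any returned event to subscribers.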

Without an operation like this, higher levels that need to trigger on
these transitions without polling need to exercise a lot of control
over the system:

* They must be the parent of (or on Linux, the nearest
  PR_SET_CHILD_SUBREAPER ancestor of [3]) the create process, and the
  create process needs to exit after creation completes, if they want
  to block on the 'created' event.
* They must proxy all 'start' and 'delete' requests if they want to
  block on the 'started' or 'deleted' events.
* On Linux, they must be the nearest ancestor of the create operation
  to set PR_SET_CHILD_SUBREAPER if they want to block on the 'stopped'
  event.

By making 'event' a runtime requirement, we allow for efficient
cross-platform push-based workflows while avoiding the need for tight
orchestrator gate-keeping. Runtimes that have implementation
difficulties have an easy out that allows their callers to benefit
from future implementation improvements. And callers that are not
satisfied can always fall back to polling state or the
proxy/waitid/PR_SET_CHILD_SUBREAPER approaches.

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/wait.html
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/waitid.html
[3]: http://man7.org/linux/man-pages/man2/prctl.2.html

Signed-off-by: W. Trevor King
---
 runtime.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/runtime.md b/runtime.md
index febb74e9d..2e4e5121a 100644
--- a/runtime.md
+++ b/runtime.md
@@ -82,6 +82,33 @@ This operation MUST generate an error if it is not provided the ID of a containe
 Attempting to query a container that does not exist MUST generate an error.
 This operation MUST return the state of a container as specified in the [State](#state) section.
 
+### Event
+
+`event `
+
+This operation MUST generate an error if it is not provided the ID of a container.
+Attempting to query a container that does not exist MUST generate an error.
+This operation MUST subscribe the caller to push-notification about future events in chronological order.
+Events MUST be published when the container's [`status`](#state) changes, and MAY be published for additional events.
+Events MUST include, at least, the following properties:
+
+* **`type`** (string, required) The type of the event.
+  For the following [`status`](#state) transitions, the `type` values MUST be:
+
+  * `creating` → `created`: `created`
+  * `created` → `running`: `started`
+  * * → `stopped`: `stopped`
+
+  A `deleted` event MUST be published after a successful [delete operation](#delete), after which further events MUST NOT be generated.
+
+* **`id`** (string, required) The ID of the container which experienced the event.
+* **`timestamp`** (string, required) The time at which the event took place in [`date-time` format as specified by RFC 3339][rfc3339-s5.6].
+
+The `started` and `stopped` transitions happen in the kernel, and there may be a lag before the runtime notices.
+For example, if the container process dies, a runtime which is [waiting][waitid.3p] on it will take some time in the `SIGCHLD` handler before adjusting the [state](#state).
+A runtime that detects transitions by polling the kernel may trail the associated kernel transition by an even longer period.
+So the `timestamp` and event publication may not exactly match the associated kernel transition, but they MUST match the [state](#state) transition.
+
 ### Create
 
 `create `
@@ -131,3 +158,7 @@ Once a container is deleted its ID MAY be used by a subsequent container.
 ## Hooks
 Many of the operations specified in this specification have "hooks" that allow for additional actions to be taken before or after each operation.
 See [runtime configuration for hooks](./config.md#hooks) for more information.
+
+[rfc3339-s5.6]: https://tools.ietf.org/html/rfc3339#section-5.6
+
+[waitid.3p]: http://pubs.opengroup.org/onlinepubs/9699919799/functions/waitid.html

From fae7e995ec290fad32e165b6434f6116bebab384 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Thu, 23 Jun 2016 16:15:10 -0700
Subject: [PATCH 5/5] runtime: Support event buffering

This is all very generic, and I expect more details to land in the
runtime API specification. Requirements around the buffer-request
semantics (e.g. "buffer for 15 seconds" or "buffer the last 5 events"
or "buffer the 'created' event") seemed out of place at the level of
detail in this specification. The goal is to allow for:

  $ funC create --event-buffer created ID &
  $ funC event --event created ID && hook1 && hook2 && funC start ID
  $ fg

To support blocking on the event without racing on "maybe the
'created' event happened before the 'event' operation attached".
Given the small number of required events, this buffering should not
be a large resource concern, and '--event-buffer created' would only
ever require a single event to be buffered per container.

There is still a race on "maybe the container has already been
destroyed and a second container has been created with the same ID
before the 'event' operation attached", but that seems much less
likely (especially since the caller is free to pick UUIDs).

Signed-off-by: W. Trevor King
---
 runtime.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/runtime.md b/runtime.md
index 2e4e5121a..411b4e275 100644
--- a/runtime.md
+++ b/runtime.md
@@ -89,6 +89,8 @@ This operation MUST return the state of a container as specified in the [State](
 This operation MUST generate an error if it is not provided the ID of a container.
 Attempting to query a container that does not exist MUST generate an error.
 This operation MUST subscribe the caller to push-notification about future events in chronological order.
+If the container's [create operation](#create) requested an event buffer, the buffered events MUST be published in chronological order before any future events are published.
+If the runtime could not perform the requested buffering, it MUST generate an error.
 Events MUST be published when the container's [`status`](#state) changes, and MAY be published for additional events.
 Events MUST include, at least, the following properties:
 
@@ -111,7 +113,7 @@ So the `timestamp` and event publication may not exactly match the associated ke
 
 ### Create
 
-`create `
+`create `
 
 This operation MUST generate an error if it is not provided a path to the bundle and the container ID to associate with the container.
 If the ID provided is not unique across all containers within the scope of the runtime, or is not valid in any other way, the implementation MUST generate an error and a new container MUST not be created.
@@ -125,6 +127,8 @@ Runtime callers who are interested in pre-create validation can run [bundle-vali
 
 Any changes made to the [`config.json`](config.md) file after this operation will not have an effect on the container.
 
+Runtime callers MAY request an event buffer, in which case the runtime MUST buffer events associated with the container.
+
 ### Start
 
 `start `