Skip to content

container backend: container.lifetime schema, setup_script provisioning, and persistent mode#195

Merged
atnair-amd merged 20 commits into
mainfrom
atnair/container-lifetime
Jun 10, 2026
Merged

container backend: container.lifetime schema, setup_script provisioning, and persistent mode#195
atnair-amd merged 20 commits into
mainfrom
atnair/container-lifetime

Conversation

@atnair-amd

Copy link
Copy Markdown
Collaborator

Summary

Reworks the container execution backend around a single container lifecycle policy and adds in-container provisioning + a persistent (install-then-run) mode.

  • Schema: replaces the two-axis container.enabled + container.launch with one tri-valued container.lifetime (external / per_run / persistent). Both legacy keys now hard-error with a migration message (no silent mapping), so a stale flag can't override an explicit lifetime.
  • Provisioning: adds container.setup_script, run inside each freshly-launched container before sshd setup (base64-delivered over the existing docker exec channel, with set -o pipefail and a size guard). The packaged default installs openssh-server, so the base image no longer needs sshd baked in.
  • Persistent mode: setup_containers attaches when the container runs on every host, cold-starts when it runs on none, and hard-fails on a partial set (relaunching would force-remove and wipe the still-running hosts' overlay). Includes a per-host image-SHA consistency check (cross-host skew or unreadable SHA is a hard error) and an idempotent setup_sshd (skips when sshd is already on 2224). Teardown is a no-op, enabling cvs run install_rvs then cvs run rvs_cvs in separate invocations.
  • Launch pull: the container-start docker run timeout is raised 60s -> 900s (CONTAINER_START_TIMEOUT_S) so the orchestrator can pull a multi-GB image on a cold node instead of timing out.
  • Docs (cluster-file, run-with-containers, cluster_file README) and cluster_container.json sample updated; enabled/launch references removed.

Test plan

  • Unit: test_container.py, test_factory.py, test_docker.py (lifetime resolution, per-lifetime setup/teardown, persistent attach/partial/skew/unreadable-SHA, provisioning dispatch + size guard + pipefail payload, sshd pgrep self-match) — 72 pass.
  • Lint: ruff check clean on changed files.
  • End-to-end on an MI300X node (container orchestrator, lifetime: persistent):
    • install_rvs cold-starts a container, provisions sshd into a no-sshd TheRock image, installs RVS into the overlay; container survives teardown.
    • rvs_cvs attaches to the same container (verified identical container ID + StartedAt, no relaunch), skips sshd setup, and runs rvs from the persisted overlay.
    • Verified the launch path pulls a fresh 61GB image on a cold node (~9 min) within the new 900s timeout (the old 60s timeout aborted the pull).

atnair-amd added 17 commits May 29, 2026 13:59
…er.lifetime

Add _resolve_container_lifetime() and invoke it from OrchestratorConfig.__init__,
the single chokepoint that from_configs routes through. container.enabled is
removed (hard ValueError if present); container.launch becomes a deprecated alias
(true->per_run, false->external) emitting a DeprecationWarning. Default lifetime
is per_run. Docstrings rewritten to the new schema.
DockerRuntime no longer interprets container.launch now that lifetime resolution
lives in OrchestratorConfig. Add image_sha_status() to compare a running
container's image SHA against the local image tag per host (used by the
persistent attach path), and mirror it in the ContainerRuntime protocol and the
Enroot stub.
setup_containers verifies-only for external, launches for per_run, and
attaches-or-launches for persistent (with a per-host image-SHA check; cross-host
SHA skew is a hard error). teardown is a no-op for external/persistent and
force-removes for per_run. Extract the launch path into _launch_containers(); add
a pgrep precheck so setup_sshd is idempotent on persistent re-runs. Drop the dead
container_enabled attribute.
…sample

Replace the enabled/launch keys (and their comments) with a single
lifetime: per_run plus a comment describing the three policies.
…behavior

Migrate the container/factory/docker unit tests off enabled+launch. Add
_resolve_container_lifetime table cases (enabled->ValueError, launch->
DeprecationWarning, invalid->ValueError, defaults), persistent attach/launch/
idempotency, cross-host SHA skew, stale-overlay warn, and setup_sshd idempotency.
Rewrite the cluster-file README/RST and the run-with-containers how-to around the
lifetime truth table (external/per_run/persistent), add a persistent attach
sequence diagram, drop the removed enabled/stale-name pitfalls, and add the
pin-container.name guidance for persistent.
Minimal apt script run inside a freshly-launched container to install
openssh-server, so CVS's in-container sshd can start on base images that
do not pre-ship it. Short-circuits if sshd is already present.
Add _resolve_container_setup_script alongside the lifetime resolver: a
user-supplied path is made absolute and must exist (ValueError at config
load otherwise); when absent/null/empty it falls back to the packaged
default, which is itself existence-checked so a broken install fails fast
rather than as an OSError mid-run.
After a fresh launch (per_run, or persistent cold-start), run the resolved
setup_script inside each container via docker exec before setup_sshd, so
base images lacking sshd/packages are brought up to spec. external and
persistent-attach skip provisioning (idempotent across runs). The script is
shipped base64-encoded inline; oversized scripts and read errors fail with a
clear message, and every failing host's output is logged.
…arent shell

pgrep -f 'sshd.*2224' matched its own parent shell, whose argv contains the
pattern, so the precheck always reported sshd already running and skipped
starting it (and the post-start validation always passed). Use the [s]shd
character-class trick at both sites so they match the real daemon only.
Add the optional setup_script key to cluster_container.json and the schema
README, and drop the requirement that the image pre-ship openssh-server
(now installed at launch by the default setup_script). Note the inline
delivery constraints (bash/base64 in the image, ~16 KB raw cap) and that
null/omit both mean "use the packaged default".
Update the how-to and cluster-file reference: add the setup_script field,
describe the launch-time provisioning step, and soften the "image must
contain openssh-server" prerequisite/pitfall.
Test _resolve_container_setup_script: falsy (absent/null/empty) -> packaged
default, missing user path raises, relative/tilde paths resolve to abspath,
the packaged default is present, and a missing default raises.
…guard

Add provisioning coverage (fresh-launch dispatch matrix, size guard, payload
bytes, read/oversize failures, all-host failure logging) and pin the [s]shd
precheck/validation pattern. Restructure with setUp-based patching and
subTest tables in place of the per-method @patch boilerplate.
Post-review hardening of the container persistent-lifetime feature:

- container: branch persistent setup_containers on per-host running state.
  Attach when the container runs on every host, cold-start when it runs on
  no host, and hard-fail (no relaunch) when it runs on some hosts but not
  all -- relaunching force-removes the still-running hosts and destroys
  their overlay, the opposite of what persistent promises.
- container: add `set -o pipefail` to the inline setup_script delivery so a
  missing or failing base64 in the image fails loudly instead of silently
  no-opping and later surfacing as an opaque sshd startup failure.
- container/runtimes: fail the persistent image-SHA check when a host's SHA
  is unreadable (previously a silent pass), and drop the always-zero
  exit_code from image_sha_status and the ContainerRuntime protocol since
  the wrapping echo makes it meaningless.
- factory: remove the container.launch deprecation mapping. launch now
  hard-errors like the already-removed enabled, so a stale launch flag can
  never silently override an explicit lifetime.
- runtimes/docker: raise the container-start timeout from 60s to 900s
  (CONTAINER_START_TIMEOUT_S) so the launch `docker run` can pull a
  multi-GB image on a cold node instead of timing out at one minute.
- unittests + cluster-file docs: updated for the new contracts.
`make test` runs `ruff format --check` over the tree. Collapse the
multi-line calls/raises that fit the configured line length in
container.py and factory.py (from the previous commit) and in
test_factory.py (pre-existing drift). Formatting only, no behavior change.
Rename the verify-only container lifetime value from 'external' to
'no_launch', which describes the behavior (CVS never launches the
container) rather than an ownership model. Updates the validation
tuple, branch checks, removed-field error messages, unit tests, the
cluster_container.json sample comment, and the cluster-file docs /
README truth tables. The feature is unreleased so no alias is kept.
@atnair-amd atnair-amd self-assigned this Jun 1, 2026
@atnair-amd atnair-amd requested a review from cijohnson June 3, 2026 00:27
Comment thread cvs/core/orchestrators/factory.py Outdated
self.password = kwargs.get('password')
self.head_node_dict = kwargs.get('head_node_dict', {})
self.container = kwargs.get('container', {})
# Normalize the container block. This is the single chokepoint:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this comment?

@atnair-amd atnair-amd Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

OrchestratorConfig -- so neither key is present here.
"""
cfg = {
"enabled": True,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont see lifetime arg

@atnair-amd atnair-amd Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added lifetime arg

Comment thread cvs/core/runtimes/docker.py Outdated
}
return out

def image_sha_status(self, container_name, image_name):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see i used only in unittests, do we need this?

@atnair-amd atnair-amd Jun 10, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@atnair-amd atnair-amd force-pushed the atnair/container-lifetime branch from 8d6b907 to c3e3b2c Compare June 9, 2026 22:57
Collapse the two-axis container config (enabled + launch) into a single
tri-valued container.lifetime: no_launch, per_run (default), persistent.

- factory: resolve/validate container.lifetime; enabled and launch are
  removed fields that hard-error with migration guidance.
- container: branch setup_containers/teardown_containers on lifetime.
  persistent attaches when the container runs on every host (with a
  per-host image-SHA check), cold-starts when it runs on no host, and
  hard-fails when it runs on some hosts but not all (relaunching would
  force-remove and rebuild the still-running hosts, destroying their
  overlay).
- runtimes: add image_sha_status to the ContainerRuntime protocol and
  the docker/enroot backends; drop the launch short-circuit so the
  runtime always renders docker run when invoked.
- cluster file sample, README, and docs: document the lifetime truth
  table (replacing enabled/launch).
- unittests: cover lifetime resolution and per-lifetime setup/teardown.
@atnair-amd atnair-amd force-pushed the atnair/container-lifetime branch from c3e3b2c to ec87e4f Compare June 10, 2026 02:46
@atnair-amd atnair-amd merged commit 12f330a into main Jun 10, 2026
2 checks passed
amd-droy pushed a commit that referenced this pull request Jun 25, 2026
…ng, and persistent mode (#195)

* orchestrators/factory: collapse container enabled+launch into container.lifetime

Add _resolve_container_lifetime() and invoke it from OrchestratorConfig.__init__,
the single chokepoint that from_configs routes through. container.enabled is
removed (hard ValueError if present); container.launch becomes a deprecated alias
(true->per_run, false->external) emitting a DeprecationWarning. Default lifetime
is per_run. Docstrings rewritten to the new schema.

* runtimes: drop launch short-circuit, add image_sha_status

DockerRuntime no longer interprets container.launch now that lifetime resolution
lives in OrchestratorConfig. Add image_sha_status() to compare a running
container's image SHA against the local image tag per host (used by the
persistent attach path), and mirror it in the ContainerRuntime protocol and the
Enroot stub.

* orchestrators/container: branch setup/teardown on container.lifetime

setup_containers verifies-only for external, launches for per_run, and
attaches-or-launches for persistent (with a per-host image-SHA check; cross-host
SHA skew is a hard error). teardown is a no-op for external/persistent and
force-removes for per_run. Extract the launch path into _launch_containers(); add
a pgrep precheck so setup_sshd is idempotent on persistent re-runs. Drop the dead
container_enabled attribute.

* input/cluster_file: use container.lifetime in cluster_container.json sample

Replace the enabled/launch keys (and their comments) with a single
lifetime: per_run plus a comment describing the three policies.

* core/unittests: cover container.lifetime resolution and per-lifetime behavior

Migrate the container/factory/docker unit tests off enabled+launch. Add
_resolve_container_lifetime table cases (enabled->ValueError, launch->
DeprecationWarning, invalid->ValueError, defaults), persistent attach/launch/
idempotency, cross-host SHA skew, stale-overlay warn, and setup_sshd idempotency.

* docs: document container.lifetime schema (replace enabled/launch)

Rewrite the cluster-file README/RST and the run-with-containers how-to around the
lifetime truth table (external/per_run/persistent), add a persistent attach
sequence diagram, drop the removed enabled/stale-name pitfalls, and add the
pin-container.name guidance for persistent.

* orchestrators/scripts: add default container provisioning script

Minimal apt script run inside a freshly-launched container to install
openssh-server, so CVS's in-container sshd can start on base images that
do not pre-ship it. Short-circuits if sshd is already present.

* orchestrators/factory: resolve and validate container.setup_script

Add _resolve_container_setup_script alongside the lifetime resolver: a
user-supplied path is made absolute and must exist (ValueError at config
load otherwise); when absent/null/empty it falls back to the packaged
default, which is itself existence-checked so a broken install fails fast
rather than as an OSError mid-run.

* orchestrators/container: provision launched containers via setup_script

After a fresh launch (per_run, or persistent cold-start), run the resolved
setup_script inside each container via docker exec before setup_sshd, so
base images lacking sshd/packages are brought up to spec. external and
persistent-attach skip provisioning (idempotent across runs). The script is
shipped base64-encoded inline; oversized scripts and read errors fail with a
clear message, and every failing host's output is logged.

* orchestrators/container: fix setup_sshd pgrep precheck matching its parent shell

pgrep -f 'sshd.*2224' matched its own parent shell, whose argv contains the
pattern, so the precheck always reported sshd already running and skipped
starting it (and the post-start validation always passed). Use the [s]shd
character-class trick at both sites so they match the real daemon only.

* input/cluster_file: document and sample container.setup_script

Add the optional setup_script key to cluster_container.json and the schema
README, and drop the requirement that the image pre-ship openssh-server
(now installed at launch by the default setup_script). Note the inline
delivery constraints (bash/base64 in the image, ~16 KB raw cap) and that
null/omit both mean "use the packaged default".

* docs: document container.setup_script in cluster-file references

Update the how-to and cluster-file reference: add the setup_script field,
describe the launch-time provisioning step, and soften the "image must
contain openssh-server" prerequisite/pitfall.

* orchestrators/unittests: cover container.setup_script resolution

Test _resolve_container_setup_script: falsy (absent/null/empty) -> packaged
default, missing user path raises, relative/tilde paths resolve to abspath,
the packaged default is present, and a missing default raises.

* orchestrators/unittests: cover container provisioning and sshd pgrep guard

Add provisioning coverage (fresh-launch dispatch matrix, size guard, payload
bytes, read/oversize failures, all-host failure logging) and pin the [s]shd
precheck/validation pattern. Restructure with setUp-based patching and
subTest tables in place of the per-method @patch boilerplate.

* orchestrators/runtimes: harden container persistent lifetime

Post-review hardening of the container persistent-lifetime feature:

- container: branch persistent setup_containers on per-host running state.
  Attach when the container runs on every host, cold-start when it runs on
  no host, and hard-fail (no relaunch) when it runs on some hosts but not
  all -- relaunching force-removes the still-running hosts and destroys
  their overlay, the opposite of what persistent promises.
- container: add `set -o pipefail` to the inline setup_script delivery so a
  missing or failing base64 in the image fails loudly instead of silently
  no-opping and later surfacing as an opaque sshd startup failure.
- container/runtimes: fail the persistent image-SHA check when a host's SHA
  is unreadable (previously a silent pass), and drop the always-zero
  exit_code from image_sha_status and the ContainerRuntime protocol since
  the wrapping echo makes it meaningless.
- factory: remove the container.launch deprecation mapping. launch now
  hard-errors like the already-removed enabled, so a stale launch flag can
  never silently override an explicit lifetime.
- runtimes/docker: raise the container-start timeout from 60s to 900s
  (CONTAINER_START_TIMEOUT_S) so the launch `docker run` can pull a
  multi-GB image on a cold node instead of timing out at one minute.
- unittests + cluster-file docs: updated for the new contracts.

* orchestrators: apply ruff format to satisfy the fmt-check gate

`make test` runs `ruff format --check` over the tree. Collapse the
multi-line calls/raises that fit the configured line length in
container.py and factory.py (from the previous commit) and in
test_factory.py (pre-existing drift). Formatting only, no behavior change.

* orchestrators: rename container lifetime 'external' to 'no_launch'

Rename the verify-only container lifetime value from 'external' to
'no_launch', which describes the behavior (CVS never launches the
container) rather than an ownership model. Updates the validation
tuple, branch checks, removed-field error messages, unit tests, the
cluster_container.json sample comment, and the cluster-file docs /
README truth tables. The feature is unreleased so no alias is kept.

* orchestrators: replace container enabled/launch with container.lifetime

Collapse the two-axis container config (enabled + launch) into a single
tri-valued container.lifetime: no_launch, per_run (default), persistent.

- factory: resolve/validate container.lifetime; enabled and launch are
  removed fields that hard-error with migration guidance.
- container: branch setup_containers/teardown_containers on lifetime.
  persistent attaches when the container runs on every host (with a
  per-host image-SHA check), cold-starts when it runs on no host, and
  hard-fails when it runs on some hosts but not all (relaunching would
  force-remove and rebuild the still-running hosts, destroying their
  overlay).
- runtimes: add image_sha_status to the ContainerRuntime protocol and
  the docker/enroot backends; drop the launch short-circuit so the
  runtime always renders docker run when invoked.
- cluster file sample, README, and docs: document the lifetime truth
  table (replacing enabled/launch).
- unittests: cover lifetime resolution and per-lifetime setup/teardown.

* address review: trim normalize comment, carry lifetime in runtime test config

* tests: drop redundant persistent-idempotent test, strip dead warning guard
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants