General agent behavior lives in .cursor/rules/karpathy-guidelines.mdc (always applied).
Python conventions live in .cursor/rules/python-standards.mdc (auto-applied to *.py).
This file documents project-specific context an agent can't grep for.
NVIDIA AI Cloud Validation Suite - validation and management tools for NVIDIA ISV Lab GPU cluster environments. Monorepo with three Python packages managed as a uv workspace:
- isvctl - CLI controller for cluster lifecycle (setup → test → teardown)
- isvtest - Validation framework engine (pytest-based with custom discovery)
- isvreporter - Test results reporter for ISV Lab Service API
uv sync # install workspace
make build # build all packages
make test # run tests
make demo-test # run all my-isv configs end-to-end (ISVCTL_DEMO_MODE=1, ~10s, no cloud)
make lint # ruff
make format # ruff format
make plan # render docs/test-plan.yaml to AsciiDoc + interactive HTML
uv run isvctl test run -f isvctl/configs/suites/k8s.yaml # canonical invocation
uv run isvctl test run -f config.yaml -- -v -s -k "test_name" # forward pytest args

The framework separates doing from checking:
Config (YAML) → Script (any language) → JSON output → Validations (assertions)
- Scripts (Python, Bash, ...) perform cloud operations and print structured JSON to stdout.
- Validations are simple assertions over that JSON - no cloud SDK code in validations.
- Validations reference step output via Jinja2: `"{{steps.create_network.vpc_id}}"`, `"{{region}}"`. The orchestrator warns when a template references a missing step or field (catches `ChainableUndefined` silent fallbacks).
- Test stdout JSON is the provider-neutral contract between scripts and validations. Use ISV-agnostic names and avoid AWS-specific resource concepts unless a validation or later step consumes them.
- Keep output minimal: `success`, `platform`, `test_name`, and `tests.<check>.passed/message/probes` are usually enough. Omit IDs, regions, endpoint inventories, and other fields that do not affect behavior.
- Failure/skip diagnostics are allowed, but keep them concise and generic: top-level `error`/`error_type`, `tests.<check>.error`, `skip_reason`, or `cleanup_errors`. Avoid raw provider responses and resource dumps.
- Phases run in order: setup → test → teardown.
- Teardown runs after setup/test failures by default so cloud resources get cleaned up, but it is skipped when `teardown_on_failure` is disabled, or when setup was requested in the same invocation but no setup steps actually ran.
- Teardown is best-effort: one failing teardown step does not block the others.
- Standalone teardown (`isvctl test run -f config.yaml --phase teardown`) runs unconditionally, which is useful after a previous run with `AWS_SKIP_TEARDOWN`.
- Multiple `-f` configs merge; later files override earlier ones.

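
The split between doing and checking can be illustrated with a minimal sketch. It assumes a script that prints the minimal JSON fields listed above and a validation that only reads that JSON. The functions `run_check` and `validate`, the check name `connectivity`, and the `platform` value are hypothetical placeholders, not names from this repository.

```python
import json


def run_check() -> dict:
    """Stand-in for a provider script: do the work, return the stdout payload."""
    # A real script would call the provider here; this sketch hardcodes a pass.
    return {
        "success": True,
        "platform": "network",  # placeholder value, assumed for illustration
        "test_name": "connectivity",
        "tests": {
            "connectivity": {"passed": True, "message": "reachable"},
        },
    }


def validate(stdout: str) -> list[str]:
    """Stand-in for a validation: assert over the JSON only, no cloud SDK calls."""
    payload = json.loads(stdout)
    failures = []
    if not payload.get("success"):
        failures.append(payload.get("error", "script reported failure"))
    for name, result in payload.get("tests", {}).items():
        if not result.get("passed"):
            failures.append(f"{name}: {result.get('message', 'failed')}")
    return failures


# The script side prints JSON to stdout; the validation side consumes that text.
stdout = json.dumps(run_check())
print(stdout)
print(validate(stdout))
```

Because the validation only sees the JSON text, the same check can run against any provider whose script emits the same fields.
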
Entry point: isvctl/src/isvctl/main.py (Typer).
- `cli/` - subcommands (`test`, `deploy`, `clean`, `docs`, `report`)
- `orchestrator/` - `loop.py` (phase loop), `step_executor.py` (step + validation execution, supports `best_effort` mode), `commands.py` (timeouts), `context.py` (Jinja2 with missing-reference warnings)
- `config/` - `schema.py` (Pydantic), `output_schemas.py` (per-step JSON schemas), `merger.py` (multi-file merge)
- `remote/` - `ssh.py` (with jumphost), `archive.py`, `transfer.py` (SCP via jumphost proxy)
- `cleaner/` - resource cleanup

Entry point: isvtest/src/isvtest/main.py.
run_validations_via_pytest() is the bridge isvctl calls. It transforms validation
configs to pytest format, runs native pytest, and returns rich in-memory results
(category, message) alongside the exit code.

- `core/validation.py` - `BaseValidation` abstract class
- `core/discovery.py` - finds `BaseValidation` subclasses and ReFrame tests
- `core/runners.py` - `LocalRunner`, `SlurmRunner`, ...
- `core/{k8s,slurm,nvidia,ngc,workload}.py` - domain helpers

Validation classes live in isvtest/src/isvtest/validations/ grouped by domain
(generic.py, cluster.py, instance.py, network.py, iam.py, security.py,
host.py, k8s_*.py, slurm_*.py, bm_*.py). Each subclass is auto-discovered.
Filtering labels live on the YAML wiring (labels: [...] per check in the suite/
provider configs), not on the class; the catalog, pytest marks, isvctl docs,
and the orchestrator's include/exclude-label filtering all read them from there.
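
As a rough illustration of label-based filtering over the YAML wiring, the sketch below selects checks from an already-parsed config mapping. The function `select_checks`, the check names, and the any-match include semantics are assumptions made for this example; the orchestrator's actual rules may differ.

```python
# Hypothetical parsed wiring: check name -> config with a `labels` list.
CHECKS = {
    "vpc_exists": {"labels": ["network"]},
    "nccl_bandwidth": {"labels": ["workload", "slow"]},
    "iam_role": {"labels": ["iam"]},
}


def select_checks(checks, include=None, exclude=None):
    """Return check names that match the include set and avoid the exclude set."""
    include = set(include or [])
    exclude = set(exclude or [])
    selected = []
    for name, cfg in checks.items():
        labels = set(cfg.get("labels", []))
        # Assumed semantics: an empty include set means "everything".
        if include and not (labels & include):
            continue
        # Assumed semantics: any excluded label drops the check.
        if labels & exclude:
            continue
        selected.append(name)
    return selected


print(select_checks(CHECKS, include=["network"]))
print(select_checks(CHECKS, exclude=["slow"]))
```
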
Workloads (isvtest/src/isvtest/workloads/) are long-running tests (NIM, NCCL,
stress) labelled ("workload", "slow", ...) with manifests and helper scripts
colocated.
Test config loaded from YAML/JSON via config/loader.py. Global fixtures in
tests/conftest.py. tests/test_validations.py dynamically generates pytest tests
from BaseValidation classes.
Entry point: isvreporter/src/isvreporter/main.py (Typer).
- `client.py` - ISV Lab Service API client
- `auth.py` - OAuth2
- `junit_parser.py` - pytest JUnit XML parsing
- `platform.py` - platform detection

isvctl deploy run → tarball repo (remote/archive.py) → SCP through optional
jumphost (remote/transfer.py) → install.sh on target → isvctl test run with
forwarded env vars → optional isvreporter upload.
- `isvtest/src/isvtest/released_tests.json` - release-gating manifest owned by the release process (bumped via `chore: update package versions`). New checks ship unreleased and land here in a separate release commit, not in feature PRs. To exercise an unreleased check end-to-end against a config, run with `ISVTEST_INCLUDE_UNRELEASED=1` (the orchestrator otherwise logs `Skipping unreleased validation '<Name>'` and the new check is a no-op).
- Workspace root `pyproject.toml` defines members; each package has its own `pyproject.toml`; all source under `src/`.
- `isvctl/configs/suites/` - provider-agnostic test contracts.
- `isvctl/configs/providers/<name>/` - one folder per provider (`aws/`, `my-isv/`, ...):
  - `config/` - YAML wiring (imports a suite, supplies commands)
  - `scripts/` - executable scripts (Python/Bash) that do the work, organized by domain (`network/`, `vm/`, `iam/`, `k8s/`, ...)
  - `scripts/common/` - provider-local Python helpers, imported via a single `sys.path.insert(0, Path(__file__).resolve().parents[1])` per script
- `isvctl/configs/providers/shared/` - cross-provider scripts (`deploy_nim.py`, `teardown_nim.py`).
- `isvctl/schemas/` - JSON Schema files.
- `my-isv/` - scaffold for ISVs to copy. Each script has a TODO block and a `DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1"` gate: real run returns `"Not implemented - ..."`; demo mode returns dummy success. `make demo-test` sets `ISVCTL_DEMO_MODE=1`. See `providers/my-isv/scripts/README.md`.
- `aws/` - fully implemented reference using boto3/Terraform. `aws/scripts/common/` provides `ec2`, `errors` (with `delete_with_retry`), `ssh_utils.wait_for_ssh`, `serial_console`, `vpc`.

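
The demo-mode gate described for the `my-isv/` scaffold might look roughly like the sketch below. Only the `DEMO_MODE` line comes from the text above; the function `create_network`, the `demo_mode` parameter, and the exact output fields are hypothetical, and the real scaffold scripts may be organized differently.

```python
import json
import os

# Gate quoted from the scaffold description: demo mode is on only when the
# variable is exactly "1".
DEMO_MODE = os.environ.get("ISVCTL_DEMO_MODE") == "1"


def create_network(demo_mode: bool = DEMO_MODE) -> dict:
    """Hypothetical scaffold step: dummy success in demo mode, stub otherwise."""
    if demo_mode:
        return {"success": True, "platform": "network", "test_name": "create_network"}
    # TODO: replace this stub with the provider-specific call.
    return {
        "success": False,
        "platform": "network",
        "test_name": "create_network",
        "error": "Not implemented - ...",
    }


# Scripts report their result as structured JSON on stdout.
print(json.dumps(create_network()))
```

Passing `demo_mode` explicitly is a convenience for this sketch so both branches can be exercised without changing the environment.
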
| Variable | Description | Used by |
|---|---|---|
| `ISV_SERVICE_ENDPOINT` | ISV Lab Service API endpoint | isvreporter |
| `ISV_SSA_ISSUER` | ISV Lab Service SSA issuer | isvreporter |
| `ISV_CLIENT_ID` / `ISV_CLIENT_SECRET` | ISV Lab Service credentials | isvreporter |
| `NGC_API_KEY` | NGC key for NIM workloads / container registry | isvtest, isvctl |
| `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` / `AWS_REGION` | AWS auth | AWS scripts |
| `KUBECTL` | Optional kubectl-compatible CLI prefix (POSIX shlex in Python, word-split in shell; overrides `K8S_PROVIDER` detection) | isvtest `get_kubectl_command`, isvctl k8s scripts |
| `ISVCTL_DEMO_MODE` | `"1"` makes my-isv scripts return dummy success | scripts |
| `AWS_SKIP_TEARDOWN` | Skip teardown phase (run later with `--phase teardown`) | AWS configs |
| `ISVCTL_CONFIG` / `ISVCTL_SECRETS` | Override the `config.yml` / `secrets.yml` paths (default `${XDG_CONFIG_HOME:-~/.config}/isvctl/`) | `isvctl configure`, `test`, `doctor` |

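
To make the `KUBECTL` prefix handling concrete, here is a minimal sketch of POSIX-style splitting with the standard library's `shlex`. The helper `kubectl_argv` is a hypothetical stand-in, not the real `get_kubectl_command`; it assumes a plain `kubectl` fallback and leaves out `K8S_PROVIDER` detection entirely.

```python
import os
import shlex


def kubectl_argv(*args, env=None):
    """Build an argv list from an optional KUBECTL prefix plus subcommand args."""
    env = os.environ if env is None else env
    # shlex.split honors quotes, so a prefix may contain several words.
    # An unset or empty KUBECTL falls back to plain "kubectl" (assumed default).
    prefix = shlex.split(env.get("KUBECTL", "")) or ["kubectl"]
    return prefix + list(args)


print(kubectl_argv("get", "pods", env={"KUBECTL": "microk8s kubectl"}))
print(kubectl_argv("get", "pods", env={}))
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without a shell.
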
isvctl configure persists env vars so users don't re-export them per shell.
On disk they are grouped into provider-namespaced sections (nico.api_base ⇆
NICO_API_BASE); non-secret values go in config.yml (0644), secrets in
secrets.yml (0600). The variable catalog and the section/prefix mapping live
in isvctl/config/env_catalog.py (shared with doctor); the section⇆env-name
translation is a serialization detail in config/user.py (the public API stays
keyed by env var name). Both files carry a top-level version: (the on-disk
schema version, SCHEMA_VERSION in config/user.py); a missing version reads as
the initial 1, and a version newer than the build understands is rejected with
a clear "upgrade isvctl" error rather than mis-parsed. Secret-vs-non-secret
routing reuses redaction.is_secret_env_var. The "Flags" group is non-persistable.
test run, test validate, and doctor apply both files (unless
--no-user-config) via cli/common.apply_user_config, and an already-exported
var always wins (process env > files > defaults).
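
The precedence rule and the version guard can be sketched as follows. This is an illustration under stated assumptions, not the code in `config/user.py` or `cli/common.py`: the names `SUPPORTED_VERSION`, `check_version`, and `resolve`, the variable `EXAMPLE_VAR`, and the error wording are all hypothetical.

```python
SUPPORTED_VERSION = 1  # hypothetical stand-in for the build's schema version


def check_version(doc: dict) -> int:
    """Treat a missing version as 1 and reject versions newer than supported."""
    version = doc.get("version", 1)
    if version > SUPPORTED_VERSION:
        raise ValueError(
            f"config version {version} is newer than supported "
            f"{SUPPORTED_VERSION}; upgrade isvctl"
        )
    return version


def resolve(name: str, process_env: dict, file_values: dict, defaults: dict):
    """Look a variable up in precedence order: process env > files > defaults."""
    for source in (process_env, file_values, defaults):
        if name in source:
            return source[name]
    return None


# An exported variable wins over a persisted one.
print(resolve("EXAMPLE_VAR", {"EXAMPLE_VAR": "from-env"}, {"EXAMPLE_VAR": "from-file"}, {}))
# With nothing exported, the persisted value wins over the default.
print(resolve("EXAMPLE_VAR", {}, {"EXAMPLE_VAR": "from-file"}, {"EXAMPLE_VAR": "default"}))
print(check_version({}))
```
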
- uv is installed via `pip install uv` (the `~/.local/bin` path must be on `PATH`).
- `uv sync` from the workspace root is the only install step; it creates `.venv/` with all three packages in editable mode.
- The `uv-build` version warning during `uv sync` / `make build` is cosmetic and does not affect functionality.
- `make demo-test` is the best quick E2E smoke test: it runs all `my-isv` provider configs in demo mode (~8 s, no cloud credentials needed).
- For unit tests, `make test` runs all three packages plus `scripts/tests/`.
- For linting, `make lint` uses `uvx` to run a pinned ruff version (no global install required).
- DCO sign-off required: All commits must include a `Signed-off-by` line (enforced by the DCO Probot check on PRs). Use `git commit --signoff` or `git commit -s` for every commit.
- Pre-commit checks: Run `uvx pre-commit run -a` before committing to catch formatting, linting, SPDX header, and link issues early.
- No external services (databases, containers, clusters) are needed for local development or testing. All cloud-dependent tests require explicit credentials (`AWS_*`, `NGC_API_KEY`, etc.) and are skipped in demo mode.