The execution substrate for distributed AI engineering on Azure.
One engineer, a fleet of safely-parallel sandboxes, 100× the work.
Status: alpha · internal · single-region eastus2
A working Azure deployment of a Kubernetes-native, Kata-isolated sandbox runtime where
engineers and their agents run code they want to be able to throw away. The stack is wired
end-to-end: any LLM agent or developer SDK can call Sandbox.create, get back a running
Kata-isolated pod, execute code, and read results. Kimi K2.5 in Microsoft Foundry is one
worked example; the path generalises to GPT, Claude, or any other model the platform team
chooses to wire in.
Everything in this repo describes the Azure landing zone — the AKS cluster, Kata node pool, ACR Premium with private endpoint, Azure Firewall egress, Event Hubs/Stream Analytics audit pipeline, Workload Identity, ACA control plane, and the deployment IaC.
The bottleneck for distributed AI engineering at most enterprises is not the model. It's the execution substrate the agents run against. An AI agent that cannot reliably and safely execute its own generated code, in parallel, with the same identity and audit story as a human engineer, is a chatbot — not a teammate.
The project's premise is one sentence:
An engineer with a fleet of safely-parallel sandboxes can do 100× more distributed AI engineering than the same engineer typing into a single shell.
To make that real inside an enterprise, four properties have to hold simultaneously, and they're hard to get together — which is the gap this repo fills:
- Hard isolation, because blast radius is the bottleneck. Engineers run code they want to throw away: LLM-generated, half-baked, disposable, dozens of variants at once. On a shared-kernel container, one bad `pip install` or `rm -rf` poisons the AKS node and everyone else's work on it. The trust boundary here is Kata Containers: per-pod VM-grade isolation with its own guest kernel, not a Linux namespace. A misbehaving agent (or, in the rare case, a hostile one) earns an empty disposable VM and nothing else.
- Per-user identity, end to end. Every sandbox action traces to a real Entra ID identity via OBO → Workload Identity → projected SA token. No shared service-principal "platform" account erasing the user in audit logs. An agent acting on behalf of an engineer leaves the engineer's name on every API call, every egress flow, every secret fetch.
- Deny-by-default at every layer. Azure Policy in Deny mode, Cilium L7 FQDN allowlist on egress, Azure Firewall Premium as the network backstop, Kubernetes RBAC bound to Entra groups, ACR public access disabled with a private endpoint, signed images at admission. Anything not explicitly permitted is dropped, and the audit pipeline sees it drop.
- Reproducible IaC. Bicep for the landing zone, Helm for the runtime. The whole thing should rebuild from a fresh subscription with `az deployment sub create` plus a Helm install: no clicks, no tribal knowledge.
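The deny-by-default property can be illustrated with a minimal sketch of an FQDN allowlist check. The allowlist entries and the `is_egress_allowed` helper are hypothetical and exist only for illustration; in the deployment itself this decision is made by Cilium policy and Azure Firewall rules, not by Python code.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist for illustration only; the real list is defined in
# Cilium and Azure Firewall policy, not in application code.
ALLOWED_FQDNS = (
    "pypi.org",
    "*.pythonhosted.org",
    "proxy.golang.org",
)


def is_egress_allowed(host: str, allowlist=ALLOWED_FQDNS) -> bool:
    """Return True only when host matches an allowlist pattern.

    Anything that does not match is denied, including empty input.
    """
    # Normalise case and drop a trailing root dot ("pypi.org." -> "pypi.org").
    normalized = host.strip().rstrip(".").lower()
    if not normalized:
        return False
    return any(fnmatchcase(normalized, pattern) for pattern in allowlist)
```

The important design choice is the final `return`: there is no "else allow" branch, so a host that nobody thought about is dropped rather than permitted.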
A single engineer driving a fleet of these sandboxes can:
- Parallelise grunt work: run hundreds of small evaluations, dependency upgrades, security scans, doc generations, or migrations concurrently, each in its own Kata pod, each attributed to the engineer, each isolated from the others.
- Let agents iterate safely: give a Kimi-K2.5 / GPT / Claude agent a sandbox it can write to, break, and discard, without ever touching the engineer's laptop or the shared infra. The agent's blast radius is the inside of one VM.
- Treat code execution as a cloud primitive: `Sandbox.create()` is as cheap to call as `BlobClient.upload()`. Once the substrate is solid, the engineer stops thinking about where code runs and starts thinking about what runs.
- Keep the auditor happy: every command, every egress destination, every secret read flows into Event Hubs → Stream Analytics → Log Analytics in ≤60 s, attributed to a real user OID. There is no "AI used my creds and we don't know what it did" failure mode.
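The fan-out pattern behind "parallelise grunt work" might look roughly like the sketch below. It deliberately does not call the real sandbox SDK: `run_in_sandbox` is a hypothetical placeholder that only simulates work, and `fan_out`, `max_parallel`, and the result shape are assumptions made for illustration.

```python
import asyncio


async def run_in_sandbox(task_id: int, user_oid: str) -> dict:
    """Hypothetical stand-in for create / execute / read-results in one sandbox."""
    await asyncio.sleep(0)  # simulates the remote call; no real sandbox is used
    return {"task": task_id, "user_oid": user_oid, "status": "ok"}


async def fan_out(task_ids, user_oid: str, max_parallel: int = 8) -> list:
    """Run every task concurrently, with at most max_parallel in flight."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(task_id: int) -> dict:
        async with gate:
            return await run_in_sandbox(task_id, user_oid)

    # gather returns results in the same order as the submitted tasks.
    return await asyncio.gather(*(guarded(t) for t in task_ids))
```

A caller would run it with something like `asyncio.run(fan_out(range(200), user_oid="<engineer OID>"))`. The semaphore keeps the fleet size bounded, and every result carries the user OID so attribution survives the fan-out.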
This is the substrate. The agents, the evaluation harnesses, the migration scripts — those are what you build on top of it. The repo is the thing nobody wants to build twice.
The deployment spans two resource groups, rg-opensandbox-dev (cluster + control plane) and rg-opensandbox-demo (ACR), both in eastus2.
| Layer | Resource | Notes |
|---|---|---|
| Cluster | aks-opensandbox-dev | Kubernetes 1.34.7, Azure CNI Overlay + Cilium dataplane, ACNS + Hubble UI |
| System pool | 3 nodes (runc) | Sandbox controller, server, ingress, system addons |
| Kata pool | Kata Containers, kata-vm-isolation runtime class | Cloud Hypervisor (MSHV), inner-VM kernel 6.6.130.1-3.azl3 (Azure Linux 3) |
| Container registry | acropensandboxdemo7075 (ACR Premium) | Public access disabled, private endpoint pe-acr-opensandbox-dev (10.10.12.6), private DNS zone privatelink.azurecr.io |
| Egress firewall | afw-opensandbox-dev (Azure Firewall Premium) | Private IP 10.10.10.4, policy afwp-opensandbox-dev, two rule collection groups (rcg-aks-bootstrap p100, rcg-sandbox-egress p200), deny-all at p300 |
| Sandbox UDR | rt-snet-kata-dev | Forces 0.0.0.0/0 from snet-kata to the firewall |
| Audit pipeline | Event Hubs evhns-opensandbox-dev (LocalAuthDisabled) → Stream Analytics asa-opensandbox-audit-dev → blob stasadevse3bwihj3in4s/audit-fast | Event hub sandbox-audit-fast, 4 partitions; ASA uses system-assigned MI with EH Data Receiver + Storage Blob Data Contributor |
| Control plane (ACA) | acaenv-opensandbox-dev in snet-aca | 3 container apps |
| Foundry | aihubeastus26267492086 | Kimi-K2.5 + Kimi-K2.6 deployments |
| Workload identity | id-kimi-demo-dev | Federated to the demo namespace's service account |
| Key Vault | kv-opensandbox-dev | Private endpoint pe-kv-opensandbox-dev |
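The firewall row lists rule collection groups at priorities 100 and 200, with deny-all at 300. Azure Firewall processes rule collection groups in ascending priority number, so the ordering can be sketched as below. This is a simplified model under stated assumptions: the group names and priorities come from the table, while the FQDNs inside each group, the `RuleGroup` type, and the `evaluate` helper are placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleGroup:
    name: str
    priority: int      # lower number is evaluated first
    action: str        # "Allow" or "Deny"
    fqdns: frozenset   # an empty set means "match everything"


# Names and priorities mirror the table; the FQDNs are illustrative placeholders.
GROUPS = [
    RuleGroup("deny-all", 300, "Deny", frozenset()),
    RuleGroup("rcg-sandbox-egress", 200, "Allow",
              frozenset({"pypi.org", "proxy.golang.org"})),
    RuleGroup("rcg-aks-bootstrap", 100, "Allow",
              frozenset({"mcr.microsoft.com"})),
]


def evaluate(host: str, groups=GROUPS):
    """Return (action, group name) for the first group that matches host."""
    for group in sorted(groups, key=lambda g: g.priority):
        if not group.fqdns or host in group.fqdns:
            return group.action, group.name
    # Nothing matched and no catch-all was configured: still deny.
    return "Deny", "implicit"
```

Under this model `evaluate("example.com")` falls through both allow groups and lands on the deny-all entry, which is the backstop behaviour the table describes.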
Two call paths, both bottoming out in the same controller + Kata sandbox pod.
Path A — Laptop SDK (sdk_e2e.py)
================================
developer laptop AKS cluster (aks-opensandbox-dev)
+-----------------+ +------------------------------------------+
| | | |
| Sandbox.create | --HTTP--> | sandbox server (FastAPI) |
| Python SDK | kubectl | | |
| | port-forward | v creates BatchSandbox CR |
| api-key auth | :18080 | sandbox controller-manager (Go) |
+-----------------+ | | |
| v schedules pod onto Kata pool |
| +----------------------------------+ |
| | Sandbox pod | |
| | runtimeClassName: | |
| | kata-vm-isolation | |
| | | |
| | init: execd (v1.0.8, CRLF-fixed)| |
| | sidecar: execd daemon | |
| | user container: python:3.12 | |
| +----------------------------------+ |
+------------------------------------------+
|
v egress via UDR
Azure Firewall (allowlist)
|
v
pypi / npm / proxy.golang.org
Path B — Kimi agentic app (kimi_via_osb.py)
============================================
Kimi-K2.5 / K2.6 ----(AAD bearer)----> Microsoft Foundry (aihubeastus...)
^ |
| code in <code>...</code> | generated Python
| v
+----+-------------------------------------------------------------+
| kimi_via_osb.py — extracts code, hands to the sandbox SDK |
+------------------------------------------------------------------+
|
v (same path as A from here)
sandbox server -> controller -> Kata pod
|
v
python3 /tmp/kimi_code.py inside the sandbox
|
v
result returned to the agent
A deeper diagram with all eleven components, the VNet/subnet table, identity flow, and the egress data path lives in docs/ARCHITECTURE.md.
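The "extracts code" step in Path B could be sketched as follows. This assumes the model wraps its Python in `<code>...</code>` tags, as the diagram suggests; the actual parsing in kimi_via_osb.py may differ, and `extract_code_blocks` is a hypothetical helper name.

```python
import re

# Assumption: generated code arrives wrapped in <code>...</code> tags.
# DOTALL lets a block span several lines; the match is non-greedy so that
# two blocks in one reply are returned separately.
CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL | re.IGNORECASE)


def extract_code_blocks(reply: str) -> list[str]:
    """Return the stripped contents of every non-empty <code> block."""
    blocks = (match.strip() for match in CODE_TAG.findall(reply))
    return [block for block in blocks if block]
```

Each returned string is what would be written to a file inside the sandbox and executed there, so nothing the model generates runs on the caller's machine.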
These steps assume an operator with cluster-admin (or equivalent kubelogin) access to
aks-opensandbox-dev and a Python 3.11+ environment.
# 0. Auth + cluster context
az login
az aks get-credentials -g rg-opensandbox-dev -n aks-opensandbox-dev --overwrite-existing
# 1. Install the sandbox Python SDK
pip install opensandbox
# 2. Port-forward the sandbox server to localhost:18080
# (service exposes port 80 — NOT 8080)
kubectl -n opensandbox-system port-forward svc/opensandbox-server 18080:80 &
# 3. Drop the server API key into examples/ for the demo scripts to read
kubectl -n opensandbox-system get secret opensandbox-server -o jsonpath='{.data.OPENSANDBOX_SERVER_API_KEY}' \
| base64 -d > examples/.opensandbox-api-key
# 4a. Run the laptop SDK demo
python examples/sdk_e2e.py
# 4b. Run the Kimi agentic demo
# (requires az login with access to aihubeastus26267492086)
python examples/kimi_via_osb.py
# 4c. Or drive everything from the dev portal — cluster lifecycle, swarm runs,
# sandbox create, Kimi chat, chart-in-browser demo, observability.
# (uses the port-forward + key file from steps 2 & 3.)
cd apps/portal-api
uv sync && uv run uvicorn app.main:app --port 8090
# then open http://localhost:8090

| Path | Purpose |
|---|---|
| third_party/opensandbox/ | Third-party sandbox runtime, vendored. Do not edit; sync via the upstream-sync workflow. |
| infra/bicep/ | Subscription-scope Bicep for the Azure landing zone (cluster, ACR, firewall, audit). |
| infra/helm/opensandbox/ | Helm chart deploying the sandbox runtime images (controller, server, execd) with Azure-specific values. |
| apps/ | apps/control-plane/: initial FastAPI control plane on ACA. apps/portal-api/: dev portal (FastAPI, 24 routes) joining the in-cluster control plane, Kimi chat, swarm runner, cluster lifecycle, and observability into one local surface. apps/portal-frontend/dist/: Alpine.js single-page command center served by portal-api at http://localhost:8090. See apps/portal-api/README.md, docs/PORTAL-AUTH.md, and ROADMAP.md. |
| sdks/ | Azure-flavored SDK wrappers and examples. |
| examples/ | Runnable demos: laptop SDK, Kimi agentic app, hypothesis swarm. See docs/DEMO-HYPOTHESIS-SWARM.md. |
| docs/ | This documentation set. |
| runbooks/ | Ops runbooks: incident response, onboarding, CVE response, DR drill. |
- docs/ARCHITECTURE.md — full architecture deep-dive, VNet table, identity and egress flows, image supply chain, failure modes, the CRLF bootstrap story.
- docs/OPERATIONS.md — runbook index, cluster health checklist, image onboarding, API key rotation, execd rebuild and roll-out.
- docs/index.md — entry point linking to everything above.
- docs/acceptance-checklist.md — the 34 acceptance criteria for v1.
- ROADMAP.md — what is done, what is deferred, what is next.
There are exactly two patches against third_party/opensandbox/:
- `goproxy.cn` → `proxy.golang.org` in the build for Azure-region pulls.
- CRLF protection in `bootstrap.sh` (the script must be LF-only or `execd` init crashes the sandbox before the daemon attaches). Enforced by `.gitattributes`. See docs/ARCHITECTURE.md#the-crlf-bootstrap-story.
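The LF-only requirement is easy to verify before building an image. The sketch below is not part of the repo: `has_crlf` and `find_crlf_scripts` are hypothetical helpers, and the `*.sh` pattern is an assumption about which files matter.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """True if the bytes contain any carriage return (CRLF or a bare CR)."""
    return b"\r" in data


def find_crlf_scripts(root: str, pattern: str = "*.sh") -> list[str]:
    """List files under root matching pattern that contain carriage returns."""
    return sorted(
        str(path)
        for path in Path(root).rglob(pattern)
        if path.is_file() and has_crlf(path.read_bytes())
    )
```

Reading the file as bytes matters here: opening it in text mode would let Python's universal-newline handling hide the very characters the check is looking for.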
DarkForge — the Azure landing zone, IaC, docs, and SDK wrappers under this repo — is
licensed under the MIT LICENSE.
The sandbox runtime under third_party/opensandbox/ is the
upstream alibaba/OpenSandbox project (Apache
License 2.0, © Alibaba Group and contributors). It is vendored unmodified except for the
two patches listed above, both of which are documented in
THIRD_PARTY_LICENSES.md as required by Apache-2.0 §4(b).
The upstream LICENSE is preserved at
third_party/opensandbox/LICENSE and applies to all
files in that directory. See also NOTICE.
