DarkForge

The execution substrate for distributed AI engineering on Azure.
One engineer, a fleet of safely-parallel sandboxes, 100× the work.

_{Status: alpha · internal · single-region eastus2}

A working Azure deployment of a Kubernetes-native, Kata-isolated sandbox runtime where engineers and their agents run code they want to be able to throw away. The stack is wired end-to-end: any LLM agent or developer SDK can call Sandbox.create, get back a running Kata-isolated pod, execute code, and read results. Kimi K2.5 in Microsoft Foundry is one worked example; the path generalises to GPT, Claude, or any other model the platform team chooses to wire in.

Everything in this repo describes the Azure landing zone — the AKS cluster, Kata node pool, ACR Premium with private endpoint, Azure Firewall egress, Event Hubs/Stream Analytics audit pipeline, Workload Identity, ACA control plane, and the deployment IaC.

Why this exists — motivation & philosophy

The bottleneck for distributed AI engineering at most enterprises is not the model. It's the execution substrate the agents run against. An AI agent that cannot reliably and safely execute its own generated code, in parallel, with the same identity and audit story as a human engineer, is a chatbot — not a teammate.

The project's premise is one sentence:

An engineer with a fleet of safely-parallel sandboxes can do 100× more distributed AI engineering than the same engineer typing into a single shell.

To make that real inside an enterprise, four properties have to hold simultaneously, and they're hard to get together — which is the gap this repo fills:

Hard isolation, because blast radius is the bottleneck. Engineers run code they want to throw away — LLM-generated, half-baked, disposable, dozens of variants at once. On a shared-kernel container, one bad pip install or rm -rf poisons the AKS node and everyone else's work on it. The trust boundary here is Kata Containers — per-pod VM-grade isolation with its own guest kernel, not a Linux namespace. A misbehaving agent (or, in the rare case, a hostile one) earns an empty disposable VM and nothing else.
Per-user identity, end to end. Every sandbox action traces to a real Entra ID identity via OBO → Workload Identity → projected SA token. No shared service-principal "platform" account erasing the user in audit logs. An agent acting on behalf of an engineer leaves the engineer's name on every API call, every egress flow, every secret fetch.
Deny-by-default at every layer. Azure Policy in Deny mode, Cilium L7 FQDN allowlist on egress, Azure Firewall Premium as the network backstop, Kubernetes RBAC bound to Entra groups, ACR public access disabled with a private endpoint, signed images at admission. Anything not explicitly permitted is dropped — and the audit pipeline sees it drop.
Reproducible IaC. Bicep for the landing zone, Helm for the runtime. The whole thing should rebuild from a fresh subscription with az deployment sub create plus a Helm install — no clicks, no tribal knowledge.

The 100× thesis

A single engineer driving a fleet of these sandboxes can:

Parallelise grunt work — run hundreds of small evaluations, dependency upgrades, security scans, doc generations, or migrations concurrently, each in its own Kata pod, each attributed to the engineer, each isolated from the others.
Let agents iterate safely — give a Kimi-K2.5 / GPT / Claude agent a sandbox it can write to, break, and discard, without ever touching the engineer's laptop or the shared infra. The agent's blast radius is the inside of one VM.
Treat code execution as a cloud primitive — Sandbox.create() is as cheap to call as BlobClient.upload(). Once the substrate is solid, the engineer stops thinking about where code runs and starts thinking about what runs.
Keep the auditor happy — every command, every egress destination, every secret read flows into Event Hubs → Stream Analytics → Log Analytics inside ≤60 s, attributed to a real user OID. There is no "AI used my creds and we don't know what it did" failure mode.

This is the substrate. The agents, the evaluation harnesses, the migration scripts — those are what you build on top of it. The repo is the thing nobody wants to build twice.

What's actually running right now

Resource groups rg-opensandbox-dev (cluster + control plane) and rg-opensandbox-demo (ACR), both in eastus2.

Layer	Resource	Notes
Cluster	`aks-opensandbox-dev`	Kubernetes 1.34.7, Azure CNI Overlay + Cilium dataplane, ACNS + Hubble UI
System pool	3 nodes (runc)	Sandbox controller, server, ingress, system addons
Kata pool	Kata Containers, `kata-vm-isolation` runtime class	Cloud Hypervisor (MSHV), inner-VM kernel `6.6.130.1-3.azl3` (Azure Linux 3)
Container registry	`acropensandboxdemo7075` (ACR Premium)	Public access disabled, private endpoint `pe-acr-opensandbox-dev` (10.10.12.6), private DNS zone `privatelink.azurecr.io`
Egress firewall	`afw-opensandbox-dev` (Azure Firewall Premium)	Private IP 10.10.10.4, policy `afwp-opensandbox-dev`, two rule collection groups (`rcg-aks-bootstrap` p100, `rcg-sandbox-egress` p200), deny-all at p300
Sandbox UDR	`rt-snet-kata-dev`	Forces 0.0.0.0/0 from `snet-kata` to the firewall
Audit pipeline	Event Hubs `evhns-opensandbox-dev` (LocalAuthDisabled) → Stream Analytics `asa-opensandbox-audit-dev` → blob `stasadevse3bwihj3in4s/audit-fast`	Event hub `sandbox-audit-fast`, 4 partitions; ASA uses system-assigned MI with EH Data Receiver + Storage Blob Data Contributor
Control plane (ACA)	`acaenv-opensandbox-dev` in snet-aca	3 container apps
Foundry	`aihubeastus26267492086`	Kimi-K2.5 + Kimi-K2.6 deployments
Workload identity	`id-kimi-demo-dev`	Federated to the `demo` namespace's service account
Key Vault	`kv-opensandbox-dev`	Private endpoint `pe-kv-opensandbox-dev`

Architecture at a glance

Two call paths, both bottoming out in the same controller + Kata sandbox pod.

Path A — Laptop SDK (sdk_e2e.py)
================================

   developer laptop                       AKS cluster (aks-opensandbox-dev)
  +-----------------+                +------------------------------------------+
  |                 |                |                                          |
  |  Sandbox.create | --HTTP-->      |  sandbox server (FastAPI)                |
  |  Python SDK     |  kubectl       |     |                                    |
  |                 |  port-forward  |     v  creates BatchSandbox CR           |
  |  api-key auth   |  :18080        |  sandbox controller-manager (Go)         |
  +-----------------+                |     |                                    |
                                     |     v  schedules pod onto Kata pool      |
                                     |  +----------------------------------+   |
                                     |  | Sandbox pod                      |   |
                                     |  |   runtimeClassName:              |   |
                                     |  |   kata-vm-isolation              |   |
                                     |  |                                  |   |
                                     |  |   init: execd (v1.0.8, CRLF-fixed|   |
                                     |  |   sidecar: execd daemon         )|   |
                                     |  |   user container: python:3.12   )|   |
                                     |  +----------------------------------+   |
                                     +------------------------------------------+
                                                       |
                                                       v   egress via UDR
                                                  Azure Firewall (allowlist)
                                                       |
                                                       v
                                                 pypi / npm / proxy.golang.org


Path B — Kimi agentic app (kimi_via_osb.py)
============================================

  Kimi-K2.5 / K2.6  ----(AAD bearer)----> Microsoft Foundry (aihubeastus...)
       ^                                       |
       | code in <code>...</code>              | generated Python
       |                                       v
  +----+-------------------------------------------------------------+
  | kimi_via_osb.py — extracts code, hands to the sandbox SDK         |
  +------------------------------------------------------------------+
                                       |
                                       v   (same path as A from here)
                              sandbox server -> controller -> Kata pod
                                       |
                                       v
                              python3 /tmp/kimi_code.py inside the sandbox
                                       |
                                       v
                                    result returned to the agent

A deeper diagram with all eleven components, the VNet/subnet table, identity flow, and the egress data path lives in docs/ARCHITECTURE.md.

Quickstart — reproduce the two demos

These steps assume an operator with cluster-admin (or equivalent kubelogin) access to aks-opensandbox-dev and a Python 3.11+ environment.

# 0. Auth + cluster context
az login
az aks get-credentials -g rg-opensandbox-dev -n aks-opensandbox-dev --overwrite-existing

# 1. Install the sandbox Python SDK
pip install opensandbox

# 2. Port-forward the sandbox server to localhost:18080
#    (service exposes port 80 — NOT 8080)
kubectl -n opensandbox-system port-forward svc/opensandbox-server 18080:80 &

# 3. Drop the server API key into examples/ for the demo scripts to read
kubectl -n opensandbox-system get secret opensandbox-server -o jsonpath='{.data.OPENSANDBOX_SERVER_API_KEY}' \
  | base64 -d > examples/.opensandbox-api-key

# 4a. Run the laptop SDK demo
python examples/sdk_e2e.py

# 4b. Run the Kimi agentic demo
#     (requires az login with access to aihubeastus26267492086)
python examples/kimi_via_osb.py

# 4c. Or drive everything from the dev portal — cluster lifecycle, swarm runs,
#     sandbox create, Kimi chat, chart-in-browser demo, observability.
#     (uses the port-forward + key file from steps 2 & 3.)
cd apps/portal-api
uv sync && uv run uvicorn app.main:app --port 8090
# then open http://localhost:8090

Repository layout

Path	Purpose
`third_party/opensandbox/`	Third-party sandbox runtime, vendored. Do not edit; sync via the upstream-sync workflow.
`infra/bicep/`	Subscription-scope Bicep for the Azure landing zone (cluster, ACR, firewall, audit).
`infra/helm/opensandbox/`	Helm chart deploying the sandbox runtime images (controller, server, execd) with Azure-specific values.
`apps/`	`apps/control-plane/` — initial FastAPI control plane on ACA. `apps/portal-api/` — dev portal (FastAPI, 24 routes) joining the in-cluster control plane, Kimi chat, swarm runner, cluster lifecycle, and observability into one local surface. `apps/portal-frontend/dist/` — Alpine.js single-page command center served by portal-api at `http://localhost:8090`. See `apps/portal-api/README.md`, `docs/PORTAL-AUTH.md`, and ROADMAP.md.
`sdks/`	Azure-flavored SDK wrappers and examples.
`examples/`	Runnable demos: laptop SDK, Kimi agentic app, hypothesis swarm. See `docs/DEMO-HYPOTHESIS-SWARM.md`.
`docs/`	This documentation set.
`runbooks/`	Ops runbooks: incident response, onboarding, CVE response, DR drill.

Documentation

docs/ARCHITECTURE.md — full architecture deep-dive, VNet table, identity and egress flows, image supply chain, failure modes, the CRLF bootstrap story.
docs/OPERATIONS.md — runbook index, cluster health checklist, image onboarding, API key rotation, execd rebuild and roll-out.
docs/index.md — entry point linking to everything above.
docs/acceptance-checklist.md — the 34 acceptance criteria for v1.
ROADMAP.md — what is done, what is deferred, what is next.

Patches against the vendored runtime

There are exactly two delta points against third_party/opensandbox/:

goproxy.cn → proxy.golang.org in the build for Azure-region pulls.
CRLF protection in bootstrap.sh (the script must be LF-only or execd init crashes the sandbox before the daemon attaches). Enforced by .gitattributes. See docs/ARCHITECTURE.md#the-crlf-bootstrap-story.

Acknowledgement and licenses

DarkForge — the Azure landing zone, IaC, docs, and SDK wrappers under this repo — is licensed under the MIT LICENSE.

The sandbox runtime under third_party/opensandbox/ is the upstream alibaba/OpenSandbox project (Apache License 2.0, © Alibaba Group and contributors). It is vendored unmodified except for the two patches listed above, both of which are documented in THIRD_PARTY_LICENSES.md as required by Apache-2.0 §4(b). The upstream LICENSE is preserved at third_party/opensandbox/LICENSE and applies to all files in that directory. See also NOTICE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DarkForge

Why this exists — motivation & philosophy

The 100× thesis

What's actually running right now

Architecture at a glance

Quickstart — reproduce the two demos

Repository layout

Documentation

Patches against the vendored runtime

Acknowledgement and licenses

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
.claude		.claude
.github/workflows		.github/workflows
.omc		.omc
apps		apps
assets		assets
docs		docs
examples		examples
infra		infra
runbooks		runbooks
scripts		scripts
sdks		sdks
tests		tests
third_party		third_party
.gitignore		.gitignore
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
ROADMAP.md		ROADMAP.md
THIRD_PARTY_LICENSES.md		THIRD_PARTY_LICENSES.md

Folders and files

Latest commit

History

Repository files navigation

DarkForge

Why this exists — motivation & philosophy

The 100× thesis

What's actually running right now

Architecture at a glance

Quickstart — reproduce the two demos

Repository layout

Documentation

Patches against the vendored runtime

Acknowledgement and licenses

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages