
Commit 3c7bd93

rwipfelnv, claude, cv, and kjw3 authored
feat: add Kubernetes testing infrastructure (#227)
* feat: add Kubernetes testing infrastructure

Add k8s-testing/ directory with scripts and manifests for testing NemoClaw on Kubernetes with Dynamo vLLM inference.

Includes:
- test-installer.sh: Public installer test (requires unattended install support)
- setup.sh: Manual setup from source for development
- Pod manifests for Docker-in-Docker execution

Architecture: OpenShell runs k3s inside Docker, so we use DinD pods to provide the Docker daemon on Kubernetes.

Signed-off-by: rwipfelnv

* fix: add socat proxy for K8s DNS isolation workaround

OpenShell's nested k3s cluster cannot resolve Kubernetes DNS names, so inference requests fail with 502 Bad Gateway.

This adds:
- socat TCP proxy setup in setup.sh to forward localhost:8000 to the K8s vLLM service endpoint
- Provider configuration using host.openshell.internal:8000 which resolves to the workspace container from inside k3s
- Documentation explaining the network architecture and workaround
- Updated env var names to match PR #318 (NEMOCLAW_NON_INTERACTIVE)
- cgroup v2 compatibility fix for Docker daemon
- Removed memory limits that caused OOM

Tested: Inference requests from sandboxes now route correctly through the socat proxy to the Dynamo vLLM endpoint.

Depends on: #318 (non-interactive mode), #365 (Dynamo provider)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: NemoKlaw - NemoClaw on Kubernetes with Dynamo support

Complete K8s deployment solution for NemoClaw:
- nemoklaw.yaml: Pod manifest with DinD, init containers, hostPath storage
- install.sh: Interactive installer with preflight checks
- Rename k8s-testing -> k8s, move old files to dev/

Key learnings:
- hostPath storage (/mnt/k8s-disks) avoids ephemeral storage eviction
- Init containers for docker config, openshell CLI, NemoClaw build
- Workspace container installs apt packages at runtime (can't share via volumes)
- socat proxy bridges K8s DNS to nested k3s (host.openshell.internal)

Tested successfully with Dynamo vLLM backend on EKS.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* fix: rename NemoKlaw to NemoClaw and document known limitations

Address PR feedback:
- Rename NemoKlaw -> NemoClaw (avoid confusing naming)
- Rename nemoklaw.yaml -> nemoclaw-k8s.yaml
- Fix hardcoded endpoint to use generic example
- Remove log file from repo
- Document known limitations (HTTPS proxy issue)
- Update README with accurate status of what works/doesn't work

Signed-off-by: rwipfelnv
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: update DYNAMO_HOST to vllm-agg-frontend

The aggregated frontend service is the correct endpoint for Dynamo vLLM inference.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* docs: add Using NemoClaw section with CLI commands

- Add workspace shell access command
- Add sandbox status/logs/list commands
- Add chat completion test example
- Rename section from "What Can You Do?" to "Using NemoClaw"

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* refactor(k8s): simplify deployment to use official installer

- Use official NemoClaw installer (`curl | bash`) instead of git clone/build
- Switch to `custom` provider from PR #648 (supersedes dynamo-specific provider)
- Remove k8s/dev/ directory (no longer needed for testing)
- Use emptyDir volumes for portability across clusters
- Add /etc/hosts workaround for endpoint validation during onboarding
- Update README with verification steps for local inference

Tested end-to-end with Dynamo vLLM backend.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(k8s): resolve lint errors in yaml and markdown

- Remove multi-document YAML (move namespace creation to README)
- Add language specifier to fenced code block (```text)
- Add blank lines before lists per markdownlint rules

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* docs(k8s): add experimental warning and clarify requirements

- Add explicit experimental warning at top of README
- Clarify this is for trying NemoClaw on k8s, not production
- Document privileged pod and DinD requirements upfront
- Add resource requirements to prerequisites

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

---------

Signed-off-by: rwipfelnv
Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: KJ <kejones@nvidia.com>
1 parent f59f58e commit 3c7bd93

2 files changed, +324 -0 lines changed


k8s/README.md

Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
# NemoClaw on Kubernetes

> **⚠️ Experimental**: This deployment method is intended for **trying out NemoClaw on Kubernetes**, not for production use. It requires a **privileged pod** running **Docker-in-Docker (DinD)** to create isolated sandbox environments. Operational requirements (storage, runtime, security policies) vary by cluster configuration.

Run [NemoClaw](https://github.com/NVIDIA/NemoClaw) on Kubernetes with GPU inference powered by [Dynamo](https://github.com/ai-dynamo/dynamo) or any OpenAI-compatible endpoint.

---

## Quick Start

### Prerequisites

- Kubernetes cluster with `kubectl` access
- An OpenAI-compatible inference endpoint (Dynamo vLLM, vLLM, etc.)
- Permissions to create **privileged pods** (required for Docker-in-Docker)
- Sufficient node resources: the manifest requests 8Gi of memory and 2 CPUs for the DinD container, plus 4Gi and 2 CPUs for the workspace container

### 1. Deploy NemoClaw

```bash
kubectl create namespace nemoclaw
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/NemoClaw/main/k8s/nemoclaw-k8s.yaml
```

This applies the manifest with its default `DYNAMO_HOST`. To point at your own endpoint, download the file and edit it first (see Configuration).

### 2. Check Logs

```bash
kubectl logs -f nemoclaw -n nemoclaw -c workspace
```

Wait for the "Onboard complete" message.

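To script this wait instead of watching the stream, the check can be wrapped in a small polling helper. The sketch below is illustrative only: `wait_for_line` and its arguments are hypothetical names, not part of NemoClaw or `kubectl`, and the only detail taken from the manifest is the "Onboard complete" text it prints.

```bash
# wait_for_line PATTERN ATTEMPTS DELAY CMD...
# Hypothetical helper: reruns CMD until its output contains PATTERN,
# sleeping DELAY seconds between tries. Returns 1 after ATTEMPTS failures.
wait_for_line() {
  local pattern=$1 attempts=$2 delay=$3
  shift 3
  local i
  for ((i = 1; i <= attempts; i++)); do
    # grep -F treats PATTERN as a fixed string, not a regex
    if "$@" 2>/dev/null | grep -qF -- "$pattern"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example (assumes the pod and container names used above):
#   wait_for_line "Onboard complete" 120 5 \
#     kubectl logs nemoclaw -n nemoclaw -c workspace
```

Re-reading the whole log on every attempt is simple rather than efficient, which should be acceptable for a one-off onboarding wait.
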
### 3. Connect to Your Sandbox

```bash
kubectl exec -it nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant connect
```

You're now inside a secure sandbox with an AI agent ready to help.

---

## Configuration

Edit the environment variables in `nemoclaw-k8s.yaml` before deploying:

| Variable | Required | Description |
|----------|----------|-------------|
| `DYNAMO_HOST` | Yes | Inference endpoint as a raw `host:port` for the socat proxy (e.g., `vllm-frontend.dynamo.svc:8000`) |
| `NEMOCLAW_ENDPOINT_URL` | Yes | URL the sandbox uses (usually `http://host.openshell.internal:8000/v1`) |
| `COMPATIBLE_API_KEY` | Yes | API key (use `dummy` for Dynamo/vLLM) |
| `NEMOCLAW_MODEL` | Yes | Model name (e.g., `meta-llama/Llama-3.1-8B-Instruct`) |
| `NEMOCLAW_SANDBOX_NAME` | No | Sandbox name (default: `my-assistant`) |

### Example: Custom Endpoint

```yaml
env:
  - name: DYNAMO_HOST
    value: "my-vllm.my-namespace.svc.cluster.local:8000"
  - name: NEMOCLAW_ENDPOINT_URL
    value: "http://host.openshell.internal:8000/v1"
  - name: COMPATIBLE_API_KEY
    value: "dummy"
  - name: NEMOCLAW_MODEL
    value: "mistralai/Mistral-7B-Instruct-v0.3"
```

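Because socat takes its target as a bare `host:port`, a `DYNAMO_HOST` value that carries a scheme or path is unlikely to work. As a rough pre-deploy check, the sketch below rejects such values. It assumes hostnames or IPv4 addresses only (IPv6 literals are not handled), and `is_host_port` is a hypothetical helper, not part of NemoClaw.

```bash
# is_host_port VALUE
# Hypothetical check: succeeds only for a bare host:port such as
# "my-vllm.my-namespace.svc.cluster.local:8000" (no scheme, no path).
is_host_port() {
  local value=$1
  [[ $value =~ ^[A-Za-z0-9._-]+:[0-9]{1,5}$ ]]
}

# Example:
#   is_host_port "$DYNAMO_HOST" || echo "DYNAMO_HOST must be host:port" >&2
```
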
---

## Using NemoClaw

### Access the Workspace Shell

```bash
kubectl exec -it nemoclaw -n nemoclaw -c workspace -- bash
```

### Check Sandbox Status

```bash
kubectl exec nemoclaw -n nemoclaw -c workspace -- nemoclaw list
kubectl exec nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant status
```

### Connect to Sandbox

```bash
kubectl exec -it nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant connect
```

### Test Inference

From inside the sandbox:

```bash
curl -s https://inference.local/v1/models

curl -s https://inference.local/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
```

### Verify Local Inference

Confirm NemoClaw is using your Dynamo/vLLM endpoint. Lines prefixed with `sandbox@my-assistant:~$` run inside the sandbox; the rest run from your local shell.

```bash
# Check model from sandbox
kubectl exec -it nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant connect
sandbox@my-assistant:~$ curl -s https://inference.local/v1/models
# Should show your model (e.g., meta-llama/Llama-3.1-8B-Instruct)

# Compare with Dynamo directly (from workspace)
kubectl exec nemoclaw -n nemoclaw -c workspace -- curl -s http://localhost:8000/v1/models
# Should show the same model

# Check provider configuration
kubectl exec nemoclaw -n nemoclaw -c workspace -- openshell inference get
# Shows: Provider: compatible-endpoint, Model: <your-model>

# Test the agent
sandbox@my-assistant:~$ openclaw agent --agent main -m "What is 7 times 8?"
# Should respond with 56
```

---

## Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                             │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  NemoClaw Pod                                             │  │
│  │                                                           │  │
│  │  ┌─────────────────┐    ┌─────────────────────────────┐   │  │
│  │  │ Docker-in-Docker│    │ Workspace Container         │   │  │
│  │  │                 │    │                             │   │  │
│  │  │  ┌───────────┐  │    │  nemoclaw CLI               │   │  │
│  │  │  │    k3s    │  │◄───│  openshell CLI              │   │  │
│  │  │  │  cluster  │  │    │                             │   │  │
│  │  │  │           │  │    │  socat proxy ───────────────│───│──┼──► Dynamo/vLLM
│  │  │  │ ┌───────┐ │  │    │  localhost:8000             │   │  │
│  │  │  │ │Sandbox│ │  │    │                             │   │  │
│  │  │  │ └───────┘ │  │    │  host.openshell.internal    │   │  │
│  │  │  └───────────┘  │    │  routes to socat            │   │  │
│  │  └─────────────────┘    └─────────────────────────────┘   │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
```

**How it works:**

1. NemoClaw runs in a privileged pod with Docker-in-Docker
2. OpenShell creates a nested k3s cluster for sandbox isolation
3. The nested k3s cluster cannot resolve Kubernetes DNS names, so a socat proxy in the workspace container listens on port 8000 and forwards to `DYNAMO_HOST`
4. Inside the sandbox, `host.openshell.internal:8000` resolves to the workspace container, where socat forwards requests to the inference endpoint

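Step 4 only holds if the port in `NEMOCLAW_ENDPOINT_URL` is the port socat listens on (8000 in the manifest). A minimal sketch of that consistency check follows. `endpoint_port` and `proxy_matches_endpoint` are hypothetical helpers written for illustration, and the parsing assumes the simple `scheme://host:port/path` URL form used in this README.

```bash
# endpoint_port URL
# Hypothetical helper: prints the port of a scheme://host:port/path URL.
# Fails if the URL carries no explicit port.
endpoint_port() {
  local url=$1
  local hostport=${url#*://}   # drop the scheme
  hostport=${hostport%%/*}     # drop the path
  [[ $hostport == *:* ]] || return 1
  printf '%s\n' "${hostport##*:}"
}

# proxy_matches_endpoint LISTEN_PORT URL
# Succeeds when the sandbox-facing URL targets the port socat listens on.
proxy_matches_endpoint() {
  local listen_port=$1 url=$2 port
  port=$(endpoint_port "$url") || return 1
  [[ $port == "$listen_port" ]]
}

# Example:
#   proxy_matches_endpoint 8000 "$NEMOCLAW_ENDPOINT_URL" \
#     || echo "endpoint URL does not point at the socat port" >&2
```
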
---

## Troubleshooting

### Pod won't start

```bash
kubectl describe pod nemoclaw -n nemoclaw
```

Common issues:

- Missing privileged security context
- Insufficient memory (the DinD container requests 8Gi)

### Docker daemon not starting

```bash
kubectl logs nemoclaw -n nemoclaw -c dind
```

The daemon usually becomes ready within 30-60 seconds. The workspace container polls `docker info` every 2 seconds, up to 30 times, and exits with "Docker not ready" if the daemon never comes up.

### Inference not working

Check that socat is running:

```bash
kubectl exec nemoclaw -n nemoclaw -c workspace -- pgrep -a socat
```

Test the endpoint directly from the workspace container:

```bash
kubectl exec nemoclaw -n nemoclaw -c workspace -- curl -s http://localhost:8000/v1/models
```

---

## Learn More

- [NemoClaw Documentation](https://docs.nvidia.com/nemoclaw)
- [OpenShell](https://github.com/NVIDIA/OpenShell)
- [Dynamo](https://github.com/ai-dynamo/dynamo)
- [OpenClaw](https://openclaw.ai)

k8s/nemoclaw-k8s.yaml

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# NemoClaw on Kubernetes
# Uses official installer with Docker-in-Docker for sandbox isolation.
# Prerequisites: kubectl create namespace nemoclaw
apiVersion: v1
kind: Pod
metadata:
  name: nemoclaw
  namespace: nemoclaw
  labels:
    app: nemoclaw
spec:
  containers:
    # Docker daemon (DinD)
    - name: dind
      image: docker:24-dind
      securityContext:
        privileged: true
      env:
        - name: DOCKER_TLS_CERTDIR
          value: ""
      command: ["dockerd", "--host=unix:///var/run/docker.sock"]
      volumeMounts:
        - name: docker-storage
          mountPath: /var/lib/docker
        - name: docker-socket
          mountPath: /var/run
        - name: docker-config
          mountPath: /etc/docker
      resources:
        requests:
          memory: "8Gi"
          cpu: "2"

    # Workspace - runs official NemoClaw installer
    - name: workspace
      image: node:22
      command:
        - bash
        - -c
        - |
          set -e

          # Install packages
          echo "[1/4] Installing packages..."
          apt-get update -qq
          apt-get install -y -qq docker.io socat curl >/dev/null 2>&1

          # Start socat proxy for K8s DNS bridge
          echo "[2/4] Starting socat proxy..."
          socat TCP-LISTEN:8000,fork,reuseaddr TCP:$DYNAMO_HOST &
          # Add hosts entry so validation can reach socat via host.openshell.internal
          echo "127.0.0.1 host.openshell.internal" >> /etc/hosts
          sleep 1

          # Wait for Docker
          echo "[3/4] Waiting for Docker daemon..."
          for i in $(seq 1 30); do
            if docker info >/dev/null 2>&1; then break; fi
            sleep 2
          done
          docker info >/dev/null 2>&1 || { echo "Docker not ready"; exit 1; }
          echo "Docker ready"

          # Run official NemoClaw installer
          echo "[4/4] Running NemoClaw installer..."
          curl -fsSL https://nvidia.com/nemoclaw.sh | bash

          # Keep running after onboard
          echo "Onboard complete. Container staying alive."
          exec sleep infinity
      env:
        - name: DOCKER_HOST
          value: unix:///var/run/docker.sock
        # Dynamo endpoint (raw host:port for socat) - UPDATE THIS FOR YOUR CLUSTER
        - name: DYNAMO_HOST
          value: "vllm-agg-frontend.dynamo.svc.cluster.local:8000"
        # NemoClaw config (uses host.openshell.internal via socat)
        - name: NEMOCLAW_NON_INTERACTIVE
          value: "1"
        - name: NEMOCLAW_PROVIDER
          value: "custom"
        - name: NEMOCLAW_ENDPOINT_URL
          value: "http://host.openshell.internal:8000/v1"
        - name: COMPATIBLE_API_KEY
          value: "dummy"
        - name: NEMOCLAW_MODEL
          value: "meta-llama/Llama-3.1-8B-Instruct"
        - name: NEMOCLAW_SANDBOX_NAME
          value: "my-assistant"
        - name: NEMOCLAW_POLICY_MODE
          value: "skip"
      volumeMounts:
        - name: docker-socket
          mountPath: /var/run
        - name: docker-config
          mountPath: /etc/docker
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"

  initContainers:
    # Configure Docker daemon for cgroup v2
    - name: init-docker-config
      image: busybox
      command: ["sh", "-c", "echo '{\"default-cgroupns-mode\":\"host\"}' > /etc/docker/daemon.json"]
      volumeMounts:
        - name: docker-config
          mountPath: /etc/docker

  volumes:
    - name: docker-storage
      emptyDir: {}
    - name: docker-socket
      emptyDir: {}
    - name: docker-config
      emptyDir: {}

  restartPolicy: Never
