feat: add Kubernetes testing infrastructure (#227)

rwipfelnv · claude · cv · web-flow · commit 3c7bd9341568 · 2026-03-30T09:52:28.000-04:00
* feat: add Kubernetes testing infrastructure

Add k8s-testing/ directory with scripts and manifests for testing NemoClaw on Kubernetes with Dynamo vLLM inference.

Includes:
- test-installer.sh: Public installer test (requires unattended install support)
- setup.sh: Manual setup from source for development
- Pod manifests for Docker-in-Docker execution

Architecture: OpenShell runs k3s inside Docker, so we use DinD pods to provide the Docker daemon on Kubernetes.

Signed-off-by: rwipfelnv

* fix: add socat proxy for K8s DNS isolation workaround

OpenShell's nested k3s cluster cannot resolve Kubernetes DNS names, so inference requests fail with 502 Bad Gateway.

This adds:
- socat TCP proxy setup in setup.sh to forward localhost:8000 to the K8s vLLM service endpoint
- Provider configuration using host.openshell.internal:8000 which resolves to the workspace container from inside k3s
- Documentation explaining the network architecture and workaround
- Updated env var names to match PR #318 (NEMOCLAW_NON_INTERACTIVE)
- cgroup v2 compatibility fix for Docker daemon
- Removed memory limits that caused OOM

Tested: Inference requests from sandboxes now route correctly through the socat proxy to the Dynamo vLLM endpoint.

Depends on: #318 (non-interactive mode), #365 (Dynamo provider)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: NemoKlaw - NemoClaw on Kubernetes with Dynamo support

Complete K8s deployment solution for NemoClaw:
- nemoklaw.yaml: Pod manifest with DinD, init containers, hostPath storage
- install.sh: Interactive installer with preflight checks
- Rename k8s-testing -> k8s, move old files to dev/

Key learnings:
- hostPath storage (/mnt/k8s-disks) avoids ephemeral storage eviction
- Init containers for docker config, openshell CLI, NemoClaw build
- Workspace container installs apt packages at runtime (can't share via volumes)
- socat proxy bridges K8s DNS to nested k3s (host.openshell.internal)

Tested successfully with Dynamo vLLM backend on EKS.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* fix: rename NemoKlaw to NemoClaw and document known limitations

Address PR feedback:
- Rename NemoKlaw -> NemoClaw (avoid confusing naming)
- Rename nemoklaw.yaml -> nemoclaw-k8s.yaml
- Fix hardcoded endpoint to use generic example
- Remove log file from repo
- Document known limitations (HTTPS proxy issue)
- Update README with accurate status of what works/doesn't work

Signed-off-by: rwipfelnv
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: update DYNAMO_HOST to vllm-agg-frontend

The aggregated frontend service is the correct endpoint for Dynamo vLLM inference.

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* docs: add Using NemoClaw section with CLI commands

- Add workspace shell access command
- Add sandbox status/logs/list commands
- Add chat completion test example
- Rename section from "What Can You Do?" to "Using NemoClaw"

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>

* refactor(k8s): simplify deployment to use official installer

- Use official NemoClaw installer (`curl | bash`) instead of git clone/build
- Switch to `custom` provider from PR #648 (supersedes dynamo-specific provider)
- Remove k8s/dev/ directory (no longer needed for testing)
- Use emptyDir volumes for portability across clusters
- Add /etc/hosts workaround for endpoint validation during onboarding
- Update README with verification steps for local inference

Tested end-to-end with Dynamo vLLM backend.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(k8s): resolve lint errors in yaml and markdown

- Remove multi-document YAML (move namespace creation to README)
- Add language specifier to fenced code block (```text)
- Add blank lines before lists per markdownlint rules

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

* docs(k8s): add experimental warning and clarify requirements

- Add explicit experimental warning at top of README
- Clarify this is for trying NemoClaw on k8s, not production
- Document privileged pod and DinD requirements upfront
- Add resource requirements to prerequisites

Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

---------

Signed-off-by: rwipfelnv
Signed-off-by: Robert Wipfel <rwipfel@nvidia.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: KJ <kejones@nvidia.com>
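
Note on the socat workaround described in the commit message: the proxy takes its target as a raw `host:port` value (`DYNAMO_HOST`), and because socat is started in the background, a malformed value does not stop the startup script. A minimal sketch of a check that could run before socat starts is given here. `valid_host_port` is a hypothetical helper introduced for illustration and is not part of this change.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this PR): validate a host:port value
# before handing it to socat.
valid_host_port() {
  local value="$1"
  local host="${value%:*}"   # everything before the last colon
  local port="${value##*:}"  # everything after the last colon

  # Require a colon, a non-empty host, and a numeric port in 1-65535.
  [[ "$value" == *:* ]] || return 1
  [[ -n "$host" ]] || return 1
  [[ "$port" =~ ^[0-9]+$ ]] || return 1
  (( 10#$port >= 1 && 10#$port <= 65535 ))
}
```

A call such as `valid_host_port "$DYNAMO_HOST" || exit 1` placed ahead of the socat command would stop the workspace script early instead of leaving a dead proxy behind.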
diff --git a/k8s/README.md b/k8s/README.md
@@ -0,0 +1,205 @@
+# NemoClaw on Kubernetes
+
+> **⚠️ Experimental**: This deployment method is intended for **trying out NemoClaw on Kubernetes**, not for production use. It requires a **privileged pod** running **Docker-in-Docker (DinD)** to create isolated sandbox environments. Operational requirements (storage, runtime, security policies) vary by cluster configuration.
+
+Run [NemoClaw](https://github.com/NVIDIA/NemoClaw) on Kubernetes with GPU inference powered by [Dynamo](https://github.com/ai-dynamo/dynamo) or any OpenAI-compatible endpoint.
+
+---
+
+## Quick Start
+
+### Prerequisites
+
+- Kubernetes cluster with `kubectl` access
+- An OpenAI-compatible inference endpoint (Dynamo vLLM, vLLM, etc.)
+- Permissions to create **privileged pods** (required for Docker-in-Docker)
+- Sufficient node resources: the manifest requests 8Gi memory and 2 CPUs for the DinD container, plus 4Gi memory and 2 CPUs for the workspace container
+
+### 1. Deploy NemoClaw
+
+```bash
+kubectl create namespace nemoclaw
+kubectl apply -f https://raw.githubusercontent.com/NVIDIA/NemoClaw/main/k8s/nemoclaw-k8s.yaml
+```
+
+### 2. Check Logs
+
+```bash
+kubectl logs -f nemoclaw -n nemoclaw -c workspace
+```
+
+Wait for the "Onboard complete" message.
+
+### 3. Connect to Your Sandbox
+
+```bash
+kubectl exec -it nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant connect
+```
+
+You're now inside a secure sandbox with an AI agent ready to help.
+
+---
+
+## Configuration
+
+Edit the environment variables in `nemoclaw-k8s.yaml` before deploying:
+
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `DYNAMO_HOST` | Yes | Inference endpoint for the socat proxy as raw `host:port`, without a scheme (e.g., `vllm-agg-frontend.dynamo.svc.cluster.local:8000`) |
+| `NEMOCLAW_ENDPOINT_URL` | Yes | URL the sandbox uses (usually `http://host.openshell.internal:8000/v1`) |
+| `COMPATIBLE_API_KEY` | Yes | API key (use `dummy` for Dynamo/vLLM) |
+| `NEMOCLAW_MODEL` | Yes | Model name (e.g., `meta-llama/Llama-3.1-8B-Instruct`) |
+| `NEMOCLAW_SANDBOX_NAME` | No | Sandbox name (default: `my-assistant`) |
+
+### Example: Custom Endpoint
+
+```yaml
+env:
+  - name: DYNAMO_HOST
+    value: "my-vllm.my-namespace.svc.cluster.local:8000"
+  - name: NEMOCLAW_ENDPOINT_URL
+    value: "http://host.openshell.internal:8000/v1"
+  - name: COMPATIBLE_API_KEY
+    value: "dummy"
+  - name: NEMOCLAW_MODEL
+    value: "mistralai/Mistral-7B-Instruct-v0.3"
+```
+
+---
+
+## Using NemoClaw
+
+### Access the Workspace Shell
+
+```bash
+kubectl exec -it nemoclaw -n nemoclaw -c workspace -- bash
+```
+
+### Check Sandbox Status
+
+```bash
+kubectl exec nemoclaw -n nemoclaw -c workspace -- nemoclaw list
+kubectl exec nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant status
+```
+
+### Connect to Sandbox
+
+```bash
+kubectl exec -it nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant connect
+```
+
+### Test Inference
+
+From inside the sandbox:
+
+```bash
+curl -s https://inference.local/v1/models
+
+curl -s https://inference.local/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
+```
+
+### Verify Local Inference
+
+Confirm NemoClaw is using your Dynamo/vLLM endpoint:
+
+```bash
+# Check model from sandbox
+kubectl exec -it nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant connect
+sandbox@my-assistant:~$ curl -s https://inference.local/v1/models
+# Should show your model (e.g., meta-llama/Llama-3.1-8B-Instruct)
+
+# Compare with Dynamo as seen from the workspace (localhost:8000 is the socat proxy)
+kubectl exec nemoclaw -n nemoclaw -c workspace -- curl -s http://localhost:8000/v1/models
+# Should show the same model
+
+# Check provider configuration
+kubectl exec nemoclaw -n nemoclaw -c workspace -- openshell inference get
+# Shows: Provider: compatible-endpoint, Model: <your-model>
+
+# Test the agent
+sandbox@my-assistant:~$ openclaw agent --agent main -m "What is 7 times 8?"
+# Should respond with 56
+```
+
+---
+
+## Architecture
+
+```text
+┌─────────────────────────────────────────────────────────────────┐
+│                     Kubernetes Cluster                          │
+│                                                                 │
+│  ┌───────────────────────────────────────────────────────────┐  │
+│  │                    NemoClaw Pod                           │  │
+│  │                                                           │  │
+│  │  ┌─────────────────┐    ┌─────────────────────────────┐   │  │
+│  │  │ Docker-in-Docker│    │    Workspace Container      │   │  │
+│  │  │                 │    │                             │   │  │
+│  │  │  ┌───────────┐  │    │  nemoclaw CLI               │   │  │
+│  │  │  │    k3s    │  │◄───│  openshell CLI              │   │  │
+│  │  │  │  cluster  │  │    │                             │   │  │
+│  │  │  │           │  │    │  socat proxy ───────────────│───│──┼──► Dynamo/vLLM
+│  │  │  │ ┌───────┐ │  │    │  localhost:8000             │   │  │
+│  │  │  │ │Sandbox│ │  │    │                             │   │  │
+│  │  │  │ └───────┘ │  │    │  host.openshell.internal    │   │  │
+│  │  │  └───────────┘  │    │  routes to socat            │   │  │
+│  │  └─────────────────┘    └─────────────────────────────┘   │  │
+│  └───────────────────────────────────────────────────────────┘  │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+**How it works:**
+
+1. NemoClaw runs in a privileged pod with Docker-in-Docker
+2. OpenShell creates a nested k3s cluster for sandbox isolation
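+3. A socat proxy in the workspace container listens on port 8000 and forwards to `DYNAMO_HOST`, because the nested k3s cluster cannot resolve Kubernetes DNS names
+4. Inside the nested cluster, `host.openshell.internal:8000` resolves to the workspace container, so inference requests reach the endpoint through socat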
+
+---
+
+## Troubleshooting
+
+### Pod won't start
+
+```bash
+kubectl describe pod nemoclaw -n nemoclaw
+```
+
+Common issues:
+
+- Missing privileged security context
+- Insufficient node resources (the pod requests 12Gi memory and 4 CPUs in total: 8Gi and 2 CPUs for DinD, 4Gi and 2 CPUs for the workspace)
+
+### Docker daemon not starting
+
+```bash
+kubectl logs nemoclaw -n nemoclaw -c dind
+```
+
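+The daemon usually becomes ready within 30-60 seconds. The workspace container checks `docker info` 30 times, 2 seconds apart, and exits with "Docker not ready" if the daemon never responds.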
+
+### Inference not working
+
+Check that socat is running:
+
+```bash
+kubectl exec nemoclaw -n nemoclaw -c workspace -- pgrep -a socat
+```
+
+Test the endpoint through the socat proxy:
+
+```bash
+kubectl exec nemoclaw -n nemoclaw -c workspace -- curl -s http://localhost:8000/v1/models
+```
+
+---
+
+## Learn More
+
+- [NemoClaw Documentation](https://docs.nvidia.com/nemoclaw)
+- [OpenShell](https://github.com/NVIDIA/OpenShell)
+- [Dynamo](https://github.com/ai-dynamo/dynamo)
+- [OpenClaw](https://openclaw.ai)
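
The Configuration table in the README marks four variables as required. A minimal sketch of a startup check for them follows, assuming it would run inside the workspace container, where the manifest's `env` values are visible as environment variables. `required_vars` and `missing_env` are hypothetical names used for illustration; neither exists in NemoClaw or in this change.

```bash
#!/usr/bin/env bash
# Variable names taken from the README's Configuration table.
required_vars=(DYNAMO_HOST NEMOCLAW_ENDPOINT_URL COMPATIBLE_API_KEY NEMOCLAW_MODEL)

# Hypothetical helper: print each unset or empty required variable and
# return non-zero if at least one is missing.
missing_env() {
  local name missing=0
  for name in "${required_vars[@]}"; do
    # ${!name:-} is bash indirect expansion: the value of the variable
    # whose name is stored in $name, or empty if it is unset.
    if [[ -z "${!name:-}" ]]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Run as `missing_env || exit 1`, this would report every missing variable at once rather than failing on the first one.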
diff --git a/k8s/nemoclaw-k8s.yaml b/k8s/nemoclaw-k8s.yaml
@@ -0,0 +1,119 @@
+# NemoClaw on Kubernetes
+# Uses official installer with Docker-in-Docker for sandbox isolation.
+# Prerequisites: kubectl create namespace nemoclaw
+apiVersion: v1
+kind: Pod
+metadata:
+  name: nemoclaw
+  namespace: nemoclaw
+  labels:
+    app: nemoclaw
+spec:
+  containers:
+    # Docker daemon (DinD)
+    - name: dind
+      image: docker:24-dind
+      securityContext:
+        privileged: true
+      env:
+        - name: DOCKER_TLS_CERTDIR
+          value: ""
+      command: ["dockerd", "--host=unix:///var/run/docker.sock"]
+      volumeMounts:
+        - name: docker-storage
+          mountPath: /var/lib/docker
+        - name: docker-socket
+          mountPath: /var/run
+        - name: docker-config
+          mountPath: /etc/docker
+      resources:
+        requests:
+          memory: "8Gi"
+          cpu: "2"
+
+    # Workspace - runs official NemoClaw installer
+    - name: workspace
+      image: node:22
+      command:
+        - bash
+        - -c
+        - |
+          set -eo pipefail  # pipefail: a failed curl in the installer pipeline stops the script
+
+          # Install packages
+          echo "[1/4] Installing packages..."
+          apt-get update -qq
+          apt-get install -y -qq docker.io socat curl >/dev/null 2>&1
+
+          # Start socat proxy for K8s DNS bridge
+          echo "[2/4] Starting socat proxy..."
+          socat TCP-LISTEN:8000,fork,reuseaddr "TCP:${DYNAMO_HOST}" &
+          # Add hosts entry so validation can reach socat via host.openshell.internal
+          echo "127.0.0.1 host.openshell.internal" >> /etc/hosts
+          sleep 1
+
+          # Wait for Docker
+          echo "[3/4] Waiting for Docker daemon..."
+          for i in $(seq 1 30); do
+            if docker info >/dev/null 2>&1; then break; fi
+            sleep 2
+          done
+          docker info >/dev/null 2>&1 || { echo "Docker not ready"; exit 1; }
+          echo "Docker ready"
+
+          # Run official NemoClaw installer
+          echo "[4/4] Running NemoClaw installer..."
+          curl -fsSL https://nvidia.com/nemoclaw.sh | bash
+
+          # Keep running after onboard
+          echo "Onboard complete. Container staying alive."
+          exec sleep infinity
+      env:
+        - name: DOCKER_HOST
+          value: unix:///var/run/docker.sock
+        # Dynamo endpoint (raw host:port for socat) - UPDATE THIS FOR YOUR CLUSTER
+        - name: DYNAMO_HOST
+          value: "vllm-agg-frontend.dynamo.svc.cluster.local:8000"
+        # NemoClaw config (uses host.openshell.internal via socat)
+        - name: NEMOCLAW_NON_INTERACTIVE
+          value: "1"
+        - name: NEMOCLAW_PROVIDER
+          value: "custom"
+        - name: NEMOCLAW_ENDPOINT_URL
+          value: "http://host.openshell.internal:8000/v1"
+        - name: COMPATIBLE_API_KEY
+          value: "dummy"
+        - name: NEMOCLAW_MODEL
+          value: "meta-llama/Llama-3.1-8B-Instruct"
+        - name: NEMOCLAW_SANDBOX_NAME
+          value: "my-assistant"
+        - name: NEMOCLAW_POLICY_MODE
+          value: "skip"
+      volumeMounts:
+        - name: docker-socket
+          mountPath: /var/run
+        - name: docker-config
+          mountPath: /etc/docker
+      resources:
+        requests:
+          memory: "4Gi"
+          cpu: "2"
+
+  initContainers:
+    # Configure Docker daemon for cgroup v2
+    - name: init-docker-config
+      image: busybox
+      command: ["sh", "-c", "echo '{\"default-cgroupns-mode\":\"host\"}' > /etc/docker/daemon.json"]
+      volumeMounts:
+        - name: docker-config
+          mountPath: /etc/docker
+
+  volumes:
+    - name: docker-storage
+      emptyDir: {}
+    - name: docker-socket
+      emptyDir: {}
+    - name: docker-config
+      emptyDir: {}
+
+  restartPolicy: Never
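
The Docker wait loop in the workspace script (30 checks, 2 seconds apart) could be generalized into a small retry helper. The following is a sketch only: `wait_for` is a hypothetical name and does not appear in the manifest.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_for() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    # Discard the command's output; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    (( i < attempts )) && sleep "$delay"
  done
  return 1
}
```

With this helper, the existing loop would reduce to `wait_for 30 2 docker info || { echo "Docker not ready"; exit 1; }`, keeping the same attempt count and delay as the manifest.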