**File:** `k8s/README.md` (+205 lines)

# NemoClaw on Kubernetes

> **⚠️ Experimental**: This deployment method is intended for **trying out NemoClaw on Kubernetes**, not for production use. It requires a **privileged pod** running **Docker-in-Docker (DinD)** to create isolated sandbox environments. Operational requirements (storage, runtime, security policies) vary by cluster configuration.

Run [NemoClaw](https://github.com/NVIDIA/NemoClaw) on Kubernetes with GPU inference powered by [Dynamo](https://github.com/ai-dynamo/dynamo) or any OpenAI-compatible endpoint.

---

## Quick Start

### Prerequisites

- Kubernetes cluster with `kubectl` access
- An OpenAI-compatible inference endpoint (Dynamo vLLM, vLLM, etc.)
- Permissions to create **privileged pods** (required for Docker-in-Docker)
- Sufficient node resources (the manifest requests 8Gi memory and 2 CPUs for the DinD container, plus 4Gi and 2 CPUs for the workspace container)

### 1. Deploy NemoClaw

```bash
kubectl create namespace nemoclaw
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/NemoClaw/main/k8s/nemoclaw-k8s.yaml
```

### 2. Check Logs

```bash
kubectl logs -f nemoclaw -n nemoclaw -c workspace
```

Wait for the "Onboard complete" message.

### 3. Connect to Your Sandbox

```bash
kubectl exec -it nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant connect
```

You are now inside an isolated sandbox with an AI agent ready to help.

---

## Configuration

Edit the environment variables in `nemoclaw-k8s.yaml` before deploying:

| Variable | Required | Description |
|----------|----------|-------------|
| `DYNAMO_HOST` | Yes | Inference endpoint for socat proxy (e.g., `vllm-frontend.dynamo.svc:8000`) |
| `NEMOCLAW_ENDPOINT_URL` | Yes | URL the sandbox uses (usually `http://host.openshell.internal:8000/v1`) |
| `COMPATIBLE_API_KEY` | Yes | API key (use `dummy` for Dynamo/vLLM) |
| `NEMOCLAW_MODEL` | Yes | Model name (e.g., `meta-llama/Llama-3.1-8B-Instruct`) |
| `NEMOCLAW_SANDBOX_NAME` | No | Sandbox name (default: `my-assistant`) |
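Before deploying, it can help to sanity-check that every required variable has a value. A minimal Python sketch, using the variable names from the table above (the helper itself is illustrative, not part of NemoClaw):

```python
# Required variables from the configuration table; the check is illustrative.
REQUIRED_VARS = ["DYNAMO_HOST", "NEMOCLAW_ENDPOINT_URL",
                 "COMPATIBLE_API_KEY", "NEMOCLAW_MODEL"]

def missing_vars(env: dict) -> list:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

env = {
    "DYNAMO_HOST": "vllm-frontend.dynamo.svc:8000",
    "NEMOCLAW_ENDPOINT_URL": "http://host.openshell.internal:8000/v1",
    "COMPATIBLE_API_KEY": "dummy",
    "NEMOCLAW_MODEL": "meta-llama/Llama-3.1-8B-Instruct",
}
print(missing_vars(env))  # → []
```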

### Example: Custom Endpoint

```yaml
env:
- name: DYNAMO_HOST
value: "my-vllm.my-namespace.svc.cluster.local:8000"
- name: NEMOCLAW_ENDPOINT_URL
value: "http://host.openshell.internal:8000/v1"
- name: COMPATIBLE_API_KEY
value: "dummy"
- name: NEMOCLAW_MODEL
value: "mistralai/Mistral-7B-Instruct-v0.3"
```

---

## Using NemoClaw

### Access the Workspace Shell

```bash
kubectl exec -it nemoclaw -n nemoclaw -c workspace -- bash
```

### Check Sandbox Status

```bash
kubectl exec nemoclaw -n nemoclaw -c workspace -- nemoclaw list
kubectl exec nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant status
```

### Connect to Sandbox

```bash
kubectl exec -it nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant connect
```

### Test Inference

From inside the sandbox:

```bash
curl -s https://inference.local/v1/models

curl -s https://inference.local/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
```
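The same chat-completions call can be driven from any OpenAI-compatible client. A minimal Python sketch that builds the request body used above (the model name is the example from this README; actually POSTing it requires network access from inside the sandbox):

```python
import json

def chat_request(model: str, prompt: str, max_tokens: int = 50) -> str:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

body = chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
# POST `body` to https://inference.local/v1/chat/completions with
# Content-Type: application/json, e.g. via urllib.request or the openai client.
```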

### Verify Local Inference

Confirm NemoClaw is using your Dynamo/vLLM endpoint:

```bash
# Check model from sandbox
kubectl exec -it nemoclaw -n nemoclaw -c workspace -- nemoclaw my-assistant connect
sandbox@my-assistant:~$ curl -s https://inference.local/v1/models
# Should show your model (e.g., meta-llama/Llama-3.1-8B-Instruct)

# Compare with Dynamo directly (from workspace)
kubectl exec nemoclaw -n nemoclaw -c workspace -- curl -s http://localhost:8000/v1/models
# Should show the same model

# Check provider configuration
kubectl exec nemoclaw -n nemoclaw -c workspace -- openshell inference get
# Shows: Provider: compatible-endpoint, Model: <your-model>

# Test the agent
sandbox@my-assistant:~$ openclaw agent --agent main -m "What is 7 times 8?"
# Should respond with 56
```
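To compare the two `/v1/models` responses programmatically rather than by eye, a small helper can extract and match the model ids (this assumes the standard OpenAI list shape, a `data` array of `{"id": ...}` entries; the payloads below are illustrative):

```python
import json

def model_ids(models_json: str) -> set:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return {entry["id"] for entry in json.loads(models_json)["data"]}

# Illustrative captures of the two curl commands above
sandbox = '{"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B-Instruct"}]}'
direct  = '{"object":"list","data":[{"id":"meta-llama/Llama-3.1-8B-Instruct"}]}'
print(model_ids(sandbox) == model_ids(direct))  # → True
```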

---

## Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ NemoClaw Pod │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────────────────┐ │ │
│ │ │ Docker-in-Docker│ │ Workspace Container │ │ │
│ │ │ │ │ │ │ │
│ │ │ ┌───────────┐ │ │ nemoclaw CLI │ │ │
│ │ │ │ k3s │ │◄───│ openshell CLI │ │ │
│ │ │ │ cluster │ │ │ │ │ │
│ │ │ │ │ │ │ socat proxy ───────────────│───│──┼──► Dynamo/vLLM
│ │ │ │ ┌───────┐ │ │ │ localhost:8000 │ │ │
│ │ │ │ │Sandbox│ │ │ │ │ │ │
│ │ │ │ └───────┘ │ │ │ host.openshell.internal │ │ │
│ │ │ └───────────┘ │ │ routes to socat │ │ │
│ │ └─────────────────┘ └─────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```

**How it works:**

1. NemoClaw runs in a privileged pod with Docker-in-Docker
2. OpenShell creates a nested k3s cluster for sandbox isolation
3. A socat proxy bridges K8s DNS to the nested environment
4. Inside the sandbox, `host.openshell.internal:8000` routes to the inference endpoint
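
The socat bridge in step 3 is just a TCP forwarder. A self-contained Python sketch of the same idea (illustrative only; the deployment uses the actual `socat` binary):

```python
import socket
import threading

def _pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes one way until EOF, then propagate the EOF downstream."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def start_proxy(target_host: str, target_port: int) -> int:
    """Listen on an ephemeral local port and forward every connection to the
    target, like `socat TCP-LISTEN:8000,fork TCP:$DYNAMO_HOST`. Returns the port."""
    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", 0))
    server.listen()

    def accept_loop() -> None:
        while True:
            client, _ = server.accept()
            upstream = socket.create_connection((target_host, target_port))
            threading.Thread(target=_pipe, args=(client, upstream), daemon=True).start()
            threading.Thread(target=_pipe, args=(upstream, client), daemon=True).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server.getsockname()[1]
```

Each accepted connection gets its own upstream connection and a pair of one-way pipes, which is what socat's `fork` option provides.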

---

## Troubleshooting

### Pod won't start

```bash
kubectl describe pod nemoclaw -n nemoclaw
```

Common issues:

- Missing privileged security context
- Insufficient memory (needs ~8GB for DinD)

### Docker daemon not starting

```bash
kubectl logs nemoclaw -n nemoclaw -c dind
```

The daemon usually becomes ready within 30-60 seconds.

### Inference not working

Check socat is running:

```bash
kubectl exec nemoclaw -n nemoclaw -c workspace -- pgrep -a socat
```

Test endpoint directly:

```bash
kubectl exec nemoclaw -n nemoclaw -c workspace -- curl -s http://localhost:8000/v1/models
```

---

## Learn More

- [NemoClaw Documentation](https://docs.nvidia.com/nemoclaw)
- [OpenShell](https://github.com/NVIDIA/OpenShell)
- [Dynamo](https://github.com/ai-dynamo/dynamo)
- [OpenClaw](https://openclaw.ai)
**File:** `k8s/nemoclaw-k8s.yaml` (+119 lines)

```yaml
# NemoClaw on Kubernetes
# Uses the official installer with Docker-in-Docker for sandbox isolation.
# Prerequisites: kubectl create namespace nemoclaw
apiVersion: v1
kind: Pod
metadata:
  name: nemoclaw
  namespace: nemoclaw
  labels:
    app: nemoclaw
spec:
  # The pod never calls the Kubernetes API, so don't mount a token.
  automountServiceAccountToken: false
  containers:
    # Docker daemon (DinD)
    - name: dind
      image: docker:24-dind
      securityContext:
        privileged: true
      env:
        - name: DOCKER_TLS_CERTDIR
          value: ""
      command: ["dockerd", "--host=unix:///var/run/docker.sock"]
      volumeMounts:
        - name: docker-storage
          mountPath: /var/lib/docker
        - name: docker-socket
          mountPath: /var/run
        - name: docker-config
          mountPath: /etc/docker
      resources:
        requests:
          memory: "8Gi"
          cpu: "2"

    # Workspace - runs the official NemoClaw installer
    - name: workspace
      image: node:22
      command:
        - bash
        - -c
        - |
          set -e

          # Install packages
          echo "[1/4] Installing packages..."
          apt-get update -qq
          apt-get install -y -qq docker.io socat curl >/dev/null 2>&1

          # Start socat proxy for the K8s DNS bridge; fail fast if it dies,
          # since `set -e` does not cover background processes
          echo "[2/4] Starting socat proxy..."
          : "${DYNAMO_HOST:?DYNAMO_HOST must be set as host:port}"
          socat TCP-LISTEN:8000,fork,reuseaddr TCP:$DYNAMO_HOST &
          SOCAT_PID=$!
          # Add a hosts entry so validation can reach socat via host.openshell.internal
          echo "127.0.0.1 host.openshell.internal" >> /etc/hosts
          sleep 1
          kill -0 "$SOCAT_PID" 2>/dev/null || { echo "socat failed to start"; exit 1; }

          # Wait for Docker
          echo "[3/4] Waiting for Docker daemon..."
          for i in $(seq 1 30); do
            if docker info >/dev/null 2>&1; then break; fi
            sleep 2
          done
          docker info >/dev/null 2>&1 || { echo "Docker not ready"; exit 1; }
          echo "Docker ready"

          # Run the official NemoClaw installer
          echo "[4/4] Running NemoClaw installer..."
          curl -fsSL https://nvidia.com/nemoclaw.sh | bash

          # Keep running after onboarding
          echo "Onboard complete. Container staying alive."
          exec sleep infinity
      env:
        - name: DOCKER_HOST
          value: unix:///var/run/docker.sock
        # Dynamo endpoint (raw host:port for socat) - UPDATE THIS FOR YOUR CLUSTER
        - name: DYNAMO_HOST
          value: "vllm-agg-frontend.dynamo.svc.cluster.local:8000"
        # NemoClaw config (uses host.openshell.internal via socat)
        - name: NEMOCLAW_NON_INTERACTIVE
          value: "1"
        - name: NEMOCLAW_PROVIDER
          value: "custom"
        - name: NEMOCLAW_ENDPOINT_URL
          value: "http://host.openshell.internal:8000/v1"
        - name: COMPATIBLE_API_KEY
          value: "dummy"
        - name: NEMOCLAW_MODEL
          value: "meta-llama/Llama-3.1-8B-Instruct"
        - name: NEMOCLAW_SANDBOX_NAME
          value: "my-assistant"
        - name: NEMOCLAW_POLICY_MODE
          value: "skip"
      volumeMounts:
        - name: docker-socket
          mountPath: /var/run
        - name: docker-config
          mountPath: /etc/docker
      resources:
        requests:
          memory: "4Gi"
          cpu: "2"

  initContainers:
    # Configure the Docker daemon for cgroup v2
    - name: init-docker-config
      image: busybox
      command: ["sh", "-c", "echo '{\"default-cgroupns-mode\":\"host\"}' > /etc/docker/daemon.json"]
      volumeMounts:
        - name: docker-config
          mountPath: /etc/docker

  volumes:
    - name: docker-storage
      emptyDir: {}
    - name: docker-socket
      emptyDir: {}
    - name: docker-config
      emptyDir: {}

  restartPolicy: Never
```