docs: add comprehensive architecture document #250

Open
hzxuzhonghu wants to merge 3 commits into volcano-sh:main from hzxuzhonghu:add-architecture

Conversation

@hzxuzhonghu
Member

Summary

This PR replaces the existing overview.md with a comprehensive architecture.md that documents the full AgentCube system design based on a thorough review of the design proposals, codebase, and the official architecture diagram.

What's included

Coverage

  • System Overview — layered ASCII diagram showing Client/SDK → Data Plane → Control Plane → Session Store → Kubernetes API → Runtime Sandboxes (with all workload types)
  • Component deep-dives — Router, Workload Manager, PicoD, agentd, Session Store
  • CRD Hierarchy — all 6 CRDs (AgentRuntime, CodeInterpreter + 4 agent-sandbox CRDs) with relationship explanations
  • Key Flows — new session request (Mermaid sequence diagram), garbage collection paths
  • Sandbox State Machine — Mermaid stateDiagram-v2
  • Warm Pool Mechanism — CodeInterpreter pre-warming flow
  • Authentication Architecture — RSA key bootstrap → provisioning → JWT runtime verification
  • Supported Workload Types — table covering Custom AI Agent, Code Interpreter, MCP Server, Browser-Use, Computer-Use
  • Binary/Entry Point Summary, Package Dependency Map, Configuration Reference
  • Design Principles — 7 core principles including workload-agnostic sandboxes

Diagrams (Mermaid)

  • New session request sequence diagram
  • Sandbox state machine

Files changed

  • docs/agentcube/docs/architecture/architecture.md — new comprehensive architecture document (391 lines)
  • docs/agentcube/docs/architecture/overview.md — removed (content superseded)

Copilot AI review requested due to automatic review settings April 2, 2026 02:00
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive architecture document for AgentCube, detailing its split-plane design, core components, and sandbox lifecycle. The review identifies several inconsistencies between the new documentation and the existing codebase, specifically regarding supported HTTP methods, default workspace paths, and binary initialization logic. Additionally, it notes missing implementation for documented configuration variables and recommends enforcing request body limits in PicoD to align with the architectural specifications.

```
GET /health → Health check (no auth required)
```

**Security**: JWT verification using RSA public key injected via `PICOD_AUTH_PUBLIC_KEY` env var. Max body: 32 MB. Path traversal protection via `sanitizePath()`.
Contributor

medium

The documentation mentions a Max body: 32 MB limit for PicoD, but this limit is not explicitly enforced in pkg/picod/server.go. It is recommended to implement a request body limit (e.g., using http.MaxBytesReader) to match the architectural specification and prevent potential DoS attacks.

| `picod` | `cmd/picod/main.go` | `make build-picod` | In-sandbox daemon: execute + files |
| `agentd` | `cmd/agentd/main.go` | `make build-agentd` | Standalone session expiry cleanup |

All binaries use `controller-runtime` for Kubernetes integration and `signal.NotifyContext()` for graceful shutdown.
Contributor

medium

The statement that all binaries use controller-runtime and signal.NotifyContext() is inaccurate. For instance, cmd/router does not use controller-runtime, cmd/workload-manager uses signal.Notify instead of signal.NotifyContext, and cmd/picod does not implement explicit signal handling in its main.go.

Suggested change
- All binaries use `controller-runtime` for Kubernetes integration and `signal.NotifyContext()` for graceful shutdown.
+ Binaries use controller-runtime or signal handling (e.g., signal.NotifyContext) for Kubernetes integration and graceful shutdown.

Comment on lines +338 to +339
| `WORKLOAD_MANAGER_ADDR` | (required) | Workload Manager service address |
| `AGENTCUBE_NAMESPACE` | `default` | Namespace for operations |
Contributor

medium

The configuration reference for the Router lists WORKLOAD_MANAGER_ADDR and AGENTCUBE_NAMESPACE as environment variables, but these are not currently implemented as flags or part of the router.Config struct in cmd/router/main.go.

| Env Var | Default | Description |
|---------|---------|-------------|
| `PORT` | `8080` | Listen port |
| `WORKSPACE` | `/workspace` | Working directory for file operations |
Contributor

medium

The documentation states the default workspace is /workspace, but the implementation in pkg/picod/server.go (lines 58-65) defaults to the current working directory if the workspace is not specified.

Suggested change
- | `WORKSPACE` | `/workspace` | Working directory for file operations |
+ | WORKSPACE | (current directory) | Working directory for file operations |

@codecov-commenter

codecov-commenter commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.46%. Comparing base (845b798) to head (d2f5a98).
⚠️ Report is 151 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #250      +/-   ##
==========================================
+ Coverage   35.60%   43.46%   +7.85%     
==========================================
  Files          29       30       +1     
  Lines        2533     2616      +83     
==========================================
+ Hits          902     1137     +235     
+ Misses       1505     1358     -147     
+ Partials      126      121       -5     
| Flag | Coverage Δ |
|------|------------|
| unittests | 43.46% <100.00%> (+7.85%) ⬆️ |

Contributor

Copilot AI left a comment

Pull request overview

Replaces the old, high-level architecture overview with a much more comprehensive system design document for AgentCube, covering the split-plane architecture, components, CRDs, key flows, and operational behavior.

Changes:

  • Removed overview.md and introduced a new long-form architecture.md.
  • Added detailed descriptions of Router, Workload Manager, PicoD, agentd, session store, CRD relationships, and key lifecycle flows.
  • Added Mermaid diagrams (sequence + state machine) and configuration/build/dependency reference sections.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.

| File | Description |
|------|-------------|
| docs/agentcube/docs/architecture/architecture.md | New comprehensive architecture doc; several sections need alignment with current code (routes/methods, config flags vs env vars, build targets, deps, interfaces). |
| docs/agentcube/docs/architecture/overview.md | Removed legacy overview content (superseded by architecture.md). |

Comment on lines +66 to +73
**API Routes**:
```
POST /v1/namespaces/{ns}/agent-runtimes/{name}/invocations/*
POST /v1/namespaces/{ns}/code-interpreters/{name}/invocations/*
```

**Error Responses**: `400` invalid session | `429` concurrency limit | `502` sandbox unreachable | `504` timeout

Copilot AI Apr 2, 2026

Router invocation endpoints support both GET and POST (GET is used for file download); the doc currently lists only POST. Also, an invalid/nonexistent session ID is returned as a 404 (NotFound) via api.NewSessionNotFoundError, not a 400. Please update the route list and error-response codes to match the router implementation (see pkg/router/server.go and pkg/api/errors.go).

```
PicoD verifies JWT signature using public key from env
```

**JWT Claims**: `session_id`, `sandbox_id`, `exp` (expiration), `iat` (issued at)
Copilot AI Apr 2, 2026

JWT claims list mentions sandbox_id, but Router currently only adds session_id as a custom claim when signing requests (plus standard exp, iat, iss). Either document only the claims that are actually present, or update the signer/verifier to include and validate sandbox_id if it's required for the design.

Suggested change
- **JWT Claims**: `session_id`, `sandbox_id`, `exp` (expiration), `iat` (issued at)
+ **JWT Claims**: `session_id`, `exp` (expiration), `iat` (issued at), `iss` (issuer)

Member

Yes, we do not have sandbox_id.

Member Author

good, will remove

|--------|-------------|--------------|------|
| `workload-manager` | `cmd/workload-manager/main.go` | `make build` | Control plane: API server + reconcilers + GC |
| `router` | `cmd/router/main.go` | `make build-router` | Data plane: session routing + JWT + reverse proxy |
| `picod` | `cmd/picod/main.go` | `make build-picod` | In-sandbox daemon: execute + files |
Copilot AI Apr 2, 2026

make build-picod is referenced as the PicoD build target, but the repo Makefile does not define a build-picod target (only docker-build-picod). Please update the table to the correct build target/command so readers can reproduce builds reliably.

Suggested change
- | `picod` | `cmd/picod/main.go` | `make build-picod` | In-sandbox daemon: execute + files |
+ | `picod` | `cmd/picod/main.go` | `make docker-build-picod` | In-sandbox daemon: execute + files |

Member Author

I would like to remove the build targets; they do not belong in an architecture document.

| `picod` | `cmd/picod/main.go` | `make build-picod` | In-sandbox daemon: execute + files |
| `agentd` | `cmd/agentd/main.go` | `make build-agentd` | Standalone session expiry cleanup |

All binaries use `controller-runtime` for Kubernetes integration and `signal.NotifyContext()` for graceful shutdown.
Copilot AI Apr 2, 2026

Statement that all binaries use controller-runtime and signal.NotifyContext() is inaccurate: Router uses signal.NotifyContext, Workload Manager uses a manual signal channel, agentd uses ctrl.SetupSignalHandler(), and PicoD does not use controller-runtime. Please adjust this line to reflect the actual shutdown/integration patterns per binary.

Suggested change
- All binaries use `controller-runtime` for Kubernetes integration and `signal.NotifyContext()` for graceful shutdown.
+ Router uses `signal.NotifyContext()` for graceful shutdown, Workload Manager integrates with Kubernetes via `controller-runtime` but uses a manual signal channel, `agentd` uses `controller-runtime`'s `ctrl.SetupSignalHandler()`, and `picod` is a standalone binary without `controller-runtime` integration.

cmd/router → pkg/router → pkg/store, pkg/common/types, pkg/api
cmd/workload-manager → pkg/workloadmanager → pkg/store, pkg/common/types, pkg/api, pkg/apis/runtime/v1alpha1
cmd/picod → pkg/picod → (standalone, no internal deps)
cmd/agentd → pkg/agentd → pkg/apis/runtime/v1alpha1
Copilot AI Apr 2, 2026

Package dependency map shows cmd/agentd → pkg/agentd → pkg/apis/runtime/v1alpha1, but agentd actually imports agent-sandbox types and pkg/workloadmanager (for the annotation key), not the runtime v1alpha1 API. Please update this mapping to match the current imports to avoid misleading maintainers.

Suggested change
- cmd/agentd → pkg/agentd → pkg/apis/runtime/v1alpha1
+ cmd/agentd → pkg/agentd → agent-sandbox types, pkg/workloadmanager

Comment on lines +332 to +357
## Configuration Reference

### Router
| Env Var | Default | Description |
|---------|---------|-------------|
| `PORT` | `8080` | Listen port |
| `WORKLOAD_MANAGER_ADDR` | (required) | Workload Manager service address |
| `AGENTCUBE_NAMESPACE` | `default` | Namespace for operations |
| `MAX_CONCURRENT_REQUESTS` | `1000` | Concurrency limit |
| `ENABLE_TLS` | `false` | Enable TLS termination |

### Workload Manager
| Env Var | Default | Description |
|---------|---------|-------------|
| `PORT` | `8080` | Listen port |
| `RUNTIME_CLASS_NAME` | `kuasar-vmm` | Default RuntimeClassName for pods |
| `ENABLE_TLS` | `false` | Enable TLS |
| `ENABLE_AUTH` | `false` | Enable K8s auth forwarding |

### PicoD
| Env Var | Default | Description |
|---------|---------|-------------|
| `PORT` | `8080` | Listen port |
| `WORKSPACE` | `/workspace` | Working directory for file operations |
| `PICOD_AUTH_PUBLIC_KEY` | (injected) | PEM-encoded RSA public key for JWT verification |

Copilot AI Apr 2, 2026

Configuration is documented as environment variables for Router/Workload Manager/PicoD, but in the code these are primarily CLI flags (e.g., router --port, --enable-tls, --max-concurrent-requests; workload-manager --runtime-class-name, --enable-auth; picod --workspace). Also WORKSPACE env var is not used by PicoD. Please revise this section to distinguish real env vars (e.g., WORKLOAD_MANAGER_ADDR, AGENTCUBE_NAMESPACE, STORE_TYPE, REDIS_ADDR, etc.) from flags.
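
To make the flag versus environment variable distinction concrete, here is a sketch of how a binary might read the router settings named in this comment: `--port`, `--enable-tls`, and `--max-concurrent-requests` as CLI flags, and `WORKLOAD_MANAGER_ADDR` from the environment. `routerConfig` and `loadConfig` are hypothetical, and the defaults are copied from the table under review, not verified against `cmd/router/main.go`.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// routerConfig is a hypothetical struct used only for this illustration.
type routerConfig struct {
	Port                  int
	EnableTLS             bool
	MaxConcurrentRequests int
	WorkloadManagerAddr   string
}

// loadConfig parses CLI flags from args and reads the one setting that the
// review identifies as a real environment variable through getenv.
func loadConfig(args []string, getenv func(string) string) (routerConfig, error) {
	var cfg routerConfig
	fs := flag.NewFlagSet("router", flag.ContinueOnError)
	fs.IntVar(&cfg.Port, "port", 8080, "listen port")
	fs.BoolVar(&cfg.EnableTLS, "enable-tls", false, "enable TLS termination")
	fs.IntVar(&cfg.MaxConcurrentRequests, "max-concurrent-requests", 1000, "concurrency limit")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	cfg.WorkloadManagerAddr = getenv("WORKLOAD_MANAGER_ADDR")
	if cfg.WorkloadManagerAddr == "" {
		return cfg, errors.New("WORKLOAD_MANAGER_ADDR is required")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Args[1:], os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing `getenv` as a parameter keeps the loader testable without mutating the process environment; a configuration table in the document could then list flags and environment variables in separate columns.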

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from hzxuzhonghu. For more information see the Kubernetes Code Review Process.


**Binary**: `cmd/picod` | **Package**: `pkg/picod`

A lightweight HTTP daemon running inside every sandbox pod. It exposes APIs for command execution and file management.
Member

Only in CodeInterpreter with AuthModePicoD

Member Author

You mean I should mention the PicoD auth mode? I think the auth may change.

Member

Yes. If the auth will change, ignore it~

Signed-off-by: Zhonghu Xu <[email protected]>
- Implement 32 MB request body limit in picod via MaxBytesReader middleware
- Fix inaccurate signal handling claim: router uses signal.NotifyContext,
  agentd uses ctrl.SetupSignalHandler, workload-manager uses signal.Notify,
  picod has no signal handling; only agentd and workload-manager use controller-runtime
- Remove non-existent AGENTCUBE_NAMESPACE from router config table; add
  actual flags (tls-cert, tls-key, debug); clarify WORKLOAD_MANAGER_ADDR is env var
- Fix PicoD WORKSPACE default from /workspace to current directory

Signed-off-by: Zhonghu Xu <[email protected]>
Signed-off-by: Zhonghu Xu <[email protected]>
Copilot AI review requested due to automatic review settings April 7, 2026 01:56
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

5 participants