
feat: Discord chat ingress (discordbot) #401

Draft
0xdiid wants to merge 18 commits into paradigmxyz:api-rs-control-plane from 0xSplits:feat/discord-chat-ingress


0xdiid commented Jun 4, 2026

Adds Discord as a chat ingress for centaur. It is deliberately a close clone of slackbotv2: same session lifecycle against the api-rs control plane, same streaming/rendering loop, and same stream-safety behavior (bounded task payloads, tolerant context collection, execution-scoped event streams, Postgres resilience, plain-text render requests).

The Discord-shaped edges are:

  • A persistent Gateway connection instead of webhooks.
  • A guild allowlist (the bot is inert until one is configured).
  • Auto-threading off the triggering message.
  • An instant placeholder message while the sandbox spins up (slackbotv2 instead waits for the first visible output).
  • Thread-starter context. Discord keeps the message a thread was started from in the parent channel rather than in the thread itself, and webhook-style messages (Sentry alerts and the like) carry their payload in embeds, so both are fetched and flattened into the agent's context, where Slack hands them over for free.

The deliberate divergences are commented in-code at each site.

Non-Discord changes

  • api-rs sandboxes get tools + an overlay, via the CLI-shim model. Agents spawned by the Rust control plane get no deployment tools and no overlay without this — it affects every ingress riding api-rs, not just Discord. Rather than duplicating the Python control plane's tool-server sidecar, this finishes the last mile of the shim architecture the control plane already carries: two init containers copy the tool sources and overlay tree into the agent container, and the agent env gets the matching tool-directory paths so the existing shim installer turns them into CLIs at boot. Secrets were already handled — placeholder env vars with credential injection at the egress proxy, database access through proxied DSNs — so nothing real ever enters the sandbox, and warm-pool sandboxes get the same wiring through the shared spawn path. Everything is opt-in and additive (the config defaults to off, and the chart wires it through the existing toolServer/overlay values), so unset flags reproduce upstream behavior exactly. No new dependencies; needs a live re-verification on a cluster.
  • A network policy carve-out for direct HTTPS egress from the bot, because the Discord Gateway is an outbound websocket that can't go through the proxy. Scoped to the bot's pod only (health-probe ingress, control-plane + database + 443 egress), but worth a security look.
  • The bot duplicates a lot of slackbotv2 (session client, streaming, rendering) rather than sharing a package. We've been keeping them in sync by hand; factoring out the common core is probably the right long-term move.

🤖 Generated with Claude Code

0xdiid force-pushed the feat/discord-chat-ingress branch 4 times, most recently from fc0e56f to a55a343, June 5, 2026 17:48
0xdiid and others added 8 commits June 5, 2026 13:46
Rebased onto api-rs-control-plane. discordbot is a direct clone of the current
slackbotv2 — its own inlined session-api (no shared bridge package, to avoid
coupling to slackbotv2's churn) — pointing at the api-rs session control plane,
including the upstream handoff-before-ack flow, retryable SessionApiError
classification, execute idempotency keys, client_message_id append dedupe, and
render obligations with on-start recovery. Discord-specific pieces:
createDiscordAdapter, a long-lived Gateway controller, fail-closed guild
allowlist + DM-deny, native-thread naming, and a typing-indicator keepalive in
place of Slack status/title. Where slackbotv2 answers 503 to request a Slack
webhook retry, the Gateway has no re-delivery, so handoff failures render
in-thread. Carries the prior P1 fixes (concurrency:'drop', generic api-rs error
messages). 21 unit tests pass.
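The fail-closed guild allowlist and DM-deny rule named above could be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `IncomingMessage` and `shouldHandle` are hypothetical names, and the real adapter may represent guilds differently.

```typescript
// Hypothetical shape of an inbound message; only the field needed here.
interface IncomingMessage {
  guildId: string | null; // null for direct messages
}

// Fail closed: an empty allowlist admits nothing, and DMs are always denied.
function shouldHandle(
  msg: IncomingMessage,
  allowedGuilds: ReadonlySet<string>,
): boolean {
  if (msg.guildId === null) return false; // DM-deny
  return allowedGuilds.has(msg.guildId); // an empty set never matches
}
```

With an empty set every message is rejected, which matches the "inert until configured" behavior described in the PR body.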

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Mirror the slackbotv2 deploy pattern for discordbot against the api-rs control
plane: Dockerfile (copies packages/ + .npmrc for the tslib hoist), Helm
discordbot.yaml (Deployment + Service, CENTAUR_API_URL -> api-rs:{apiRs.port},
replicas:1 + Recreate + 35s grace for the singleton Gateway session), a dedicated
NetworkPolicy (egress to api-rs + postgres + direct :443 for the Gateway, since
the cluster is default-deny), values + dev override (off by default, guild
allowlist required), Justfile _build-discordbot + import/ghcr wiring, and Discord
keys in the secrets bootstrap. CI matrix unchanged (slackbotv2/api-rs build locally
too).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ress

The discordbot (a clone of slackbotv2) forwards every message to api-rs
(CENTAUR_API_URL → centaur-centaur-api-rs:8080). Its own egress
NetworkPolicy already permits →api-rs, but the api-rs ingress `from`
list allowed only slackbotv2 and managed-by=api-rs sandboxes — not the
discordbot. On a cluster that enforces NetworkPolicy the forward is
rejected at connect time, so the bot can never reach the control plane.

Add a discordbot podSelector to the api-rs ingress, gated on
`.Values.discordbot.enabled` and mirroring the slackbotv2 block.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The discordbot (like slackbotv2) connects to Postgres directly for its
per-thread advisory locks / session state, and its egress NetworkPolicy
already permits →postgres:5432. But the postgres ingress `from` list
allowed api / api-rs / slackbotv2 / slackbot / iron-control — not the
discordbot — so on a NetworkPolicy-enforcing cluster the bot crash-looped
at startup with `ECONNREFUSED 5432`.

Add a discordbot podSelector to the postgres ingress, gated on
`.Values.discordbot.enabled` and mirroring the slackbotv2 block.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… max

LONG_RUNNING_MS was 365*24*60*60*1000 (31_536_000_000), passed as the
Gateway listener's self-destruct durationMs. The adapter backs it with a
single setTimeout, whose delay must fit in a 32-bit signed int: any value
above 2^31-1 ms is reset to 1 ms (Node/Bun logs `TimeoutOverflowWarning
... set to 1`). The "stay connected" timer therefore fired almost
immediately: the bot connected, logged "duration elapsed, disconnecting",
and exited via onFatalEnd, crash-looping the pod.

Cap at 2_147_483_647 (~24.8 days), the largest delay setTimeout can
represent. discord.js holds one session via RESUME within the window;
the timer then forces at most one re-IDENTIFY per ~24.8 days, well under
the 1000/24h budget. A re-arm loop is avoided: the adapter can't tell
timer expiry from a fatal login error, so looping would turn a bad token
into an infinite reconnect loop.
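The arithmetic behind the cap can be checked with a small sketch. `MAX_TIMEOUT_MS`, `ONE_YEAR_MS`, and `clampTimerDelay` are hypothetical names used only for illustration; only `LONG_RUNNING_MS` comes from the commit message.

```typescript
// Largest delay a single setTimeout can represent: 2^31 - 1 ms (~24.8 days).
const MAX_TIMEOUT_MS = 2 ** 31 - 1;

// The previous value from the commit message: one year in milliseconds.
const ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000;

// Clamp a requested duration into the range setTimeout handles safely.
function clampTimerDelay(requestedMs: number): number {
  return Math.min(Math.max(0, requestedMs), MAX_TIMEOUT_MS);
}

// One year overflows the limit, so the clamp lands on the cap.
const LONG_RUNNING_MS = clampTimerDelay(ONE_YEAR_MS);
```

Clamping keeps shorter durations unchanged, so a caller that asks for a few seconds still gets a few seconds.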

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The chat SDK posts a bare "..." while the agent works before any streamed
content. Set ChatConfig.fallbackStreamingPlaceholderText to "✨ thinking..."
(overridable via DiscordbotOptions.streamingPlaceholderText).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stream

The placeholder only appeared after the ~9s cold sandbox spin-up because the
discordbot awaited the execute call (which blocks on sandbox+tool-server
readiness) before starting the render that posts the placeholder.

Do create+append only up front (fast), then run executeSession INSIDE the
render stream — after yielding the placeholder. The user now sees
"✨ thinking..." in ~0.3s while the sandbox spins up. executeSession is
idempotent (idempotency_key = message id), so a render retry won't re-spawn;
sandbox-spawn failures surface as an error in the same message (api-rs writes
no event if the spawn itself fails, so this avoids a hung placeholder). The
activeExecution guard is still set synchronously under the per-thread lock
before execute, so double-execution protection is unchanged.
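The reordering described above (placeholder first, execute inside the stream) might look roughly like this. It is a sketch under assumptions: `ExecuteSession` and `renderStream` are hypothetical stand-ins, and the real execute call returns more than a plain stream of strings.

```typescript
// Hypothetical signature standing in for the real execute call.
type ExecuteSession = (idempotencyKey: string) => Promise<AsyncIterable<string>>;

// Yield the placeholder first, then start the slow execute call inside the stream.
async function* renderStream(
  messageId: string,
  executeSession: ExecuteSession,
  placeholder: string = "✨ thinking...",
): AsyncGenerator<string> {
  yield placeholder; // visible immediately, before the sandbox exists
  try {
    // The message id doubles as the idempotency key, so a render retry does not re-spawn.
    const events = await executeSession(messageId);
    for await (const chunk of events) yield chunk;
  } catch (err) {
    // Surface spawn failures in the same message instead of leaving a hung placeholder.
    yield `error: ${err instanceof Error ? err.message : String(err)}`;
  }
}
```

Because the generator yields before it awaits, the consumer can post the placeholder without waiting for sandbox readiness.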

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Sync pass against slackbotv2 @ e391f11 (3-way merge per file):
- Port paradigmxyz#416/paradigmxyz#418: discordSafeChatSdkStream omits task_update output and
  truncates details to 500 chars before streaming.
- Port the thread_not_found tolerance in collectInitialContext with a
  Discord-shaped guard (NetworkError carrying "Unknown Channel"/10003).
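The task_update handling in the first item can be illustrated with a short sketch. The chunk shapes and the `toSafeChunk` name are assumptions for illustration; only the `task_update` type, the `details` and `output` fields, and the 500-character limit come from the commit message.

```typescript
// Hypothetical chunk shapes; the real stream types are not shown in this PR text.
interface TaskUpdateChunk {
  type: "task_update";
  details?: string;
  output?: string;
}
interface TextChunk {
  type: "text";
  text: string;
}
type Chunk = TaskUpdateChunk | TextChunk;

const MAX_DETAILS_CHARS = 500;

// Drop task output entirely and bound the details before streaming.
function toSafeChunk(chunk: Chunk): Chunk {
  if (chunk.type !== "task_update") return chunk;
  const safe: TaskUpdateChunk = { type: "task_update" };
  if (chunk.details !== undefined) {
    safe.details = chunk.details.slice(0, MAX_DETAILS_CHARS);
  }
  return safe; // output is intentionally not copied
}
```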

Deliberate deltas kept (commented in-code): the synthetic starting
notification and no streamAfterFirstChunk deferral — both serve the
instant "✨ thinking..." placeholder, where slackbotv2 instead posts
nothing until the first visible chunk (paradigmxyz#406/paradigmxyz#415).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@0xdiid 0xdiid force-pushed the feat/discord-chat-ingress branch from a55a343 to 79dd0e9 Compare June 5, 2026 19:48
0xdiid and others added 6 commits June 5, 2026 20:22
api-rs sandboxes had no tools and no overlay. Give api-rs-spawned agents the
same base + overlay tools and overlay system-prompt the chart already wires for
the api-rs pod, using upstream's CLI-shim tool model rather than a sidecar.

Upstream direction: tools are shell CLI shims, not an HTTP registry. The agent
image's install-tool-shims (services/sandbox/install_tool_shims.py) scans
TOOL_DIRS at entrypoint and `uvx`-installs each pyproject [project.scripts] as a
CLI; the SYSTEM_PROMPT points agents at those CLIs and `centaur-tools list`. The
old `call <tool>` HTTP registry is deprecated to control-plane-only. Tool
secrets are already handled upstream: codex_app_server_env_template pushes the
tool placeholder creds onto the agent env, iron-control grants the per-sandbox
principal the real secrets, and Postgres rides proxied `*_DSN` env from
apply_proxy_env. So the agent needs only the tool SOURCES at the right paths —
no sidecar, no HMAC sandbox token, no loopback tool server.

- tools.rs (replaces tool_server.rs): a `tools-bootstrap` init container copies
  /app/tools out of the shared centaur-api image into an emptyDir mounted at
  /app/tools in the agent, and an `overlay-bootstrap` init container copies the
  org overlay tree into overlay-root mounted at overlay.mountPath (the same path
  the api-rs Deployment uses) and stages the overlay's SYSTEM_PROMPT.md as
  $HOME/AGENTS_OVERLAY.md, which the sandbox entrypoint appends to the base
  prompt. TOOL_DIRS is set on the agent env to /app/tools (or
  /app/tools:<mountPath>/tools with the overlay) — identical to the value the
  api-rs pod computes for its own tool discovery, set deterministically in the
  spec builder rather than via passthrough env.
- lib.rs: build_agent_sandbox layers the tools/overlay env over spec.env, mounts
  the bootstrapped sources read-only into the agent, and appends the
  tools-bootstrap + overlay-bootstrap init containers and their volumes. No
  sidecar container, no token minting.
- args.rs: a minimal ToolsArgs (source image/pull-policy, reusing the
  KUBERNETES_TOOL_SERVER_IMAGE* env the chart sets from the shared api image) and
  OverlayArgs (image/pull-policy/source-path/mount-path) wired into
  AgentSandboxConfig. Explicit clap arg ids avoid id collisions with the other
  flattened arg structs.
- chart apirs.yaml: render the tools source image (api.image.*, gated on
  toolServer.enabled) and overlay (overlay.*) onto the api-rs env, replacing the
  KUBERNETES_TOOL_SERVER_* sidecar block.

Gone vs the sidecar port: tool_server.rs, the sbx1 HMAC token minting and its
SANDBOX_SIGNING_KEY requirement, CENTAUR_TOOLS_URL, the sidecar pg-DSN/proxy-env
collection, and the hmac/base64/sha2 dependency additions (nothing else in the
agent-k8s crate uses them).

Warm-pool sandboxes route through the same build_agent_sandbox path, so they get
the tools/overlay init containers and volumes for free.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eded

The .npmrc public-hoist-pattern (and discordbot's direct tslib dep) worked
around Bun failing to resolve tslib from discord.js under pnpm's strict
layout. On the current lockfile discord.js@14.26.4 declares tslib properly,
so pnpm nests a copy right next to it and Bun resolves it without help.
Verified both locally and in the production image (build + import discord.js
in the container) with the hoist and the direct dep removed.

Removes the only workspace-wide file the discordbot work touched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…igmxyz#422 port)

Sync pass against slackbotv2 @ 17882c4: thread executionId through
ForwardSessionInput into the events URL so streams only carry the turn
they're rendering (set in-stream after execute returns, and from the
stored obligation on recovery — the obligation already tracked it).
Also adopt the fixed-point truncation helper.
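Scoping the event stream to one execution could be sketched as below. The URL path, the `execution_id` query parameter, and the `eventsUrl` helper are all hypothetical; the actual api-rs route is not given in this PR text.

```typescript
// Build an events URL, optionally narrowed to a single execution.
// The path and query parameter names are assumptions for illustration.
function eventsUrl(baseUrl: string, sessionId: string, executionId?: string): string {
  const url = new URL(`/sessions/${encodeURIComponent(sessionId)}/events`, baseUrl);
  if (executionId !== undefined) {
    url.searchParams.set("execution_id", executionId);
  }
  return url.toString();
}
```

On recovery, the stored obligation's execution id would be passed in the same way, so a resumed render sees only the turn it owns.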

Not ported (deliberate delta): the oversized-render plain-text fallback —
the Discord adapter hard-truncates content at the 2000-char limit on every
outgoing payload, so Slack's msg_too_long failure mode cannot occur here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aradigmxyz#404 port)

Own the pg pool so an error handler can swallow idle-client drops (Postgres
restart / startup network races) instead of crashing the process, and block
obligation recovery on a backoff-retried first connect — same rationale as
upstream, Discord trace names.
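A backoff-retried first connect might be structured like the sketch below. `retryWithBackoff` and its options are hypothetical, and the pool's error handler is not shown; the `attempt` callback stands in for whatever first-connect check the bot performs.

```typescript
// Retry an async operation with capped exponential backoff.
async function retryWithBackoff<T>(
  attempt: () => Promise<T>,
  opts: { maxAttempts: number; baseMs: number; maxMs: number },
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms);
    }),
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < opts.maxAttempts; i++) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err;
      // Wait before the next try, doubling each time up to maxMs.
      if (i < opts.maxAttempts - 1) {
        await sleep(Math.min(opts.baseMs * 2 ** i, opts.maxMs));
      }
    }
  }
  throw lastError;
}
```

Obligation recovery would then be started only after this promise resolves, so it never runs against a database that is still coming up.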

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Sync with slackbotv2 @ 02db3eb: command-execution tasks omit their
details from the stream entirely instead of truncating them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Sync with slackbotv2 @ f9fcb5e: messages asking for plain text ("plain
text only", "no interactive blocks", "no dashboards") drain the stream
silently and post one final text message — captured terminal result text
first, accumulated markdown as fallback. Discord-flavored: typing
keepalive instead of assistant status, and the final text pre-truncates
to fit Discord's 2000-char content cap with an honest suffix.
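Pre-truncating with an explicit suffix could be sketched as follows. The helper name and the suffix wording are hypothetical, and the sketch measures length in UTF-16 code units, which is a simplification of how Discord counts characters.

```typescript
const DISCORD_CONTENT_LIMIT = 2000;

// Cut text to fit the limit, reserving room for a suffix that says so.
function truncateForDiscord(
  text: string,
  suffix: string = "\n... (truncated)",
  limit: number = DISCORD_CONTENT_LIMIT,
): string {
  if (text.length <= limit) return text;
  return text.slice(0, limit - suffix.length) + suffix;
}
```

Reserving space for the suffix before slicing keeps the result at or under the limit, so the adapter's own hard truncation never removes the notice.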

Brings in the render collector class this needed (the msg_too_long
fallback path it was built for remains unported — the Discord adapter
hard-truncates, so that failure mode can't occur).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@0xdiid 0xdiid force-pushed the feat/discord-chat-ingress branch from 79dd0e9 to a23bb65 Compare June 6, 2026 02:25
0xdiid and others added 4 commits June 5, 2026 21:52
…gent

A thread created from a message keeps that message in the parent channel
(the thread shares its ID), so the thread-history walk in
collectInitialContext never sees it — Slack has no analog because
conversations.replies returns the parent as the first reply. Fetch the
starter from the parent channel (mirroring discord.js
ThreadChannel#fetchStarterMessage) and prepend it to the initial context.

Webhook-style messages (Sentry alerts) carry their payload entirely in
embeds with empty content, which the chat adapter drops; flatten embed
text into the forwarded message so the agent can read them.
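Flattening embed text might look like the sketch below. The `Embed` interface is a subset of Discord's embed object (title, description, fields, footer); the `flattenEmbeds` name, the field ordering, and the `name: value` formatting are assumptions for illustration.

```typescript
// Subset of Discord's embed object; only the text-bearing fields used here.
interface Embed {
  title?: string;
  description?: string;
  fields?: { name: string; value: string }[];
  footer?: { text: string };
}

// Join message content and embed text into one plain-text block.
function flattenEmbeds(content: string, embeds: Embed[]): string {
  const parts: string[] = [];
  if (content.trim() !== "") parts.push(content);
  for (const embed of embeds) {
    if (embed.title) parts.push(embed.title);
    if (embed.description) parts.push(embed.description);
    for (const field of embed.fields ?? []) {
      parts.push(`${field.name}: ${field.value}`);
    }
    if (embed.footer?.text) parts.push(embed.footer.text);
  }
  return parts.join("\n");
}
```

A webhook alert with empty content then still produces readable text for the agent instead of an empty message.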

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The tools-bootstrap init container mounted the tools emptyDir at
/app/tools — the same path it copies FROM. The mount shadows the source
image's tools tree, so the script self-copies the empty volume and GNU
cp rejects it (exit 1); every sandbox dies with 'reached terminal state
before running' and no agent ever starts.

Mount the volume at /tools-bootstrap instead (mirroring how
overlay-bootstrap stages to a distinct target) and copy the image's
/app/tools into it. The agent container keeps mounting the same volume
at /app/tools, so TOOL_DIRS and the shim installer are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sage

Discord's post+edit streaming dropped every task chunk, so runs showed a
bare placeholder that was later overwritten by the answer — no chain of
thought, and the answer sorted above messages sent mid-run. Runs now post
a progress message that's edited in place with a step timeline (reasoning
excerpts, commands, tool calls) and finalized as a permanent record, while
the answer streams into a separate message created on first visible text.
Drops the now-subsumed task-chunk stripping (the renderer truncates its
own previews instead).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The edited step timeline felt heavy-handed in practice. Replace it with
fully append-only narration: the triggering message gets an instant 👀
reaction (flipped to ✅/❌ on settle), and the agent's reasoning posts as
its own italic messages as each thought completes — commands and tools
render nothing. Reactions go through raw Discord REST since the adapter
can't reach thread-starter messages, which live in the parent channel.
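Addressing the reaction over raw REST could be sketched as below. The helper name is hypothetical; the route is Discord's own-reaction endpoint, where a PUT adds the bot's reaction and a DELETE removes it, and the emoji must be URL-encoded.

```typescript
const DISCORD_API = "https://discord.com/api/v10";

// URL of the bot's own reaction on a message.
// For a thread starter, channelId is the parent channel, not the thread.
function ownReactionUrl(channelId: string, messageId: string, emoji: string): string {
  return (
    `${DISCORD_API}/channels/${channelId}/messages/${messageId}` +
    `/reactions/${encodeURIComponent(emoji)}/@me`
  );
}
```

Flipping 👀 to ✅ or ❌ on settle would then be a DELETE on the first URL followed by a PUT on the second, each sent with the bot token in the `Authorization` header.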

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>