feat: Discord chat ingress (discordbot) #401
Draft
0xdiid wants to merge 18 commits into
fc0e56f to a55a343
Rebased onto api-rs-control-plane. discordbot is a direct clone of the current slackbotv2 — its own inlined session-api (no shared bridge package, to avoid coupling to slackbotv2's churn) — pointing at the api-rs session control plane, including the upstream handoff-before-ack flow, retryable SessionApiError classification, execute idempotency keys, client_message_id append dedupe, and render obligations with on-start recovery. Discord-specific pieces: createDiscordAdapter, a long-lived Gateway controller, fail-closed guild allowlist + DM-deny, native-thread naming, and a typing-indicator keepalive in place of Slack status/title. Where slackbotv2 answers 503 to request a Slack webhook retry, the Gateway has no re-delivery, so handoff failures render in-thread. Carries the prior P1 fixes (concurrency:'drop', generic api-rs error messages). 21 unit tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
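A minimal sketch of what a fail-closed guild allowlist with DM-deny can look like. The names `IncomingMessage`, `parseAllowlist`, and `shouldHandle` are hypothetical and do not come from the discordbot source; the comma-separated allowlist format is also an assumption.

```typescript
// Hypothetical message shape: guildId is null for direct messages.
interface IncomingMessage {
  guildId: string | null;
}

// Parse a comma-separated allowlist (assumed format). Missing or empty
// input yields an empty set, which denies everything.
function parseAllowlist(raw: string | undefined): ReadonlySet<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((id) => id.trim())
      .filter((id) => id.length > 0),
  );
}

function shouldHandle(
  msg: IncomingMessage,
  allowedGuilds: ReadonlySet<string>,
): boolean {
  // DM-deny: direct messages carry no guild and are always rejected.
  if (msg.guildId === null) return false;
  // Fail closed: an unconfigured (empty) allowlist matches nothing.
  return allowedGuilds.has(msg.guildId);
}
```

With this shape, an unconfigured bot is inert by construction rather than by a separate "is configured" check.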
Mirror the slackbotv2 deploy pattern for discordbot against the api-rs control
plane: Dockerfile (copies packages/ + .npmrc for the tslib hoist), Helm
discordbot.yaml (Deployment + Service, CENTAUR_API_URL -> api-rs:{apiRs.port},
replicas:1 + Recreate + 35s grace for the singleton Gateway session), a dedicated
NetworkPolicy (egress to api-rs + postgres + direct :443 for the Gateway, since
the cluster is default-deny), values + dev override (off by default, guild
allowlist required), Justfile _build-discordbot + import/ghcr wiring, and Discord
keys in the secrets bootstrap. CI matrix unchanged (slackbotv2/api-rs build locally
too).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ress The discordbot (a clone of slackbotv2) forwards every message to api-rs (CENTAUR_API_URL → centaur-centaur-api-rs:8080). Its own egress NetworkPolicy already permits →api-rs, but the api-rs ingress `from` list allowed only slackbotv2 and managed-by=api-rs sandboxes — not the discordbot. On a cluster that enforces NetworkPolicy the forward is rejected at connect time, so the bot can never reach the control plane. Add a discordbot podSelector to the api-rs ingress, gated on `.Values.discordbot.enabled` and mirroring the slackbotv2 block. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The discordbot (like slackbotv2) connects to Postgres directly for its per-thread advisory locks / session state, and its egress NetworkPolicy already permits →postgres:5432. But the postgres ingress `from` list allowed api / api-rs / slackbotv2 / slackbot / iron-control — not the discordbot — so on a NetworkPolicy-enforcing cluster the bot crash-looped at startup with `ECONNREFUSED 5432`. Add a discordbot podSelector to the postgres ingress, gated on `.Values.discordbot.enabled` and mirroring the slackbotv2 block. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… max LONG_RUNNING_MS was 365*24*60*60*1000 (31_536_000_000), passed as the Gateway listener's self-destruct durationMs. The adapter backs it with a single setTimeout, whose delay must fit in a 32-bit signed int: any value above 2^31-1 ms is clamped to 1 ms, and the only signal is a logged warning (Node/Bun logs `TimeoutOverflowWarning ... set to 1`). The "stay connected" timer therefore fired almost immediately: the bot connected, logged "duration elapsed, disconnecting", and exited via onFatalEnd, crash-looping the pod. Cap at 2_147_483_647 (~24.8 days), the largest delay setTimeout can represent. discord.js holds one session via RESUME within the window; the timer then forces at most one re-IDENTIFY per ~24.8 days, well under the 1000/24h budget. A re-arm loop is avoided: the adapter can't tell timer-expiry from a fatal login error, so looping would mask a bad token into an infinite reconnect. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
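The arithmetic behind the cap, as a small sketch. `clampTimeoutMs` and `ONE_YEAR_MS` are hypothetical names used for illustration; only `LONG_RUNNING_MS` and the two numeric values appear in the commit message.

```typescript
// Largest delay a 32-bit signed timer can represent: 2^31 - 1 ms.
const MAX_TIMEOUT_MS = 2 ** 31 - 1; // 2_147_483_647, about 24.8 days

// Hypothetical helper: keep a requested delay inside the representable range.
function clampTimeoutMs(ms: number): number {
  return Math.min(Math.max(0, Math.floor(ms)), MAX_TIMEOUT_MS);
}

// The original value: one year in milliseconds, far above the limit.
const ONE_YEAR_MS = 365 * 24 * 60 * 60 * 1000; // 31_536_000_000

// The capped value that is safe to hand to setTimeout.
const LONG_RUNNING_MS = clampTimeoutMs(ONE_YEAR_MS);
```

Clamping at the call site, rather than re-arming the timer in a loop, keeps the behavior described above: expiry still ends the process, so a bad token cannot turn into an endless reconnect.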
The chat SDK posts a bare "..." while the agent works before any streamed content. Set ChatConfig.fallbackStreamingPlaceholderText to "✨ thinking..." (overridable via DiscordbotOptions.streamingPlaceholderText). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stream The placeholder only appeared after the ~9s cold sandbox spin-up because the discordbot awaited the execute call (which blocks on sandbox+tool-server readiness) before starting the render that posts the placeholder. Do create+append only up front (fast), then run executeSession INSIDE the render stream — after yielding the placeholder. The user now sees "✨ thinking..." in ~0.3s while the sandbox spins up. executeSession is idempotent (idempotency_key = message id), so a render retry won't re-spawn; sandbox-spawn failures surface as an error in the same message (api-rs writes no event if the spawn itself fails, so this avoids a hung placeholder). The activeExecution guard is still set synchronously under the per-thread lock before execute, so double-execution protection is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
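A rough sketch of the reordering, assuming the render loop consumes an async iterable of text chunks. `renderWithPlaceholder` and `ExecuteFn` are hypothetical names, and the error text is illustrative; the real render stream and executeSession signatures are not shown in this PR text.

```typescript
// Hypothetical: execute resolves to the stream of visible chunks once the
// sandbox is ready (the slow part).
type ExecuteFn = () => Promise<AsyncIterable<string>>;

async function* renderWithPlaceholder(
  placeholder: string,
  execute: ExecuteFn,
): AsyncGenerator<string> {
  // Yield the placeholder before doing any slow work, so the consumer can
  // post it immediately.
  yield placeholder;
  try {
    // The blocking call now happens inside the stream, after the yield.
    const chunks = await execute();
    for await (const chunk of chunks) {
      yield chunk;
    }
  } catch (err) {
    // A spawn failure surfaces in the same message instead of leaving the
    // placeholder hanging.
    const reason = err instanceof Error ? err.message : String(err);
    yield `failed to start: ${reason}`;
  }
}
```

Because the generator is lazy, `execute` is not called until the consumer asks for the chunk after the placeholder, which is what moves the sandbox wait behind the first visible output.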
Sync pass against slackbotv2 @ e391f11 (3-way merge per file):
- Port paradigmxyz#416/paradigmxyz#418: discordSafeChatSdkStream omits task_update output and truncates details to 500 chars before streaming.
- Port the thread_not_found tolerance in collectInitialContext with a Discord-shaped guard (NetworkError carrying "Unknown Channel"/10003).
Deliberate deltas kept (commented in-code): the synthetic starting notification and no streamAfterFirstChunk deferral. Both serve the instant "✨ thinking..." placeholder, where slackbotv2 instead posts nothing until the first visible chunk (paradigmxyz#406/paradigmxyz#415).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a55a343 to 79dd0e9
api-rs sandboxes had no tools and no overlay. Give api-rs-spawned agents the same base + overlay tools and overlay system-prompt the chart already wires for the api-rs pod, using upstream's CLI-shim tool model rather than a sidecar.

Upstream direction: tools are shell CLI shims, not an HTTP registry. The agent image's install-tool-shims (services/sandbox/install_tool_shims.py) scans TOOL_DIRS at entrypoint and `uvx`-installs each pyproject [project.scripts] as a CLI; the SYSTEM_PROMPT points agents at those CLIs and `centaur-tools list`. The old `call <tool>` HTTP registry is deprecated to control-plane-only.

Tool secrets are already handled upstream: codex_app_server_env_template pushes the tool placeholder creds onto the agent env, iron-control grants the per-sandbox principal the real secrets, and Postgres rides proxied `*_DSN` env from apply_proxy_env. So the agent needs only the tool SOURCES at the right paths: no sidecar, no HMAC sandbox token, no loopback tool server.

- tools.rs (replaces tool_server.rs): a `tools-bootstrap` init container copies /app/tools out of the shared centaur-api image into an emptyDir mounted at /app/tools in the agent, and an `overlay-bootstrap` init container copies the org overlay tree into overlay-root mounted at overlay.mountPath (the same path the api-rs Deployment uses) and stages the overlay's SYSTEM_PROMPT.md as $HOME/AGENTS_OVERLAY.md, which the sandbox entrypoint appends to the base prompt. TOOL_DIRS is set on the agent env to /app/tools (or /app/tools:<mountPath>/tools with the overlay), identical to the value the api-rs pod computes for its own tool discovery, and set deterministically in the spec builder rather than via passthrough env.
- lib.rs: build_agent_sandbox layers the tools/overlay env over spec.env, mounts the bootstrapped sources read-only into the agent, and appends the tools-bootstrap + overlay-bootstrap init containers and their volumes. No sidecar container, no token minting.
- args.rs: a minimal ToolsArgs (source image/pull-policy, reusing the KUBERNETES_TOOL_SERVER_IMAGE* env the chart sets from the shared api image) and OverlayArgs (image/pull-policy/source-path/mount-path) wired into AgentSandboxConfig. Explicit clap arg ids avoid id collisions with the other flattened arg structs.
- chart apirs.yaml: render the tools source image (api.image.*, gated on toolServer.enabled) and overlay (overlay.*) onto the api-rs env, replacing the KUBERNETES_TOOL_SERVER_* sidecar block.

Gone vs the sidecar port: tool_server.rs, the sbx1 HMAC token minting and its SANDBOX_SIGNING_KEY requirement, CENTAUR_TOOLS_URL, the sidecar pg-DSN/proxy-env collection, and the hmac/base64/sha2 dependency additions (nothing else in the agent-k8s crate uses them).

Warm-pool sandboxes route through the same build_agent_sandbox path, so they get the tools/overlay init containers and volumes for free.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eded The .npmrc public-hoist-pattern (and discordbot's direct tslib dep) worked around Bun failing to resolve tslib from discord.js under pnpm's strict layout. On the current lockfile discord.js@14.26.4 declares tslib properly, so pnpm nests a copy right next to it and Bun resolves it without help. Verified both locally and in the production image (build + import discord.js in the container) with the hoist and the direct dep removed. Removes the only workspace-wide file the discordbot work touched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…igmxyz#422 port) Sync pass against slackbotv2 @ 17882c4: thread executionId through ForwardSessionInput into the events URL so streams only carry the turn they're rendering (set in-stream after execute returns, and from the stored obligation on recovery — the obligation already tracked it). Also adopt the fixed-point truncation helper. Not ported (deliberate delta): the oversized-render plain-text fallback — the Discord adapter hard-truncates content at the 2000-char limit on every outgoing payload, so Slack's msg_too_long failure mode cannot occur here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aradigmxyz#404 port) Own the pg pool so an error handler can swallow idle-client drops (Postgres restart / startup network races) instead of crashing the process, and block obligation recovery on a backoff-retried first connect — same rationale as upstream, Discord trace names. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
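One way to picture the "backoff-retried first connect" is the sketch below. It assumes exponential backoff with a cap, which the commit message does not specify, and `connectWithBackoff` plus its options are hypothetical names. The pool's idle-client error handler is not shown here.

```typescript
interface BackoffOptions {
  baseMs: number; // delay after the first failure
  maxMs: number; // upper bound on any single delay
  maxAttempts: number;
  // Injectable for tests; defaults to a real timer.
  sleep?: (ms: number) => Promise<void>;
}

// Retry `connect` until it succeeds or attempts run out. Callers can await
// this before starting obligation recovery.
async function connectWithBackoff<T>(
  connect: () => Promise<T>,
  opts: BackoffOptions,
): Promise<T> {
  const sleep =
    opts.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      // Assumed policy: double the delay each time, capped at maxMs.
      if (attempt < opts.maxAttempts - 1) {
        await sleep(Math.min(opts.maxMs, opts.baseMs * 2 ** attempt));
      }
    }
  }
  throw lastError;
}
```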
Sync with slackbotv2 @ 02db3eb: command-execution tasks omit their details from the stream entirely instead of truncating them. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sync with slackbotv2 @ f9fcb5e: messages asking for plain text ("plain text only", "no interactive blocks", "no dashboards") drain the stream silently and post one final text message — captured terminal result text first, accumulated markdown as fallback. Discord-flavored: typing keepalive instead of assistant status, and the final text pre-truncates to fit Discord's 2000-char content cap with an honest suffix. Brings in the render collector class this needed (the msg_too_long fallback path it was built for remains unported — the Discord adapter hard-truncates, so that failure mode can't occur). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
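The plain-text path can be sketched roughly as follows. The three trigger phrases and the 2000-char cap are from the commit message; the function names, the case-insensitive match, and the suffix wording are assumptions made for illustration.

```typescript
const DISCORD_CONTENT_LIMIT = 2000;
const PLAIN_TEXT_PHRASES = [
  "plain text only",
  "no interactive blocks",
  "no dashboards",
];

// Assumed matching rule: case-insensitive substring search.
function wantsPlainText(message: string): boolean {
  const lower = message.toLowerCase();
  return PLAIN_TEXT_PHRASES.some((phrase) => lower.includes(phrase));
}

// Prefer the captured terminal result; fall back to accumulated markdown.
function pickFinalText(terminalResult: string | undefined, markdown: string): string {
  return terminalResult && terminalResult.trim().length > 0
    ? terminalResult
    : markdown;
}

// Truncate so that text plus suffix fits the cap. The suffix is hypothetical.
function truncateForDiscord(text: string, suffix = "\n[truncated]"): string {
  if (text.length <= DISCORD_CONTENT_LIMIT) return text;
  let head = text.slice(0, DISCORD_CONTENT_LIMIT - suffix.length);
  // Avoid ending on half of a surrogate pair.
  const last = head.charCodeAt(head.length - 1);
  if (last >= 0xd800 && last <= 0xdbff) head = head.slice(0, -1);
  return head + suffix;
}
```

Truncating before sending, with a visible suffix, means the reader can tell the message was cut rather than relying on the adapter's hard truncation.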
79dd0e9 to a23bb65
…gent A thread created from a message keeps that message in the parent channel (the thread shares its ID), so the thread-history walk in collectInitialContext never sees it — Slack has no analog because conversations.replies returns the parent as the first reply. Fetch the starter from the parent channel (mirroring discord.js ThreadChannel#fetchStarterMessage) and prepend it to the initial context. Webhook-style messages (Sentry alerts) carry their payload entirely in embeds with empty content, which the chat adapter drops; flatten embed text into the forwarded message so the agent can read them. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
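A minimal sketch of embed flattening, assuming only the commonly used Discord embed fields (title, description, fields, footer). `EmbedLike`, `flattenEmbed`, and `messageText` are hypothetical names, and the real implementation may include other fields or a different layout.

```typescript
// Subset of the Discord embed object used for this illustration.
interface EmbedLike {
  title?: string;
  description?: string;
  fields?: { name: string; value: string }[];
  footer?: { text: string };
}

// Turn one embed into readable lines of text.
function flattenEmbed(embed: EmbedLike): string {
  const parts: string[] = [];
  if (embed.title) parts.push(embed.title);
  if (embed.description) parts.push(embed.description);
  for (const field of embed.fields ?? []) {
    parts.push(`${field.name}: ${field.value}`);
  }
  if (embed.footer?.text) parts.push(embed.footer.text);
  return parts.join("\n");
}

// Combine message content with flattened embeds, skipping empty pieces so a
// webhook message with empty content still yields its embed text.
function messageText(content: string, embeds: EmbedLike[]): string {
  return [content, ...embeds.map(flattenEmbed)]
    .filter((piece) => piece.trim().length > 0)
    .join("\n\n");
}
```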
The tools-bootstrap init container mounted the tools emptyDir at /app/tools — the same path it copies FROM. The mount shadows the source image's tools tree, so the script self-copies the empty volume and GNU cp rejects it (exit 1); every sandbox dies with 'reached terminal state before running' and no agent ever starts. Mount the volume at /tools-bootstrap instead (mirroring how overlay-bootstrap stages to a distinct target) and copy the image's /app/tools into it. The agent container keeps mounting the same volume at /app/tools, so TOOL_DIRS and the shim installer are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sage Discord's post+edit streaming dropped every task chunk, so runs showed a bare placeholder that was later overwritten by the answer — no chain of thought, and the answer sorted above messages sent mid-run. Runs now post a progress message that's edited in place with a step timeline (reasoning excerpts, commands, tool calls) and finalized as a permanent record, while the answer streams into a separate message created on first visible text. Drops the now-subsumed task-chunk stripping (the renderer truncates its own previews instead). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The edited step timeline felt heavy-handed in practice. Replace it with fully append-only narration: the triggering message gets an instant 👀 reaction (flipped to ✅/❌ on settle), and the agent's reasoning posts as its own italic messages as each thought completes — commands and tools render nothing. Reactions go through raw Discord REST since the adapter can't reach thread-starter messages, which live in the parent channel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
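For the raw REST reactions, a sketch of how the request could be assembled. The route is Discord's documented "own reaction" endpoint; `ownReactionUrl` and `reactionRequest` are hypothetical helper names, and API version v10 is an assumption. For a thread-starter message, the channel ID passed in would be the parent channel's, since that is where the message lives.

```typescript
const DISCORD_API = "https://discord.com/api/v10";

// Build the URL for the bot's own reaction on a message. Unicode emoji must
// be URL-encoded in the path.
function ownReactionUrl(channelId: string, messageId: string, emoji: string): string {
  return `${DISCORD_API}/channels/${channelId}/messages/${messageId}/reactions/${encodeURIComponent(emoji)}/@me`;
}

interface ReactionRequest {
  method: "PUT" | "DELETE";
  url: string;
  headers: Record<string, string>;
}

// Describe the request without sending it: PUT adds the reaction, DELETE
// removes it. Bot tokens use the "Bot " authorization prefix.
function reactionRequest(
  action: "add" | "remove",
  token: string,
  channelId: string,
  messageId: string,
  emoji: string,
): ReactionRequest {
  return {
    method: action === "add" ? "PUT" : "DELETE",
    url: ownReactionUrl(channelId, messageId, emoji),
    headers: { Authorization: `Bot ${token}` },
  };
}
```

Flipping 👀 to ✅ or ❌ on settle would then be one "remove" request for the old emoji followed by one "add" request for the new one, each passed to an HTTP client such as `fetch`.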
Adds Discord as a chat ingress for centaur. It's deliberately a close clone of slackbotv2: same session lifecycle against the api-rs control plane, same streaming/rendering loop, and same stream-safety behavior (bounded task payloads, tolerant context collection, execution-scoped event streams, Postgres resilience, plain-text render requests). The edges are Discord-shaped: a persistent Gateway connection instead of webhooks, a guild allowlist (the bot is inert until one is configured), auto-threading off the triggering message, an instant placeholder message while the sandbox spins up (where slackbotv2 instead waits for the first visible output), and thread-starter context. Discord keeps the message a thread was started from in the parent channel rather than in the thread itself, and webhook-style messages (Sentry alerts and the like) carry their payload in embeds, so both get fetched/flattened into the agent's context where Slack hands them over for free. The deliberate divergences are commented in-code at each site.
Non-Discord changes
🤖 Generated with Claude Code