feat(seer): Shard night shift triage into per-chunk feature runs #118080
A night shift run dispatched all scored candidates to a single Seer feature run. Large candidate sets degrade a single triage agent (limited time and context), so split candidates into chunks of seer.night_shift.shard_size (default 5) and dispatch each chunk as its own feature run / SeerRun. Each shard is recorded as a new SeerNightShiftRunShard, making the run -> SeerRun relationship one-to-many. Result delivery resolves the run via a shard's SeerRun uuid, falling back to the legacy scalar seer_run FK for runs created before sharding. The legacy FK still points at the first shard during the transition; it is backfilled and dropped in follow-up PRs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The shard model is generic to any night shift workflow, not just triage. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
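As a rough illustration of the chunking described above, the sketch below splits a scored candidate list into shards. `shard_candidates` is a hypothetical helper, not the PR's code, and it assumes `shard_size` is at least 1.

```python
def shard_candidates(candidates: list[str], shard_size: int = 5) -> list[list[str]]:
    """Split candidates into consecutive shards of at most shard_size items.

    Hypothetical helper for illustration; assumes shard_size >= 1.
    """
    return [
        candidates[start : start + shard_size]
        for start in range(0, len(candidates), shard_size)
    ]


# 12 scored candidates with the default shard size give shards of 5, 5 and 2;
# each shard would then be dispatched as its own feature run.
scored = [f"issue-{n}" for n in range(12)]
shards = shard_candidates(scored)
shard_sizes = [len(shard) for shard in shards]
```

The last shard is simply whatever remains, so a run never dispatches an empty shard.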
This PR has a migration; here is the generated SQL:
--
-- Create model SeerNightShiftRunShard
--
CREATE TABLE "seer_nightshiftrunshard" ("id" bigint NOT NULL PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY, "date_updated" timestamp with time zone NOT NULL, "date_added" timestamp with time zone NOT NULL, "extras" jsonb DEFAULT '{}'::jsonb NOT NULL, "run_id" bigint NOT NULL, "seer_run_id" bigint NULL UNIQUE);
ALTER TABLE "seer_nightshiftrunshard" ADD CONSTRAINT "seer_nightshiftrunsh_run_id_574f3dc7_fk_seer_nigh" FOREIGN KEY ("run_id") REFERENCES "seer_nightshiftrun" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "seer_nightshiftrunshard" VALIDATE CONSTRAINT "seer_nightshiftrunsh_run_id_574f3dc7_fk_seer_nigh";
ALTER TABLE "seer_nightshiftrunshard" ADD CONSTRAINT "seer_nightshiftrunshard_seer_run_id_aa8616a4_fk_seer_seerrun_id" FOREIGN KEY ("seer_run_id") REFERENCES "seer_seerrun" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "seer_nightshiftrunshard" VALIDATE CONSTRAINT "seer_nightshiftrunshard_seer_run_id_aa8616a4_fk_seer_seerrun_id";
CREATE INDEX CONCURRENTLY "seer_nightshiftrunshard_run_id_574f3dc7" ON "seer_nightshiftrunshard" ("run_id");
Each SeerRun is dispatched for exactly one shard, so model the link as one-to-one to enforce the invariant at the DB level. Keep SET_NULL: the SeerRun is a mirror the shard references, not its owner, and gets TTL-cleaned. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
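The effect of a nullable, unique link can be sketched with an in-memory SQLite database. The table and column names here are simplified stand-ins, not the real schema. The `ON DELETE SET NULL` clause is also an assumption made only to keep the sketch self-contained: the generated SQL above has no `ON DELETE` clause, so in the PR the nulling presumably happens in the ORM rather than in the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE seer_run (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE shard ("
    " id INTEGER PRIMARY KEY,"
    " seer_run_id INTEGER NULL UNIQUE"
    " REFERENCES seer_run (id) ON DELETE SET NULL)"
)

conn.execute("INSERT INTO seer_run (id) VALUES (1)")
conn.execute("INSERT INTO shard (id, seer_run_id) VALUES (1, 1)")
# Several shards may have no SeerRun: UNIQUE does not treat NULLs as equal.
conn.execute("INSERT INTO shard (id, seer_run_id) VALUES (2, NULL)")
conn.execute("INSERT INTO shard (id, seer_run_id) VALUES (3, NULL)")

# A second shard pointing at the same SeerRun violates the one-to-one invariant.
try:
    conn.execute("INSERT INTO shard (id, seer_run_id) VALUES (4, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Cleaning up the SeerRun leaves the shard row in place with a null link.
conn.execute("DELETE FROM seer_run WHERE id = 1")
link_after_cleanup = conn.execute(
    "SELECT seer_run_id FROM shard WHERE id = 1"
).fetchone()[0]
shard_count = conn.execute("SELECT COUNT(*) FROM shard").fetchone()[0]
```

The shard outlives the `SeerRun` it referenced, which matches the reasoning that the shard is not owned by the mirror row.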
except SeerNightShiftRun.DoesNotExist:
    run = (
        SeerNightShiftRun.objects.filter(organization_id=organization_id)
        .filter(Q(shards__seer_run__uuid=run_uuid) | Q(seer_run__uuid=run_uuid))
This is to be backwards compatible until I perform a data migration.
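The two-path lookup can be modeled without the ORM. The sketch below uses hypothetical dataclasses (`Run`, `Shard`) in place of the real models, and a linear scan in place of the `Q(...) | Q(...)` filter, purely to show which rows each path matches.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Shard:
    seer_run_uuid: Optional[str]


@dataclass
class Run:
    id: int
    # Stand-in for the legacy scalar seer_run FK on pre-shard rows.
    legacy_seer_run_uuid: Optional[str] = None
    shards: list[Shard] = field(default_factory=list)


def resolve_run(runs: list[Run], run_uuid: str) -> Optional[Run]:
    """Return the run owning run_uuid via a shard, else via the legacy link."""
    for run in runs:
        if any(shard.seer_run_uuid == run_uuid for shard in run.shards):
            return run
        if run.legacy_seer_run_uuid == run_uuid:
            return run
    return None


runs = [
    Run(id=1, legacy_seer_run_uuid="legacy-uuid"),            # created before sharding
    Run(id=2, shards=[Shard("uuid-a"), Shard("uuid-b")]),     # sharded run
]
sharded_hit = resolve_run(runs, "uuid-b")
legacy_hit = resolve_run(runs, "legacy-uuid")
miss = resolve_run(runs, "unknown")
```

Once pre-shard rows are backfilled with shards of their own, the second branch matches nothing and can be removed along with the column.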
Sharded runs share one SeerNightShiftRun, so writing per-delivery error_message to the run let one shard's success clear another shard's error, and pinned the legacy seer_run FK to shard_index 0 even if that chunk failed to dispatch. Record delivery errors on the shard, and point the legacy FK at the first successfully dispatched shard. Addresses Cursor review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Nothing outside the delivery fallback reads SeerNightShiftRun.seer_run, and sharded runs resolve via their shards. Leave the scalar FK null on new runs instead of pointing it at the first shard — this drops the first-shard tracking in dispatch and is a step toward removing the column once pre-shard rows are backfilled. The delivery read-fallback stays for those rows until then. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per-shard delivery errors record on SeerNightShiftRunShard.extras, but the run serializer read errorMessage only from the run, so a failed shard could read as a healthy run. Surface a shard error_message when the run itself has none. Addresses Cursor review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
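A minimal sketch of that serializer fallback follows. `serialized_error_message` is a hypothetical function, and surfacing the first non-empty shard error is an assumption; the commit message does not say which shard error is chosen when several exist.

```python
from typing import Optional


def serialized_error_message(
    run_error: Optional[str], shard_errors: list[Optional[str]]
) -> Optional[str]:
    """Prefer the run-level error; otherwise surface a shard-level one.

    Assumption: the first non-empty shard error is used.
    """
    if run_error:
        return run_error
    return next((error for error in shard_errors if error), None)


healthy = serialized_error_message(None, [None, None])
shard_failed = serialized_error_message(None, [None, "delivery timed out"])
run_failed = serialized_error_message("dispatch failed", [None, "delivery timed out"])
```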
shard.seer_run is Optional (nullable OneToOne), so tests asserting on it tripped union-attr under CI mypy. Use the non-null SeerRun from the dispatch helper / captured locals instead. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit b5a6802.
When some shards failed to dispatch but at least one succeeded, the run was treated as fully successful and the API errorMessage stayed empty, hiding that candidates in the failed chunks were never triaged. Record a run-level error_message for partial failures (delivery only clears per-shard errors, so it persists) and emit a metric. Addresses Cursor review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
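The partial-failure bookkeeping might look roughly like the sketch below. `dispatch_shards`, the `dispatch` callable, and the message wording are all hypothetical; the real code also emits a metric, which is omitted here.

```python
from typing import Callable, Optional


def dispatch_shards(
    shards: list[list[str]], dispatch: Callable[[list[str]], None]
) -> tuple[int, Optional[str]]:
    """Dispatch each shard, returning (dispatched count, run-level error).

    A run-level error is set only for partial failures: at least one shard
    dispatched and at least one failed. Handling of the all-failed case is
    left to the caller in this sketch.
    """
    failed = 0
    for shard in shards:
        try:
            dispatch(shard)
        except Exception:
            failed += 1

    dispatched = len(shards) - failed
    error_message = None
    if failed and dispatched:
        error_message = f"{failed} of {len(shards)} shards failed to dispatch"
    return dispatched, error_message


def flaky_dispatch(shard: list[str]) -> None:
    # Hypothetical dispatcher that fails for one specific shard.
    if "bad" in shard:
        raise RuntimeError("dispatch failed")


dispatched, error_message = dispatch_shards([["a", "b"], ["bad"], ["c"]], flaky_dispatch)
all_ok = dispatch_shards([["a"], ["b"]], flaky_dispatch)
```

Because delivery only clears per-shard errors, a message recorded this way stays on the run after the successful shards report back.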
on_run_created=_link_run,
)
except Exception:
shards = list(chunked(scored, options.get("seer.night_shift.shard_size")))
Bug: The seer.night_shift.shard_size option lacks validation. If set to 0, the chunked function will group all items into a single shard, defeating the purpose of sharding.
Severity: MEDIUM
Suggested Fix
Add validation to the seer.night_shift.shard_size option registration in src/sentry/options/defaults.py to enforce a minimum value of 1. Alternatively, add a check in src/sentry/tasks/seer/night_shift/cron.py before calling chunked to handle a shard_size of 0 or less, perhaps by falling back to the default value or raising an error.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: src/sentry/tasks/seer/night_shift/cron.py#L539
Potential issue: The `seer.night_shift.shard_size` option, which is modifiable by
operators, lacks validation to prevent non-positive values. If an operator sets this
value to `0`, the `chunked` utility at `src/sentry/tasks/seer/night_shift/cron.py:539`
will not create multiple small shards. Instead, it will silently create a single, large
chunk containing all items. This defeats the purpose of the sharding logic, which is to
prevent performance degradation in the triage agent by processing large candidate sets.
The result is that all candidates are sent to the agent at once.
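One way to act on this report is to clamp the option before chunking. In the sketch below, `chunked_stand_in` is written only to reproduce the behavior the report describes (it flushes a chunk when its length equals `size`); the real `chunked` utility may be implemented differently. `effective_shard_size` and the fallback to the default of 5 are likewise assumptions, not the PR's fix.

```python
from typing import Iterable, Iterator

DEFAULT_SHARD_SIZE = 5  # matches the documented default of the option


def chunked_stand_in(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Stand-in chunker: flushes only when the chunk length equals size."""
    chunk: list[str] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def effective_shard_size(configured: int, default: int = DEFAULT_SHARD_SIZE) -> int:
    """Fall back to the default when the configured size is not positive."""
    return configured if configured >= 1 else default


candidates = [f"issue-{n}" for n in range(12)]

# With size 0 the length check never matches, so everything lands in one chunk.
unguarded = list(chunked_stand_in(candidates, 0))

# Clamping first restores the intended sharding.
guarded = list(chunked_stand_in(candidates, effective_shard_size(0)))
```

Falling back silently keeps the cron running under a bad setting; raising an error instead would surface the misconfiguration sooner, which is the other option the report suggests.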

Night shift dispatched all scored candidates to one Seer feature run, which degrades the triage agent on large sets. This shards candidates into chunks of `seer.night_shift.shard_size` (default 5), dispatching each chunk as its own feature run, so `max_candidates=15` triggers 3 triage agents of 5.
- `SeerNightShiftRunShard` model (migration `0019`) makes run → `SeerRun` one-to-many; one shard per chunk.
- Result delivery resolves the run via a shard's `SeerRun` uuid, falling back to the legacy `seer_run` FK for pre-shard runs.
- Legacy `seer_run` FK kept (first shard) for the transition; backfilled and dropped in follow-up PRs.