feat(seer): Shard night shift triage into per-chunk feature runs #118080

Merged: chromy merged 8 commits into master from telkins/night-shift-shard-triage on Jun 19, 2026

@trevor-e

Member

Night shift dispatched all scored candidates to one Seer feature run, which degrades the triage agent on large sets. This PR shards candidates into chunks of seer.night_shift.shard_size (default 5) and dispatches each chunk as its own feature run, so max_candidates=15 triggers 3 triage agents with 5 candidates each.

  • New SeerNightShiftRunShard model (migration 0019) makes run → SeerRun one-to-many; one shard per chunk.
  • Delivery resolves the run via a shard's SeerRun uuid, falling back to the legacy seer_run FK for pre-shard runs.
  • Legacy seer_run FK kept (first shard) for the transition; backfilled and dropped in follow-up PRs.
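
As a rough illustration of the sharding arithmetic described above, the sketch below splits 15 candidates into chunks of 5. The `chunked` helper is a hypothetical stand-in written for this example (it is not claimed to be the utility the PR calls), and the candidate values are placeholders.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls up to `size` items per pass; an empty list ends the loop.
    while chunk := list(islice(iterator, size)):
        yield chunk


# Placeholder candidates; in the PR these would be scored issues.
candidates = [f"issue-{i}" for i in range(15)]
shards = list(chunked(candidates, 5))
print(len(shards), [len(shard) for shard in shards])
```

With 15 candidates and a shard size of 5, this produces three chunks, each of which would become its own feature run.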

A night shift run dispatched all scored candidates to a single Seer feature
run. Large candidate sets degrade a single triage agent (limited time and
context), so split candidates into chunks of seer.night_shift.shard_size
(default 5) and dispatch each chunk as its own feature run / SeerRun.

Each shard is recorded as a new SeerNightShiftRunShard, making the run ->
SeerRun relationship one-to-many. Result delivery resolves the run via a
shard's SeerRun uuid, falling back to the legacy scalar seer_run FK for runs
created before sharding. The legacy FK still points at the first shard during
the transition; it is backfilled and dropped in follow-up PRs.
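
The lookup order described in this commit message can be sketched in plain Python. This is an in-memory approximation under stated assumptions: the `Run` dataclass and `resolve_run` function are hypothetical names for illustration, and the PR itself expresses the lookup as a database query rather than a loop.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    """Hypothetical stand-in for a night shift run."""
    id: int
    shard_seer_run_uuids: list[str] = field(default_factory=list)
    legacy_seer_run_uuid: Optional[str] = None  # set only on pre-shard runs


def resolve_run(runs: list[Run], run_uuid: str) -> Optional[Run]:
    """Find the run that owns `run_uuid`, preferring shard links."""
    for run in runs:
        if run_uuid in run.shard_seer_run_uuids:
            return run
    # Fallback for runs created before sharding existed.
    for run in runs:
        if run.legacy_seer_run_uuid == run_uuid:
            return run
    return None


runs = [
    Run(id=1, shard_seer_run_uuids=["uuid-a", "uuid-b"]),
    Run(id=2, legacy_seer_run_uuid="uuid-c"),
]
print(resolve_run(runs, "uuid-b").id, resolve_run(runs, "uuid-c").id)
```

Checking the shard links first and the scalar link second keeps pre-shard rows resolvable without giving them shard records.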

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions bot added the "Scope: Backend" label (automatically applied to PRs that change backend components) Jun 18, 2026
The shard model is generic to any night shift workflow, not just triage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 18, 2026

Contributor

This PR has a migration; here is the generated SQL for src/sentry/seer/migrations/0019_add_night_shift_run_shard.py

for 0019_add_night_shift_run_shard in seer

--
-- Create model SeerNightShiftRunShard
--
CREATE TABLE "seer_nightshiftrunshard" ("id" bigint NOT NULL PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY, "date_updated" timestamp with time zone NOT NULL, "date_added" timestamp with time zone NOT NULL, "extras" jsonb DEFAULT '{}'::jsonb NOT NULL, "run_id" bigint NOT NULL, "seer_run_id" bigint NULL UNIQUE);
ALTER TABLE "seer_nightshiftrunshard" ADD CONSTRAINT "seer_nightshiftrunsh_run_id_574f3dc7_fk_seer_nigh" FOREIGN KEY ("run_id") REFERENCES "seer_nightshiftrun" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "seer_nightshiftrunshard" VALIDATE CONSTRAINT "seer_nightshiftrunsh_run_id_574f3dc7_fk_seer_nigh";
ALTER TABLE "seer_nightshiftrunshard" ADD CONSTRAINT "seer_nightshiftrunshard_seer_run_id_aa8616a4_fk_seer_seerrun_id" FOREIGN KEY ("seer_run_id") REFERENCES "seer_seerrun" ("id") DEFERRABLE INITIALLY DEFERRED NOT VALID;
ALTER TABLE "seer_nightshiftrunshard" VALIDATE CONSTRAINT "seer_nightshiftrunshard_seer_run_id_aa8616a4_fk_seer_seerrun_id";
CREATE INDEX CONCURRENTLY "seer_nightshiftrunshard_run_id_574f3dc7" ON "seer_nightshiftrunshard" ("run_id");

Each SeerRun is dispatched for exactly one shard, so model the link as
one-to-one to enforce the invariant at the DB level. Keep SET_NULL: the SeerRun
is a mirror the shard references, not its owner, and gets TTL-cleaned.
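
The effect of a nullable unique link can be tried out with SQLite as a stand-in for the PostgreSQL table. This is only a sketch: the table names are simplified, and because the generated SQL above contains no ON DELETE clause, the `ON DELETE SET NULL` here merely imitates the outcome the commit message describes rather than reproducing the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    CREATE TABLE seer_run (id INTEGER PRIMARY KEY);
    CREATE TABLE shard (
        id INTEGER PRIMARY KEY,
        seer_run_id INTEGER UNIQUE REFERENCES seer_run (id) ON DELETE SET NULL
    );
    INSERT INTO seer_run (id) VALUES (1);
    INSERT INTO shard (id, seer_run_id) VALUES (10, 1);
    """
)


def try_link_second_shard() -> bool:
    """Attempt to point a second shard at the same SeerRun."""
    try:
        conn.execute("INSERT INTO shard (id, seer_run_id) VALUES (11, 1)")
        return True
    except sqlite3.IntegrityError:
        return False  # the UNIQUE constraint rejects the duplicate link


duplicate_allowed = try_link_second_shard()

# Deleting the SeerRun (for example, TTL cleanup) keeps the shard, link nulled.
conn.execute("DELETE FROM seer_run WHERE id = 1")
remaining = conn.execute("SELECT id, seer_run_id FROM shard").fetchall()
print(duplicate_allowed, remaining)
```

The shard row survives the deletion, which matches the idea that the shard references the SeerRun without being owned by it.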

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread src/sentry/seer/night_shift/delivery.py
Comment thread src/sentry/tasks/seer/night_shift/cron.py Outdated
Comment thread src/sentry/tasks/seer/night_shift/cron.py Outdated
    except SeerNightShiftRun.DoesNotExist:
        run = (
            SeerNightShiftRun.objects.filter(organization_id=organization_id)
            .filter(Q(shards__seer_run__uuid=run_uuid) | Q(seer_run__uuid=run_uuid))

Member Author

This is to be backwards compatible until I perform a data migration.

Sharded runs share one SeerNightShiftRun, so writing per-delivery
error_message to the run let one shard's success clear another shard's error,
and pinned the legacy seer_run FK to shard_index 0 even if that chunk failed
to dispatch. Record delivery errors on the shard, and point the legacy FK at
the first successfully dispatched shard. Addresses Cursor review.
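
A small sketch may help show why per-shard storage avoids the clobbering described here. The `Shard` dataclass, the `record_delivery` function, and the `"error_message"` key inside `extras` are assumptions made for illustration, not the PR's actual code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Shard:
    """Hypothetical stand-in for one shard of a run."""
    index: int
    extras: dict = field(default_factory=dict)


def record_delivery(shard: Shard, error: Optional[str]) -> None:
    """Store or clear the delivery error on this shard only."""
    if error is None:
        shard.extras.pop("error_message", None)
    else:
        shard.extras["error_message"] = error


shards = [Shard(index=0), Shard(index=1)]
record_delivery(shards[0], "delivery failed")
record_delivery(shards[1], None)  # a later success elsewhere clears nothing on shard 0
print([shard.extras for shard in shards])
```

Because each delivery touches only its own shard, the order in which shards report no longer affects what is recorded for the others.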

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread src/sentry/seer/night_shift/delivery.py
trevor-e and others added 3 commits June 18, 2026 20:18
Nothing outside the delivery fallback reads SeerNightShiftRun.seer_run, and
sharded runs resolve via their shards. Leave the scalar FK null on new runs
instead of pointing it at the first shard — this drops the first-shard tracking
in dispatch and is a step toward removing the column once pre-shard rows are
backfilled. The delivery read-fallback stays for those rows until then.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per-shard delivery errors record on SeerNightShiftRunShard.extras, but the run
serializer read errorMessage only from the run, so a failed shard could read as
a healthy run. Surface a shard error_message when the run itself has none.
Addresses Cursor review.
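
The serializer fallback described in this commit could look roughly like the following. The function name, its parameters, and the choice to return the first shard error found are assumptions of this sketch; the source only says a shard error_message is surfaced when the run has none.

```python
from typing import Optional, Sequence


def serialize_error_message(
    run_error: Optional[str], shard_extras: Sequence[dict]
) -> Optional[str]:
    """Prefer the run-level error; otherwise surface a shard-level one."""
    if run_error:
        return run_error
    for extras in shard_extras:
        message = extras.get("error_message")
        if message:
            return message  # first shard error wins in this sketch
    return None


print(serialize_error_message(None, [{}, {"error_message": "shard 1 failed"}]))
```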

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
shard.seer_run is Optional (nullable OneToOne), so tests asserting on it
tripped union-attr under CI mypy. Use the non-null SeerRun from the dispatch
helper / captured locals instead.
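
The typing issue can be reproduced in miniature. In this sketch the `SeerRun` and `Shard` dataclasses and the `dispatch` helper are hypothetical stand-ins; the point is only that asserting on the non-optional value returned by the helper avoids attribute access on an `Optional` field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SeerRun:
    uuid: str


@dataclass
class Shard:
    seer_run: Optional[SeerRun]  # nullable link, as in the commit message


def dispatch() -> tuple[Shard, SeerRun]:
    """Hypothetical helper that returns the shard and its non-null SeerRun."""
    seer_run = SeerRun(uuid="abc")
    return Shard(seer_run=seer_run), seer_run


shard, seer_run = dispatch()
# `shard.seer_run.uuid` would be reported by mypy as union-attr, because
# `shard.seer_run` may be None; `seer_run` is typed as a plain SeerRun.
print(seer_run.uuid)
```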

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Contributor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit b5a6802.

Comment thread src/sentry/tasks/seer/night_shift/cron.py
When some shards failed to dispatch but at least one succeeded, the run was
treated as fully successful and the API errorMessage stayed empty, hiding that
candidates in the failed chunks were never triaged. Record a run-level
error_message for partial failures (delivery only clears per-shard errors, so
it persists) and emit a metric. Addresses Cursor review.
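
One way to picture the partial-failure bookkeeping is the loop below. Everything here is a sketch under assumptions: `dispatch_shards`, its callback, and the message wording are hypothetical, the metric is omitted, and reporting a message when every shard fails is this sketch's own choice, since the commit only describes the partial case.

```python
from typing import Callable, Optional, Sequence


def dispatch_shards(
    chunks: Sequence[list[str]],
    dispatch: Callable[[list[str]], None],
) -> tuple[int, Optional[str]]:
    """Dispatch each chunk; return (dispatched count, run-level error or None)."""
    failed = 0
    for chunk in chunks:
        try:
            dispatch(chunk)
        except Exception:
            failed += 1  # keep going so the other chunks still get triaged
    dispatched = len(chunks) - failed
    if failed:
        return dispatched, f"{failed} of {len(chunks)} shards failed to dispatch"
    return dispatched, None


def flaky_dispatch(chunk: list[str]) -> None:
    """Placeholder dispatcher that fails for one marked chunk."""
    if "bad" in chunk:
        raise RuntimeError("dispatch failed")


dispatched, error_message = dispatch_shards([["a"], ["bad"], ["c"]], flaky_dispatch)
print(dispatched, error_message)
```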

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@chromy chromy marked this pull request as ready for review June 19, 2026 07:41
@chromy chromy requested review from a team as code owners June 19, 2026 07:41
on_run_created=_link_run,
)
except Exception:
shards = list(chunked(scored, options.get("seer.night_shift.shard_size")))

Contributor

Bug: The seer.night_shift.shard_size option lacks validation. If set to 0, the chunked function will group all items into a single shard, defeating the purpose of sharding.
Severity: MEDIUM

Suggested Fix

Add validation to the seer.night_shift.shard_size option registration in src/sentry/options/defaults.py to enforce a minimum value of 1. Alternatively, add a check in src/sentry/tasks/seer/night_shift/cron.py before calling chunked to handle a shard_size of 0 or less, perhaps by falling back to the default value or raising an error.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/sentry/tasks/seer/night_shift/cron.py#L539

Potential issue: The `seer.night_shift.shard_size` option, which is modifiable by
operators, lacks validation to prevent non-positive values. If an operator sets this
value to `0`, the `chunked` utility at `src/sentry/tasks/seer/night_shift/cron.py:539`
will not create multiple small shards. Instead, it will silently create a single, large
chunk containing all items. This defeats the purpose of the sharding logic, which is to
prevent performance degradation in the triage agent by processing large candidate sets.
The result is that all candidates are sent to the agent at once.
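
The reported behavior and one possible guard can be sketched as follows. `naive_chunked` is a hypothetical chunker assumed to behave the way the report describes (it is not claimed to be the utility in cron.py), and `effective_shard_size` with its fallback to 5 is an illustrative take on the second suggested fix.

```python
from typing import Iterable, Iterator

DEFAULT_SHARD_SIZE = 5  # matches the default stated in the PR description


def naive_chunked(items: Iterable[int], size: int) -> Iterator[list[int]]:
    """Chunker with no validation; a size of 0 never triggers a flush."""
    chunk: list[int] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def effective_shard_size(configured: object) -> int:
    """Fall back to the default when the option is not a positive integer."""
    is_int = isinstance(configured, int) and not isinstance(configured, bool)
    if is_int and configured >= 1:
        return configured
    return DEFAULT_SHARD_SIZE


candidates = list(range(15))
unguarded = list(naive_chunked(candidates, 0))
guarded = list(naive_chunked(candidates, effective_shard_size(0)))
print(len(unguarded), len(guarded))
```

Falling back silently keeps the cron running; raising an error instead would surface the misconfiguration sooner, which is the trade-off the suggested fix leaves open.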

@chromy chromy merged commit 626bb09 into master Jun 19, 2026
86 checks passed
@chromy chromy deleted the telkins/night-shift-shard-triage branch June 19, 2026 07:49