Add semantic duplicate detection to security-issue-import Step 2a #154
BTW, I created a little test harness that outputs the tests so you can run them in any LLM. I'm not sure if this is useful or if we want to do this, but here is the output:
potiuk
left a comment
This is a really nice optimisation! I am going to test it today :)
We are discussing adding [...]. This is slightly different from unit tests, and I think there are various approaches there, but we have not looked closely yet.
Possibly a good idea for https://github.com/apache/airflow-steward/discussions
The existing Step 2a fuzzy-match runs three structured searches (GHSA IDs, code pointers, subject keywords) against existing trackers. These work well when a report carries explicit technical identifiers, but miss the most common real-world duplicate pattern: the same vulnerability reported twice by different people with no shared identifiers, or the same reporter filing again weeks later with different framing.
This PR adds two checks that run after the three-key search, triggered only when no STRONG (GHSA) match was already found:
Semantic comparison pass — fetches titles and the first 300 characters of every open tracker in a single gh issue list call, produces a root-cause summary from the incoming report, and compares against the corpus on four axes: component/subsystem, bug class, attack path, and fix shape. Two-axis overlap = MEDIUM; three or four axes = STRONG (same weight as a GHSA collision — routes to security-issue-deduplicate rather than creating a new tracker).
Reporter-identity check — searches open and recently-closed trackers for the inbound reporter's email local-part. A hit on a related issue counts as MEDIUM even with only one-axis overlap — the primary signal for the same-reporter-different-framing case.
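As a rough illustration of how the two checks combine, here is a minimal Python sketch of the scoring rule described above. It is not the skill's actual implementation: in the PR the axis comparison is a semantic judgement made by the model, whereas this sketch substitutes case-insensitive string equality per axis. The names `RootCause`, `axis_overlap`, `local_part`, and `classify` are hypothetical, and the plain substring match on the reporter's local-part is also an assumption.

```python
from dataclasses import dataclass

# The four comparison axes named in the PR description.
AXES = ("component", "bug_class", "attack_path", "fix_shape")


@dataclass(frozen=True)
class RootCause:
    """Hypothetical root-cause summary, one short label per axis."""
    component: str
    bug_class: str
    attack_path: str
    fix_shape: str


def axis_overlap(a: RootCause, b: RootCause) -> int:
    """Count axes on which two summaries agree.

    Assumption: agreement is approximated by case-insensitive equality;
    the real comparison is semantic.
    """
    return sum(
        1
        for axis in AXES
        if getattr(a, axis).strip().lower() == getattr(b, axis).strip().lower()
    )


def local_part(email: str) -> str:
    """Return the part of an email address before the '@', lower-cased."""
    return email.split("@", 1)[0].strip().lower()


def classify(incoming: RootCause, tracker: RootCause,
             reporter_email: str = "", tracker_text: str = "") -> str:
    """Apply the thresholds from the PR description.

    3-4 axes -> STRONG, 2 axes -> MEDIUM, and 1 axis -> MEDIUM only when
    the reporter's local-part also appears in the tracker text.
    """
    overlap = axis_overlap(incoming, tracker)
    if overlap >= 3:
        return "STRONG"
    if overlap == 2:
        return "MEDIUM"
    reporter = local_part(reporter_email)
    # Naive substring match; short local-parts could match unrelated words.
    if overlap == 1 and reporter and reporter in tracker_text.lower():
        return "MEDIUM"
    return "NONE"
```

Keeping the reporter check as a tie-breaker that only applies at one-axis overlap mirrors the description: identity alone never produces a match, it only lifts an otherwise weak semantic signal to MEDIUM.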
The budget guardrail is updated from 5 to 6 gh calls per candidate to account for the new bulk-list and reporter-identity calls, plus up to 3 follow-up full-body reads on the highest-scoring semantic candidates.
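The guardrail can be pictured as a simple counter. The sketch below is hypothetical (`GhBudget` and its methods do not appear in the PR) and assumes the up-to-3 follow-up body reads are capped separately from the 6-call budget; the description does not spell out whether they count against it.

```python
from dataclasses import dataclass, field


@dataclass
class GhBudget:
    """Hypothetical per-candidate budget for gh invocations."""
    max_calls: int = 6        # search/list calls per candidate (from the PR text)
    max_followups: int = 3    # full-body reads on top candidates (from the PR text)
    calls: int = 0
    followups: int = 0
    log: list = field(default_factory=list)

    def try_call(self, label: str) -> bool:
        """Record one gh call if budget remains; return False when exhausted."""
        if self.calls >= self.max_calls:
            return False
        self.calls += 1
        self.log.append(label)
        return True

    def try_followup(self, issue_number: int) -> bool:
        """Record one full-body read if the follow-up cap is not yet reached."""
        if self.followups >= self.max_followups:
            return False
        self.followups += 1
        self.log.append(f"body-read #{issue_number}")
        return True
```

A caller would check the return value before each invocation and stop searching, rather than raise, once the budget is spent.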
Testing — three synthetic test cases verified manually: clear duplicate (fires STRONG), false-positive trap with same subsystem but different bug class (correctly suppressed), same reporter with different framing (fires STRONG on axes; identity check fires as supporting signal). skill-validate passes with no violations.