[facade] fix contributor resolution permanently skipping commits with null cmt_ght_author_id#3792
Conversation
999d897 to
210f53b
Compare
…t_author_id The insert_facade_contributors task was selecting unresolved commits via email-table cross-checks (NOT EXISTS against contributors/aliases) gated behind a last_collection_date window introduced in PR augurlabs#3253. This had two compounding flaws: - If a commit email was later linked to a GitHub account, it became invisible to the query (the email now EXISTS in contributors), leaving cmt_ght_author_id permanently NULL with no path for self-recovery. - The date window meant any commit that slipped through with an older data_collection_date was systematically excluded on every subsequent run. Fix: replace both queries with cmt_ght_author_id IS NULL as the sole eligibility signal — the only correct definition of "needs resolution". Query 1 excludes known-unresolvable emails via unresolved_commit_emails. Query 2 uses a CTE that unions all three contributor email columns (cntrb_email, cntrb_canonical, alias_email), deduplicates, then inner- joins against unlinked commits. Both queries drop the since_date bind param entirely; the CollectionStatus lookup is removed as now unused. Fixes augurlabs#3779 Signed-off-by: mn-ram <[email protected]>
210f53b to
3d7babf
Compare
MoralCode
left a comment
There was a problem hiding this comment.
overall this seems to be mostly an application of the suggested queries in the underlying issue. Had some feedback about the code
| # Only target commits that are still unlinked (cmt_ght_author_id IS NULL) | ||
| # so we never re-process already-resolved records. Removing the | ||
| # last_collection_date guard means historical commits that slipped through | ||
| # on first pass are finally eligible for resolution. |
There was a problem hiding this comment.
this comment is very verbose and describes almost line by line what the below query does. Can you make the comment more high level so it complements, rather than duplicating, the understanding someone would get by reading the code?
| FROM email_to_contributor | ||
| ORDER BY email | ||
| ) | ||
| SELECT DISTINCT |
There was a problem hiding this comment.
the suggested query cali wrote in the issue doesnt have a distinct here. Can you elaborate on why this was added to the query?
There was a problem hiding this comment.
Removed DISTINCT from the outer SELECT — the deduplicated CTE already ensures one cntrb_id per email via DISTINCT ON (email), so the outer DISTINCT was redundant.
| # Find all commits not yet linked to a contributor. The correct signal for | ||
| # "needs resolution" is a NULL cmt_ght_author_id — not a date window or an | ||
| # email-table cross-check. The old email-join approach silently skipped | ||
| # commits whose emails were later linked to a GitHub account, and the | ||
| # last_collection_date cutoff (PR #3253) made that permanent. Commits | ||
| # already marked unresolvable are excluded via the unresolved_commit_emails | ||
| # table so we don't hammer the GitHub API on known dead-ends. |
There was a problem hiding this comment.
This comment is fairly verbose - could we simplify it?
7bf784a to
3e0b8bc
Compare
Signed-off-by: mn-ram <[email protected]>
3e0b8bc to
bcc4690
Compare
Signed-off-by: mn-ram <[email protected]>
|
Hi @sgoggins, I’ve addressed all of MoralCode’s review feedback and also fixed the schema prefix for unresolved_commit_emails and corrected the email column to cmt_author_email. All checks are passing now — would appreciate a re-review. Thanks! |
|
@mn-ram thanks for addressing the feedback so quickly. I may have been a little premature to review your PR since I have been actively working on resolving the same issue that this PR solves. I have pulled some of the changes you made here that I hadnt gotten to yet, such as being a little more specific about the schema being used in these queries, and the text of some of your comments, and have pulled them over to my PR #3797 with you credited as a co-author. I'm going to close this PR so we don't end up duplicating any more work from each other. Thanks for your contribution though! |
Changeset
new_contrib_sqlquery ininsert_facade_contributors— the old query selected unresolved commits by checking whether the commit's email was absent from thecontributors/contributors_aliasestables, gated behind alast_collection_datewindow. Now usescmt_ght_author_id IS NULLas the sole eligibility signal, excluding known-dead emails viaunresolved_commit_emails.resolve_email_to_cntrb_id_sqlquery — the old query had the same date window. New version uses a CTE that unions all three contributor email columns (cntrb_email,cntrb_canonical,alias_email), deduplicates withDISTINCT ON, then inner-joins against commits wherecmt_ght_author_id IS NULL AND repo_id = :repo_id.CollectionStatuslookup block andlast_collected_datevariable — no longer needed after removing bothsince_datebind params.get_session, execute_session_queryimport.Notes
The old email-NOT-EXISTS logic had two compounding failure modes:
cmt_ght_author_idstaysNULLforever with no error raised.last_collection_datecutoff added in PR add date filter to contributer resolution logic queries #3253 meant any commit that slipped through with an olderdata_collection_datewas excluded on every subsequent run, making recovery impossible without a manual full-recollect flag.cmt_ght_author_id IS NULLis the only semantically correct question: "has this commit been linked to a contributor yet?" It's also what enables the self-recovery path described in #3779 — settingcmt_ght_author_id = NULLon corrupted rows (from #3740) is now sufficient to re-queue them on the next collection cycle.The
unresolved_commit_emailsexclusion in query 1 preserves the existing short-circuit: emails that have already been confirmed unresolvable via GitHub API aren't re-queried needlessly.Related issues/PRs
since_datechange introduced in add date filter to contributer resolution logic queries #3253 for this specific query