[facade] fix contributor resolution permanently skipping commits with null cmt_ght_author_id #3792

Closed
mn-ram wants to merge 3 commits into augurlabs:main from mn-ram:fix/facade-contributor-resolution-skips-null-cmt-ght-author

Conversation

@mn-ram mn-ram commented Mar 25, 2026

Changeset

  • Replace the new_contrib_sql query in insert_facade_contributors — the old query selected unresolved commits by checking whether the commit's email was absent from the contributors/contributors_aliases tables, gated behind a last_collection_date window. Now uses cmt_ght_author_id IS NULL as the sole eligibility signal, excluding known-dead emails via unresolved_commit_emails.
  • Replace the resolve_email_to_cntrb_id_sql query — the old query had the same date window. New version uses a CTE that unions all three contributor email columns (cntrb_email, cntrb_canonical, alias_email), deduplicates with DISTINCT ON, then inner-joins against commits where cmt_ght_author_id IS NULL AND repo_id = :repo_id.
  • Remove the CollectionStatus lookup block and last_collected_date variable — no longer needed after removing both since_date bind params.
  • Remove the now-unused get_session and execute_session_query imports.
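
A minimal sketch of the shape of the second query described above, run against an in-memory SQLite database rather than Augur's PostgreSQL schema. The reduced table layouts, the sample rows, and the join on cmt_author_email are assumptions made for illustration. SQLite has no DISTINCT ON, so a GROUP BY with MIN(cntrb_id) stands in for the deduplication step; this is not the PR's actual SQL.

```python
import sqlite3

# Hypothetical, heavily reduced schema; the real Augur tables have many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contributors (cntrb_id TEXT, cntrb_email TEXT, cntrb_canonical TEXT);
CREATE TABLE contributors_aliases (cntrb_id TEXT, alias_email TEXT);
CREATE TABLE commits (
    cmt_id INTEGER, repo_id INTEGER, cmt_author_email TEXT, cmt_ght_author_id TEXT
);

INSERT INTO contributors VALUES ('c1', 'a@example.com', 'a.canonical@example.com');
INSERT INTO contributors_aliases VALUES ('c1', 'a.alias@example.com');
INSERT INTO commits VALUES (1, 10, 'a@example.com', NULL);
INSERT INTO commits VALUES (2, 10, 'a.alias@example.com', NULL);
INSERT INTO commits VALUES (3, 10, 'a.canonical@example.com', 'c1');
INSERT INTO commits VALUES (4, 99, 'a@example.com', NULL);
""")

RESOLVE_SQL = """
WITH email_to_contributor AS (
    -- Union all three places a contributor email can live.
    SELECT cntrb_email AS email, cntrb_id
    FROM contributors WHERE cntrb_email IS NOT NULL
    UNION ALL
    SELECT cntrb_canonical, cntrb_id
    FROM contributors WHERE cntrb_canonical IS NOT NULL
    UNION ALL
    SELECT alias_email, cntrb_id
    FROM contributors_aliases WHERE alias_email IS NOT NULL
),
deduplicated AS (
    -- Stand-in for PostgreSQL's DISTINCT ON (email): one cntrb_id per email.
    SELECT email, MIN(cntrb_id) AS cntrb_id
    FROM email_to_contributor
    GROUP BY email
)
SELECT c.cmt_id, d.cntrb_id
FROM commits AS c
JOIN deduplicated AS d ON d.email = c.cmt_author_email
WHERE c.cmt_ght_author_id IS NULL
  AND c.repo_id = :repo_id
ORDER BY c.cmt_id
"""

# Commit 3 is already linked and commit 4 belongs to another repo,
# so only commits 1 and 2 are returned for repo 10.
resolved = conn.execute(RESOLVE_SQL, {"repo_id": 10}).fetchall()
print(resolved)
```

Because the deduplicated CTE yields at most one row per email, each unlinked commit can match at most one contributor in the join.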

Notes

The old email-NOT-EXISTS logic had two compounding failure modes:

  1. Silent permanent skip — if a commit email was later linked to a GitHub account (contributor or alias row inserted after the commit was collected), the email now EXISTS in the contributors table and the commit silently falls out of the selection set. cmt_ght_author_id stays NULL forever with no error raised.
  2. Date window makes it permanent: the last_collection_date cutoff added in PR #3253 ("add date filter to contributer resolution logic queries") meant any commit that slipped through with an older data_collection_date was excluded on every subsequent run, making recovery impossible without a manual full-recollect flag.
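
The first failure mode can be reproduced in a few lines. This is an illustrative sketch only: it uses an in-memory SQLite database, a hypothetical two-table schema, and simplified stand-ins for the old and new selection predicates (the date window is left out entirely).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contributors (cntrb_id TEXT, cntrb_email TEXT);
CREATE TABLE commits (cmt_id INTEGER, cmt_author_email TEXT, cmt_ght_author_id TEXT);
-- The commit is collected first and left unlinked ...
INSERT INTO commits VALUES (1, 'dev@example.com', NULL);
-- ... and a contributor row for the same email is inserted later.
INSERT INTO contributors VALUES ('c1', 'dev@example.com');
""")

# Simplified stand-in for the old approach: select commits whose email is unknown.
OLD_STYLE_SQL = """
SELECT c.cmt_id
FROM commits AS c
WHERE NOT EXISTS (
    SELECT 1 FROM contributors AS k
    WHERE k.cntrb_email = c.cmt_author_email
)
"""

# Simplified stand-in for the new approach: select commits that are not linked yet.
NEW_STYLE_SQL = "SELECT cmt_id FROM commits WHERE cmt_ght_author_id IS NULL"

old_selected = [row[0] for row in conn.execute(OLD_STYLE_SQL)]
new_selected = [row[0] for row in conn.execute(NEW_STYLE_SQL)]
print(old_selected, new_selected)
```

The email-based predicate no longer sees commit 1 once its email exists in contributors, even though the commit itself was never linked; the NULL check still selects it.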

cmt_ght_author_id IS NULL is the only semantically correct question: "has this commit been linked to a contributor yet?" It's also what enables the self-recovery path described in #3779 — setting cmt_ght_author_id = NULL on corrupted rows (from #3740) is now sufficient to re-queue them on the next collection cycle.

The unresolved_commit_emails exclusion in query 1 preserves the existing short-circuit: emails that have already been confirmed unresolvable via GitHub API aren't re-queried needlessly.
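
A small sketch of how the eligibility check and the unresolved_commit_emails short-circuit might combine, and of the re-queue behavior when a row is reset to NULL. It runs on in-memory SQLite; the single email column on unresolved_commit_emails, the reduced commits table, and the helper name eligible are assumptions for illustration, not the PR's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (cmt_id INTEGER, cmt_author_email TEXT, cmt_ght_author_id TEXT);
CREATE TABLE unresolved_commit_emails (email TEXT);
INSERT INTO commits VALUES (1, 'known@example.com', 'c1');
INSERT INTO commits VALUES (2, 'new@example.com', NULL);
INSERT INTO commits VALUES (3, 'dead@example.com', NULL);
INSERT INTO unresolved_commit_emails VALUES ('dead@example.com');
""")

ELIGIBLE_SQL = """
SELECT c.cmt_id
FROM commits AS c
WHERE c.cmt_ght_author_id IS NULL
  AND NOT EXISTS (
      -- Skip emails already confirmed unresolvable.
      SELECT 1 FROM unresolved_commit_emails AS u
      WHERE u.email = c.cmt_author_email
  )
ORDER BY c.cmt_id
"""


def eligible(connection):
    """Return the ids of commits that still need contributor resolution."""
    return [row[0] for row in connection.execute(ELIGIBLE_SQL)]


before = eligible(conn)

# Resetting a linked row to NULL puts it back in the selection set.
conn.execute("UPDATE commits SET cmt_ght_author_id = NULL WHERE cmt_id = 1")
after = eligible(conn)
print(before, after)
```

Commit 3 stays excluded in both runs because its email is listed as unresolvable, while commit 1 becomes eligible as soon as its cmt_ght_author_id is cleared.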

Related issues/PRs

@mn-ram mn-ram requested a review from sgoggins as a code owner March 25, 2026 19:54
@mn-ram mn-ram force-pushed the fix/facade-contributor-resolution-skips-null-cmt-ght-author branch from 999d897 to 210f53b Compare March 25, 2026 19:56
…t_author_id

The insert_facade_contributors task was selecting unresolved commits via
email-table cross-checks (NOT EXISTS against contributors/aliases) gated
behind a last_collection_date window introduced in PR augurlabs#3253.

This had two compounding flaws:
- If a commit email was later linked to a GitHub account, it became
  invisible to the query (the email now EXISTS in contributors), leaving
  cmt_ght_author_id permanently NULL with no path for self-recovery.
- The date window meant any commit that slipped through with an older
  data_collection_date was systematically excluded on every subsequent run.

Fix: replace both queries with cmt_ght_author_id IS NULL as the sole
eligibility signal — the only correct definition of "needs resolution".
Query 1 excludes known-unresolvable emails via unresolved_commit_emails.
Query 2 uses a CTE that unions all three contributor email columns
(cntrb_email, cntrb_canonical, alias_email), deduplicates, then inner-
joins against unlinked commits. Both queries drop the since_date bind
param entirely; the CollectionStatus lookup is removed as now unused.

Fixes augurlabs#3779

Signed-off-by: mn-ram <[email protected]>
@mn-ram mn-ram force-pushed the fix/facade-contributor-resolution-skips-null-cmt-ght-author branch from 210f53b to 3d7babf Compare March 25, 2026 19:58
Collaborator

@MoralCode MoralCode left a comment

overall this seems to be mostly an application of the suggested queries in the underlying issue. Had some feedback about the code

# Only target commits that are still unlinked (cmt_ght_author_id IS NULL)
# so we never re-process already-resolved records. Removing the
# last_collection_date guard means historical commits that slipped through
# on first pass are finally eligible for resolution.
Collaborator

this comment is very verbose and describes almost line by line what the below query does. Can you make the comment more high level so it complements, rather than duplicating, the understanding someone would get by reading the code?

Author

Simplified the comment.

FROM email_to_contributor
ORDER BY email
)
SELECT DISTINCT
Collaborator

the suggested query cali wrote in the issue doesn't have a DISTINCT here. Can you elaborate on why this was added to the query?

Author

Removed DISTINCT from the outer SELECT — the deduplicated CTE already ensures one cntrb_id per email via DISTINCT ON (email), so the outer DISTINCT was redundant.

Comment on lines +199 to +205
# Find all commits not yet linked to a contributor. The correct signal for
# "needs resolution" is a NULL cmt_ght_author_id — not a date window or an
# email-table cross-check. The old email-join approach silently skipped
# commits whose emails were later linked to a GitHub account, and the
# last_collection_date cutoff (PR #3253) made that permanent. Commits
# already marked unresolvable are excluded via the unresolved_commit_emails
# table so we don't hammer the GitHub API on known dead-ends.
Collaborator

This comment is fairly verbose - could we simplify it?

Author

Simplified the comment.

@mn-ram mn-ram force-pushed the fix/facade-contributor-resolution-skips-null-cmt-ght-author branch from 7bf784a to 3e0b8bc Compare March 26, 2026 16:04
@mn-ram mn-ram force-pushed the fix/facade-contributor-resolution-skips-null-cmt-ght-author branch from 3e0b8bc to bcc4690 Compare March 26, 2026 16:05
@mn-ram mn-ram requested a review from MoralCode March 26, 2026 16:09
Author

mn-ram commented Mar 26, 2026

Hi @sgoggins, I’ve addressed all of MoralCode’s review feedback and also fixed the schema prefix for unresolved_commit_emails and corrected the email column to cmt_author_email. All checks are passing now — would appreciate a re-review. Thanks!

@MoralCode
Collaborator

@mn-ram thanks for addressing the feedback so quickly.

I may have been a little premature to review your PR since I have been actively working on resolving the same issue that this PR solves.

I have pulled some of the changes you made here that I hadn't gotten to yet, such as being more specific about the schema used in these queries and the text of some of your comments, over to my PR #3797, with you credited as a co-author.

I'm going to close this PR so we don't end up duplicating any more work from each other. Thanks for your contribution though!

@MoralCode MoralCode closed this Mar 26, 2026
Development

Successfully merging this pull request may close these issues.

current insert_facade_contributors logic not linking previously resolved contributors
