Fix: Eliminate redundant full table scans in messages and events collection #3444

Draft
PredictiveManish wants to merge 4 commits into augurlabs:main from PredictiveManish:map-once-update

Conversation

@PredictiveManish
Contributor

@PredictiveManish PredictiveManish commented Dec 6, 2025

Description

Moved the mapping queries outside the batch loops and passed the pre-built mappings as parameters to the processing functions, following the pattern established in #3439.
Solves #3440

Changes Made

augur/tasks/github/messages.py

  • Built issue_url_to_id_map and pr_issue_url_to_id_map once in collect_github_messages() before any batch processing
  • Updated process_messages() to accept mappings as parameters instead of rebuilding them
  • Updated process_large_issue_and_pr_message_collection() to accept and pass mappings
  • Increased batch size from 20 to 1000 (reduces batch overhead)

augur/tasks/github/events.py

  • Built issue_url_to_id_map and pr_url_to_id_map once in BulkGithubEventCollection.collect() before the batch loop
  • Updated _process_events(), _process_issue_events(), and _process_pr_events() to accept mappings as parameters
  • Removed redundant _get_map_from_*() calls from batch processing methods

Performance Improvement

  • Before: 1,000 messages -> 50 full scans of issues AND PRs tables
  • After: 1,000 messages -> 1 full scan of each table (50x reduction)
  • Before: 10,000 events -> 40 full scans total
  • After: 10,000 events -> 1 full scan of each table (40x reduction)
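
The messages figure above follows from the old batch size of 20: one scan per batch gives 1,000 / 20 = 50 scans of each table. A small sketch of that arithmetic is below; `scans_per_table` is a hypothetical helper written for this illustration. The events batch size is not stated in the description, so the example does not try to reproduce the events figure.

```python
import math


def scans_per_table(n_items: int, batch_size: int, map_built_per_batch: bool) -> int:
    """Number of full scans of one table needed to process n_items.

    If the mapping is rebuilt inside the batch loop, each batch costs one scan.
    If it is built once before the loop, the whole run costs one scan.
    """
    if map_built_per_batch:
        return math.ceil(n_items / batch_size)
    return 1


# Messages example from the description: 1,000 messages, old batch size of 20.
before = scans_per_table(1000, 20, map_built_per_batch=True)
after = scans_per_table(1000, 20, map_built_per_batch=False)
print(before, after)
```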

This PR fixes #3440

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

@sgoggins sgoggins self-assigned this Dec 6, 2025
sgoggins previously approved these changes Dec 6, 2025
Collaborator

@sgoggins sgoggins left a comment

One question regarding the change in batch size from 20 to 1,000.

@shlokgilda
Collaborator

I just have a non-blocking code-quality suggestion: use a named constant like MESSAGE_BATCH_SIZE = 500 instead of a magic number.

@MoralCode
Collaborator

> I just have a non-blocking code-quality suggestion: use a named constant like MESSAGE_BATCH_SIZE = 500 instead of a magic number.

Yep, agreed

@PredictiveManish
Contributor Author

> I just have a non-blocking code-quality suggestion: use a named constant like MESSAGE_BATCH_SIZE = 500 instead of a magic number.

Which size should I use, 500 or 200? @MoralCode suggested 200.

@sgoggins sgoggins added the database Related to Augur's unified data model label Dec 9, 2025
@sgoggins
Collaborator

sgoggins commented Dec 9, 2025

> I just have a non-blocking code-quality suggestion: use a named constant like MESSAGE_BATCH_SIZE = 500 instead of a magic number.

> Which size should I use, 500 or 200? @MoralCode suggested 200.

200
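
A minimal sketch of the named-constant suggestion with the agreed value of 200. `MESSAGE_BATCH_SIZE` is the name proposed in this thread; `iter_batches` is a hypothetical helper for illustration and is not taken from the Augur codebase.

```python
from typing import Iterator, List, Sequence

# Named constant instead of a magic number; 200 is the value agreed in this thread.
MESSAGE_BATCH_SIZE = 200


def iter_batches(items: Sequence, batch_size: int = MESSAGE_BATCH_SIZE) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 450 items split into batches of 200, 200, and 50.
batch_sizes = [len(batch) for batch in iter_batches(list(range(450)))]
print(batch_sizes)
```

Keeping the size in one constant means a later change to the batch size touches a single line.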

@sgoggins
Collaborator

sgoggins commented Dec 9, 2025

@MoralCode / @PredictiveManish: I'm rerunning the failed end-to-end test. Sometimes GitHub gets overwhelmed and the tests just time out.

sgoggins previously approved these changes Dec 9, 2025
Collaborator

@sgoggins sgoggins left a comment

@PredictiveManish : May we assume you ran this locally and collected data?

@MoralCode
Collaborator

> May we assume you ran this locally and collected data?

Let's not assume. I would rather accidentally over-test than not test at all.

Comment on lines +40 to +52
# Build mappings once before processing any messages
# create mapping from issue url to issue id of current issues
issue_url_to_id_map = {}
issues = augur_db.session.query(Issue).filter(Issue.repo_id == repo_id).all()
for issue in issues:
    issue_url_to_id_map[issue.issue_url] = issue.issue_id

# create mapping from pr url to pr id of current pull requests
pr_issue_url_to_id_map = {}
prs = augur_db.session.query(PullRequest).filter(PullRequest.repo_id == repo_id).all()
for pr in prs:
    pr_issue_url_to_id_map[pr.pr_issue_url] = pr.pull_request_id

Collaborator

Above, you rely on some functions for these mappings. Is this code just a duplicate of the code in those functions? If those functions are useful elsewhere and not tied to anything specific, maybe they could become utility functions that are imported here too.

Overall, I'd recommend splitting this PR so that the changes to events.py can be merged without being held up by the larger question of refactoring (ref #3345) that this file brings up.
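
One way the shared-helper idea raised in this comment could look is sketched below. This is only an illustration under stated assumptions: `build_url_to_id_map` is a hypothetical name, and `SimpleNamespace` objects stand in for the ORM rows so the example runs without a database. The attribute names are the ones used in the snippet under review.

```python
from operator import attrgetter
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def build_url_to_id_map(rows: Iterable[Any], url_attr: str, id_attr: str) -> Dict[str, Any]:
    """Build a {url: id} dict from any rows exposing the two named attributes."""
    get_url = attrgetter(url_attr)
    get_id = attrgetter(id_attr)
    return {get_url(row): get_id(row) for row in rows}


# Stand-ins for query results; the URLs and IDs are made up for the example.
issues = [SimpleNamespace(issue_url="https://example.invalid/issues/1", issue_id=10)]
prs = [SimpleNamespace(pr_issue_url="https://example.invalid/issues/2", pull_request_id=20)]

issue_url_to_id_map = build_url_to_id_map(issues, "issue_url", "issue_id")
pr_issue_url_to_id_map = build_url_to_id_map(prs, "pr_issue_url", "pull_request_id")
```

A single helper of this shape could be imported by both messages.py and events.py, which would avoid the duplicated loops the comment points out.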

Collaborator

@PredictiveManish : I agree with @MoralCode that splitting the events.py from the other refactoring would make this more straightforward to merge and test.

Contributor Author

The events.py changes are covered in #3479.

Collaborator

Sounds good, I will focus on that PR first. This one is still on hold until we figure out how best to share the function that was proposed in events.py.

Contributor Author

Yes, I have created a new PR for events.py, as suggested for the split.

PredictiveManish added a commit to PredictiveManish/augur that referenced this pull request Dec 18, 2025
Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
ABrain7710 previously approved these changes Jan 7, 2026
@MoralCode MoralCode added discussion Seeking active feedback, usually for items under active development pending changes PRs that have review comments that have been added but haven't been addressed in a while labels Jan 8, 2026
PredictiveManish added a commit to PredictiveManish/augur that referenced this pull request Jan 9, 2026
Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
…ection

Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
@sgoggins sgoggins added this to the v0.93.0 milestone Jan 21, 2026
@MoralCode MoralCode added the waiting This change is waiting for some other changes to land first label Jan 27, 2026
@MoralCode MoralCode marked this pull request as draft January 27, 2026 03:08
@MoralCode MoralCode modified the milestones: v0.93.0, v0.94.0 Mar 6, 2026
Signed-off-by: Manish Tiwari <manish.tiwari.09@zohomail.in>

Labels

database: Related to Augur's unified data model
discussion: Seeking active feedback, usually for items under active development
pending changes: PRs that have review comments that have been added but haven't been addressed in a while
waiting: This change is waiting for some other changes to land first

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Full table scans on every batch in messages and events collection

5 participants