Fix: Eliminate redundant full table scans in messages and events collection#3444
Fix: Eliminate redundant full table scans in messages and events collection#3444PredictiveManish wants to merge 4 commits intoaugurlabs:mainfrom
Conversation
sgoggins
left a comment
There was a problem hiding this comment.
One question regarding the flip from 20 to 1,000 size batch minimums.
|
I just have non-blocking code quality suggestion: use a named constant like |
Yep, agreed |
What to give the size 500 or 200? as @MoralCode suggested for 200 |
200 |
|
@MoralCode / @PredictiveManish : I'm rerunning the failed end to end test. Sometimes GitHub gets overwhelmed and they just timeout. |
sgoggins
left a comment
There was a problem hiding this comment.
@PredictiveManish : May we assume you ran this locally and collected data?
Lets not assume - would rather accidentally over-test than not test at all. |
| # Build mappings once before processing any messages | ||
| # create mapping from issue url to issue id of current issues | ||
| issue_url_to_id_map = {} | ||
| issues = augur_db.session.query(Issue).filter(Issue.repo_id == repo_id).all() | ||
| for issue in issues: | ||
| issue_url_to_id_map[issue.issue_url] = issue.issue_id | ||
|
|
||
| # create mapping from pr url to pr id of current pull requests | ||
| pr_issue_url_to_id_map = {} | ||
| prs = augur_db.session.query(PullRequest).filter(PullRequest.repo_id == repo_id).all() | ||
| for pr in prs: | ||
| pr_issue_url_to_id_map[pr.pr_issue_url] = pr.pull_request_id | ||
|
|
There was a problem hiding this comment.
Above you rely on some functions for these mappings, Is this code just a duplicate of the code in those functions? If those functions are useful elsewhere and not tied in anywhere maybe they can be useful utility functions that can be imported here too?
Overall id recommend splitting this PR so that the changes to events.py can be merged without being held up by the larger question of refactoring ( ref #3345) that this file brings up
There was a problem hiding this comment.
@PredictiveManish : I agree with @MoralCode that splitting the events.py from the other refactoring would make this more straightforward to merge and test.
There was a problem hiding this comment.
sounds good, will focus on that PR first. This one is still on hold until we figure out how best to share the same function that was proposed in events.py
There was a problem hiding this comment.
yeah, for events.py I have created new PR as suggested for split.
Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
| # Build mappings once before processing any messages | ||
| # create mapping from issue url to issue id of current issues | ||
| issue_url_to_id_map = {} | ||
| issues = augur_db.session.query(Issue).filter(Issue.repo_id == repo_id).all() | ||
| for issue in issues: | ||
| issue_url_to_id_map[issue.issue_url] = issue.issue_id | ||
|
|
||
| # create mapping from pr url to pr id of current pull requests | ||
| pr_issue_url_to_id_map = {} | ||
| prs = augur_db.session.query(PullRequest).filter(PullRequest.repo_id == repo_id).all() | ||
| for pr in prs: | ||
| pr_issue_url_to_id_map[pr.pr_issue_url] = pr.pull_request_id | ||
|
|
There was a problem hiding this comment.
sounds good, will focus on that PR first. This one is still on hold until we figure out how best to share the same function that was proposed in events.py
Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
e699932 to
40ec167
Compare
…ection Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
Signed-off-by: PredictiveManish <manish.tiwari.09@zohomail.in>
40ec167 to
ac28a75
Compare
Signed-off-by: Manish Tiwari <manish.tiwari.09@zohomail.in>
Description
Moved mapping queries outside batch loops and pass pre-built mappings as parameters to processing functions, following the pattern established in #3439.
Solves #3440
Changes Made
augur/tasks/github/messages.pyissue_url_to_id_mapandpr_issue_url_to_id_maponce incollect_github_messages()before any batch processingprocess_messages()to accept mappings as parameters instead of rebuilding themprocess_large_issue_and_pr_message_collection()to accept and pass mappingsaugur/tasks/github/events.pyissue_url_to_id_mapandpr_url_to_id_maponce inBulkGithubEventCollection.collect()before the batch loop_process_events(),_process_issue_events(), and_process_pr_events()to accept mappings as parameters_get_map_from_*()calls from batch processing methodsPerformance Improvement
Before: 1,000 messages -> 50 full scans of issues AND PRs tables
After: 1,000 messages -> 1 full scan of each table (50x reduction)
Before: 10,000 events -> 40 full scans total
After: 10,000 events → 1 full scan of each table (40x reduction)
This PR fixes #3440
Notes for Reviewers
Signed commits