I originally built this project to create an evaluation dataset for a code review bot company. It provides a set of tools to scrape GitHub pull request comments, identify potential bug reports, and classify them by difficulty. The resulting dataset can be used for training and evaluating models that detect bugs or analyze code review practices.
The process is broken down into several steps, managed by different scripts:
- **Scrape Data**: The `scraper.ts` script fetches PR review comments from GitHub based on a search query (see the sketch after this list).
- **Review Bugs**: The raw comments are reviewed by an LLM to determine whether they represent valid bug reports. This is handled by `review-bugs.ts`.
- **Assess Difficulty**: Bugs that pass the review are then assessed for difficulty (Easy, Medium, Hard) using `review-missing-difficulty.ts`.
- **Analyze Data**: You can get statistics on the collected data using `analyze-bugs.ts`.
- **Extract Data**: Finally, you can extract subsets of the data, for example all "easy" bugs, using scripts like `extract-easy-bugs.ts`.
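The search query and helper names below are illustrative rather than taken from `scraper.ts`, but a minimal sketch of fetching PR review comments through the GitHub REST API (Node 18+, which provides a global `fetch`) might look like this:

```typescript
// Hedged sketch: search GitHub for PRs matching a query, then fetch the
// review comments on one PR. Query and helper names are illustrative.
const GITHUB_API = "https://api.github.com";
const headers = {
  Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
  Accept: "application/vnd.github+json",
};

// GET /search/issues — returns issues and PRs matching the query.
async function searchPullRequests(query: string) {
  const res = await fetch(
    `${GITHUB_API}/search/issues?q=${encodeURIComponent(query)}&per_page=30`,
    { headers }
  );
  if (!res.ok) throw new Error(`GitHub search failed: ${res.status}`);
  const body = await res.json();
  return body.items as { number: number; repository_url: string }[];
}

// GET /repos/{owner}/{repo}/pulls/{pull_number}/comments — review comments.
async function fetchReviewComments(owner: string, repo: string, pull: number) {
  const res = await fetch(
    `${GITHUB_API}/repos/${owner}/${repo}/pulls/${pull}/comments`,
    { headers }
  );
  if (!res.ok) throw new Error(`Comment fetch failed: ${res.status}`);
  return res.json();
}
```

The real script also logs scanned PRs to `scanned_prs.txt` so repeated runs skip work that has already been done.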
These are the main scripts that drive the workflow.
- `scraper.ts`: Scrapes GitHub for PR review comments that might contain bug reports. It saves the results in `bugs.json`.
- `review-bugs.ts`: Uses an LLM to review the comments in `bugs.json` and determine whether each one is a valid bug report (see the sketch after this list).
- `analyze-bugs.ts`: Provides statistics on the dataset, such as the number of bugs, review pass rate, and difficulty breakdown.
- `review-missing-difficulty.ts`: Finds bugs in `bugs.json` that are missing a difficulty assessment and uses an LLM to classify them.
- `extract-easy-bugs.ts`: Extracts all bugs marked as "easy" from `bugs.json` and saves them to `easy-bugs.json`.
- `extract-medium-bugs.ts`: Extracts a specified number of bugs marked as "medium" from `bugs.json` and saves them to `medium-bugs.json`.
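At its core, the LLM review is one classification call per comment. The sketch below uses the `openai` npm package; the model name and prompt wording are assumptions, not necessarily what `review-bugs.ts` actually uses:

```typescript
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Ask the model whether a PR review comment is a valid bug report.
// Model choice and prompt are illustrative assumptions.
async function isValidBugReport(commentBody: string): Promise<boolean> {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You review GitHub PR comments. Answer YES if the comment reports " +
          "a concrete bug in the code under review, otherwise answer NO.",
      },
      { role: "user", content: commentBody },
    ],
  });
  const answer = response.choices[0].message.content ?? "";
  return answer.trim().toUpperCase().startsWith("YES");
}
```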
These scripts are used for more specific or one-off tasks.
- `review-single.ts`: Reviews a single bug given a link to the comment.
- `review-single-difficulty.ts`: Assesses the difficulty of a single bug.
- `generate-balanced-sample.ts`: Generates a smaller, balanced sample of bugs from different repositories (see the sketch after this list).
- `remove-*.ts`: Various scripts for cleaning up the data (e.g., `remove-failed-reviews.ts`, `remove-self-comments.ts`).
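The balancing strategy inside `generate-balanced-sample.ts` isn't spelled out above; one plausible approach is to cap how many bugs any single repository contributes, roughly like this sketch (the `repo` field and `perRepo` parameter are assumptions):

```typescript
// Hedged sketch: take at most `perRepo` bugs from each repository so that
// no single repo dominates the sample. Field names are assumptions.
function balancedSample<T extends { repo: string }>(
  bugs: T[],
  perRepo: number
): T[] {
  const byRepo = new Map<string, T[]>();
  for (const bug of bugs) {
    const group = byRepo.get(bug.repo) ?? [];
    group.push(bug);
    byRepo.set(bug.repo, group);
  }
  const sample: T[] = [];
  for (const group of byRepo.values()) {
    sample.push(...group.slice(0, perRepo));
  }
  return sample;
}
```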
- `bugs.json`: The main data file containing all the scraped comments and their metadata, including review status and difficulty (a presumed shape is sketched after this list).
- `easy-bugs.json`: A subset of `bugs.json` containing only the bugs classified as "easy".
- `medium-bugs.json`: A subset of `bugs.json` containing bugs classified as "medium".
- `scanned_prs.txt`: A log file that keeps track of the PRs that have already been scanned, to avoid duplicate work.
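The schema of `bugs.json` isn't documented above. Inferring from the workflow (scrape, review, difficulty), one entry plausibly looks like the interface below; every field name here is an assumption:

```typescript
// Hypothetical shape of one entry in bugs.json; all field names are
// assumptions inferred from the scrape -> review -> difficulty workflow.
interface BugRecord {
  repo: string;            // e.g. "owner/name"
  prNumber: number;        // pull request the comment came from
  commentUrl: string;      // link to the review comment
  commentBody: string;     // raw comment text
  reviewPassed?: boolean;  // set by review-bugs.ts
  difficulty?: "easy" | "medium" | "hard"; // set by review-missing-difficulty.ts
}
```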
- Install dependencies:

  ```bash
  npm install
  ```

- Create an environment file: Create a `.env` file in the root of the project and add your GitHub and OpenAI API keys (see the sketch after this list):

  ```
  GITHUB_TOKEN=your_github_token
  OPENAI_API_KEY=your_openai_api_key
  ```
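The scripts presumably load these variables with something like `dotenv`; a minimal fail-fast check at the top of a script could look like this sketch:

```typescript
import "dotenv/config"; // loads .env into process.env

// Illustrative guard: abort early if either required key is missing.
for (const key of ["GITHUB_TOKEN", "OPENAI_API_KEY"]) {
  if (!process.env[key]) {
    throw new Error(`Missing required environment variable: ${key}`);
  }
}
```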
You can run the scripts using `npx ts-node`.
- Scrape for new bugs:

  ```bash
  npx ts-node scraper.ts
  ```

- Review all bugs:

  ```bash
  npx ts-node review-bugs.ts
  ```

- Assess difficulty for unassessed bugs:

  ```bash
  npx ts-node review-missing-difficulty.ts
  ```

- Analyze the dataset:

  ```bash
  npx ts-node analyze-bugs.ts
  ```

- Extract all easy bugs:

  ```bash
  npx ts-node extract-easy-bugs.ts
  ```

- Extract 300 medium bugs (the count argument is discussed after this list):

  ```bash
  npx ts-node extract-medium-bugs.ts 300
  ```
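The trailing `300` is the number of medium bugs to extract. `extract-medium-bugs.ts` presumably reads it from `process.argv`, along these lines (the exact validation is an assumption):

```typescript
// Hedged sketch of the argument handling: first positional arg is the count.
const count = Number.parseInt(process.argv[2] ?? "", 10);
if (Number.isNaN(count) || count <= 0) {
  console.error("Usage: npx ts-node extract-medium-bugs.ts <count>");
  process.exit(1);
}
```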
- Add support for scraping GitLab and other platforms.
- Improve the accuracy of the bug identification model.
- Create a web interface for easier data exploration.