BugDetectionBench

A benchmark dataset of real-world code review comments, designed to evaluate automated code review software/agents.

I originally built this project to create an evaluation dataset for a code review bot company. It provides a set of tools to scrape GitHub pull request review comments, identify potential bug reports, and classify them by difficulty. The resulting dataset can be used to train and evaluate bug-detection models, or to study code review practices.

Workflow

The process is broken down into several steps, managed by different scripts:

  1. Scrape Data: The scraper.ts script fetches PR review comments from GitHub based on a search query.
  2. Review Bugs: The raw comments are reviewed by an LLM to determine whether they represent valid bug reports. This is handled by review-bugs.ts (a sketch of such a call follows this list).
  3. Assess Difficulty: Bugs that pass the review are then assessed for difficulty (Easy, Medium, Hard) using review-missing-difficulty.ts.
  4. Analyze Data: You can get statistics on the collected data using analyze-bugs.ts.
  5. Extract Data: Finally, you can extract subsets of the data, for example, all "easy" bugs, using scripts like extract-easy-bugs.ts.
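
For illustration, here is a minimal sketch of the kind of LLM call step 2 might make, using the OpenAI Node SDK. The model name, prompt, and helper name are assumptions, not the exact code in review-bugs.ts.

    import OpenAI from "openai";

    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    // Hypothetical helper: ask the model whether a review comment reports a bug.
    async function isValidBugReport(comment: string): Promise<boolean> {
      const response = await openai.chat.completions.create({
        model: "gpt-4o-mini", // assumed model; the scripts may use another
        messages: [
          { role: "system", content: "Answer with exactly 'yes' or 'no'." },
          { role: "user", content: `Does this code review comment report a bug?\n\n${comment}` },
        ],
      });
      return response.choices[0].message.content?.trim().toLowerCase() === "yes";
    }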

File Structure

Core Scripts

These are the main scripts that drive the workflow.

  • scraper.ts: Scrapes GitHub for PR review comments that might contain bug reports and saves the results to bugs.json (a search sketch follows this list).
  • review-bugs.ts: Uses an LLM to review the comments in bugs.json and determine whether each one is a valid bug report.
  • analyze-bugs.ts: Provides statistics on the dataset, such as the number of bugs, review pass rate, and difficulty breakdown.
  • review-missing-difficulty.ts: Finds bugs in bugs.json that are missing a difficulty assessment and uses an LLM to classify them.
  • extract-easy-bugs.ts: Extracts all bugs marked as "easy" from bugs.json and saves them to easy-bugs.json.
  • extract-medium-bugs.ts: Extracts a specified number of bugs marked as "medium" from bugs.json and saves them to medium-bugs.json.
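
As a rough sketch, scraper.ts presumably drives something like the GitHub search API below via @octokit/rest; the exact query string, page size, and library choice are assumptions.

    import { Octokit } from "@octokit/rest";

    const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

    async function main() {
      // Hypothetical query: merged PRs whose discussion mentions a bug.
      const { data } = await octokit.rest.search.issuesAndPullRequests({
        q: 'is:pr is:merged "bug" in:comments',
        per_page: 50,
      });
      for (const pr of data.items) {
        console.log(pr.html_url); // candidate PRs to scan for review comments
      }
    }

    main().catch(console.error);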

Utility Scripts

These scripts are used for more specific or one-off tasks.

  • review-single.ts: Reviews a single bug, given a link to the comment (a link-parsing sketch follows this list).
  • review-single-difficulty.ts: Assesses the difficulty for a single bug.
  • generate-balanced-sample.ts: Generates a smaller, balanced sample of bugs from different repositories.
  • remove-*.ts: Various scripts for cleaning up the data (e.g., remove-failed-reviews.ts, remove-self-comments.ts).
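
Since review-single.ts is driven by a comment link, it presumably parses GitHub's review-comment URL format. A hypothetical parser; the function name and return shape are assumptions:

    // Hypothetical parser for review-comment links of the form
    // https://github.com/OWNER/REPO/pull/123#discussion_r456789
    function parseCommentLink(link: string) {
      const match = link.match(
        /github\.com\/([^/]+)\/([^/]+)\/pull\/(\d+)#discussion_r(\d+)/
      );
      if (!match) throw new Error(`Unrecognized comment link: ${link}`);
      const [, owner, repo, pullNumber, commentId] = match;
      return { owner, repo, pullNumber: Number(pullNumber), commentId: Number(commentId) };
    }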

Data Files

  • bugs.json: The main data file containing all the scraped comments and their metadata, including review status and difficulty (a sketch of the record shape follows this list).
  • easy-bugs.json: A subset of bugs.json containing only the bugs classified as "easy".
  • medium-bugs.json: A subset of bugs.json containing bugs classified as "medium".
  • scanned_prs.txt: A log file that keeps track of the PRs that have already been scanned to avoid duplicate work.
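
To make the data files concrete, here is a plausible TypeScript shape for one bugs.json entry. Every field name below is an assumption; the actual schema may differ.

    // Hypothetical shape of one bugs.json entry.
    interface BugRecord {
      commentUrl: string;  // link to the PR review comment
      repo: string;        // "owner/name" of the source repository
      body: string;        // the review comment text
      isValidBug?: boolean;                     // set by review-bugs.ts
      difficulty?: "easy" | "medium" | "hard";  // set by review-missing-difficulty.ts
    }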

Setup

  1. Install dependencies:

    npm install
  2. Configure credentials: Create a .env file in the root of the project and add your GitHub and OpenAI API keys (a loading sketch follows these steps):

    GITHUB_TOKEN=your_github_token
    OPENAI_API_KEY=your_openai_api_key
    
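Assuming the scripts load credentials with the dotenv package (an assumption, not confirmed by the repository), the startup check would look roughly like:

    import "dotenv/config"; // loads .env into process.env

    if (!process.env.GITHUB_TOKEN || !process.env.OPENAI_API_KEY) {
      throw new Error("Set GITHUB_TOKEN and OPENAI_API_KEY in .env");
    }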

Usage

You can run the scripts using npx ts-node.

  • Scrape for new bugs:

    npx ts-node scraper.ts
  • Review all bugs:

    npx ts-node review-bugs.ts
  • Assess difficulty for unassessed bugs:

    npx ts-node review-missing-difficulty.ts
  • Analyze the dataset:

    npx ts-node analyze-bugs.ts
  • Extract all easy bugs:

    npx ts-node extract-easy-bugs.ts
  • Extract 300 medium bugs (the count is passed as a command-line argument; see the sketch after this list):

    npx ts-node extract-medium-bugs.ts 300
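
The count passed to extract-medium-bugs.ts is presumably read from process.argv; a hedged sketch, where the default value is an assumption:

    // Hypothetical argument handling: `npx ts-node extract-medium-bugs.ts 300` → count = 300.
    const raw = process.argv[2];
    const count = raw ? Number(raw) : 100; // the default of 100 is an assumption
    if (!Number.isInteger(count) || count <= 0) {
      throw new Error(`Expected a positive bug count, got "${raw}"`);
    }
    console.log(`Extracting ${count} medium bugs...`);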

Future Work

  • Add support for scraping GitLab and other platforms.
  • Improve the accuracy of the bug identification model.
  • Create a web interface for easier data exploration.
