This repository was archived by the owner on Nov 4, 2025. It is now read-only.
48 commits:
71af697 catch-all exception for docker pull (#366) (sedrick-keh-tri, Mar 25, 2025)
006a760 Clean diff after setup (#353) (marcmac, Mar 28, 2025)
e19aabe Simplify installation guidelines for inference submodule (carlosejimenez, Apr 9, 2025)
ae22bf4 Fixes #368 (carlosejimenez, Apr 9, 2025)
c8d7763 Fix/missing text column (#376) (carlosejimenez, Apr 9, 2025)
0c0de95 Merge remote-tracking branch 'refs/remotes/origin/main' (carlosejimenez, Apr 17, 2025)
9b9b9d3 Docs (#381) (carlosejimenez, Apr 18, 2025)
65237b8 deploy docs (carlosejimenez, Apr 18, 2025)
de31aa5 Update pytest workflow python version (carlosejimenez, Apr 18, 2025)
3f01bd6 Update docs link (carlosejimenez, Apr 18, 2025)
6c36e5d Update README.md (carlosejimenez, Apr 22, 2025)
9f836d4 Doc: Add links to other github repos (#387) (klieret, Apr 30, 2025)
11710a2 Doc: Fix closing div (klieret, May 4, 2025)
f98dd10 CI: Remove griffe-pydantic from mkdocs extensions (#391) (klieret, May 4, 2025)
fea293e Fix loading of jsonl data (#390) (klieret, May 4, 2025)
9f55b52 Support multilingual evaluation (#392) (kabirgh, May 6, 2025)
35a4152 Fix: the pbar doesn't update immediately when futures fail unless the… (gameofby, May 6, 2025)
3fd9e87 fix: preserve all issue references with same keyword in PRs (#358) (Henry-Jessie, May 6, 2025)
124897d Add test for PR #358 (john-b-yang, May 6, 2025)
b627f5c Update README.md (#388) (mattk7, May 6, 2025)
d47ae07 Urgent, there is a bug when generate prompt_col in create_text_datase… (aacedar, May 6, 2025)
a525e96 Minor #369 fix (john-b-yang, May 6, 2025)
af0938c Release 4.0.1 (john-b-yang, May 7, 2025)
35fee16 Release 4.0.2 (john-b-yang, May 7, 2025)
547035d Release 4.0.3 (john-b-yang, May 7, 2025)
c6e7858 Doc: Update inference.md (#397) (lacabra, May 15, 2025)
6a83d74 remove problematic content slicing (ryanhoangt, May 22, 2025)
b8ffb7b skip content slicing on Modal only (ryanhoangt, May 27, 2025)
e3a6d5b Fix docs (john-b-yang, May 27, 2025)
0c8d9f5 Update PR template (carlosejimenez, Jun 1, 2025)
54426ef Add more informative log for prepare_images script (carlosejimenez, Jun 1, 2025)
fe2d3d1 fixes #58 - overestimating recall for small k (#409) (carlosejimenez, Jun 1, 2025)
1be8dbb Update namespace arg type for argparser to do null-conversion implicitly (carlosejimenez, Jun 1, 2025)
6a932bc Add clean step for requirements / environments for python constants p… (carlosejimenez, Jun 2, 2025)
aef44cb Update README.md (ofirpress, Jun 19, 2025)
13c622a fixes #421 - fix mention of private dockerhub link in readme (#422) (Jun 19, 2025)
9d79b3e Fix: Allow empty namespace correctly. Ref #423 (#424) (buaabarty, Jun 19, 2025)
dcf99be update modal docs + fix deprecations (#426) (zhang-lucy, Jun 24, 2025)
cd60a86 Fix mm dev p5js log parsing (carlosejimenez, Jun 26, 2025)
42703a6 Fix chartjs log parsing (carlosejimenez, Jun 26, 2025)
fc9161e support return exceptions (#431) (zhang-lucy, Jul 1, 2025)
2bf15e1 CI: Queue up doc pushes (#428) (klieret, Jul 1, 2025)
e32ac10 Fix docs for prediction format (carlosejimenez, Jul 17, 2025)
665c1ce fix(build): fix python base images requirement types-setuptools incor… (geeker-smallwhite, Jul 17, 2025)
63dce46 Add tests for harness utils (carlosejimenez, Jul 17, 2025)
c7c22a9 Release v4.0.4 (carlosejimenez, Jul 17, 2025)
03846bf Merge branch 'main' into fix-modal-patch-eval (ryanhoangt, Jul 25, 2025)
aa0f1ed add extra validation for make_run_report (ryanhoangt, Aug 15, 2025)
2 changes: 1 addition & 1 deletion .github/PULL_REQUEST_TEMPLATE.md
@@ -19,4 +19,4 @@ been modified for this fix.

#### Any other comments?

🧡 Thanks for contributing!
<!-- 🧡 Thanks for contributing! -->
48 changes: 48 additions & 0 deletions .github/workflows/deploy-docs.yml
@@ -0,0 +1,48 @@
name: build-docs

# Prevent concurrent runs that could conflict when pushing to gh-pages
concurrency:
group: build-docs-${{ github.ref }}
cancel-in-progress: false

on:
push:
branches:
- main
- "build-docs-*"
pull_request:
branches:
- main
permissions:
contents: write
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Configure Git Credentials
run: |
git config user.name github-actions[bot]
git config user.email 41898282+github-actions[bot]@users.noreply.github.com
- uses: actions/setup-python@v5
with:
python-version: 3.x
- run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
- uses: actions/cache@v4
with:
key: mkdocs-material-${{ env.cache_id }}
path: .cache
restore-keys: |
mkdocs-material-
- name: Install uv
run: |
curl -LsSf https://astral.sh/uv/install.sh | sh
- run: uv pip install --python ${Python_ROOT_DIR} '.[docs]'
- name: Build Documentation
if: github.ref != 'refs/heads/main'
run: mkdocs build
- name: Build + Deploy Documentation
if: github.ref == 'refs/heads/main'
run: mike deploy --push 1.0 latest
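The `cache_id` step above rotates the mkdocs-material cache weekly by keying it on the ISO week number (`date --utc '+%V'`). A small Python sketch of the equivalent computation, to show what key the workflow produces (the actual week depends on the current date, so no fixed output is assumed):

```python
import datetime

# ISO week number in UTC, equivalent to `date --utc '+%V'` in the workflow.
week = datetime.datetime.now(datetime.timezone.utc).isocalendar()[1]

# The resulting actions/cache key, zero-padded to match date's %V output.
cache_key = f"mkdocs-material-{week:02d}"
print(cache_key)
```

Because the key changes every week, `restore-keys: mkdocs-material-` still lets a new week's run start from the previous week's cache.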
2 changes: 1 addition & 1 deletion .github/workflows/pytest.yaml
@@ -34,7 +34,7 @@ jobs:
uses: actions/checkout@v2
- uses: actions/setup-python@v5
with:
python-version: '3.9'
python-version: '3.10'
- name: Install uv
run: |
curl -LsSf https://astral.sh/uv/install.sh | sh
64 changes: 42 additions & 22 deletions README.md
@@ -1,14 +1,16 @@
<p align="center">
<a href="http://swe-bench.github.io">
<img src="assets/figures/swellama_banner.svg" style="height: 10em" alt="Kawi the SWE-Llama" />
<img src="docs/assets/figures/swellama_banner.svg" style="height: 10em" alt="Kawi the SWE-Llama" />
</a>
</p>

<div align="center">

| [日本語](docs/README_JP.md) | [English](https://github.com/swe-bench/SWE-bench) | [中文简体](docs/README_CN.md) | [中文繁體](docs/README_TW.md) |
<p align="center"><strong>[&nbsp;<a href="https://swebench.com/SWE-bench/">Read the Docs</a>&nbsp;]</strong></p>

</div>
<p align="center">
<a href="docs/other_languages/README_JP.md">日本語</a> |
<a href="docs/other_languages/README_CN.md">中文简体</a> |
<a href="docs/other_languages/README_TW.md">中文繁體</a>
</p>

<p align="center">
<a href="https://www.python.org/">
@@ -29,8 +31,8 @@ Code and data for the following works:
* [ICLR 2024 Oral] <a href="https://arxiv.org/abs/2310.06770">SWE-bench: Can Language Models Resolve Real-World GitHub Issues?</a>

## 📰 News
* **[Jan. 13, 2025]**: We've integrated [SWE-bench Multimodal](https://swebench.github.io/multimodal) ([paper](https://arxiv.org/abs/2410.03859), [dataset](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Multimodal)) into this repository! Unlike SWE-bench, we've kept evaluation for the test split *private*. Submit to the leaderboard using [sb-cli](https://github.com/swe-bench/sb-cli/tree/main), our new cloud-based evaluation tool.
* **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), you can now run evaluations entirely on the cloud! See [here](https://github.com/swe-bench/SWE-bench/blob/main/assets/evaluation.md#%EF%B8%8F-evaluation-with-modal) for more details.
* **[Jan. 13, 2025]**: We've integrated [SWE-bench Multimodal](https://swebench.github.io/multimodal) ([paper](https://arxiv.org/abs/2410.03859), [dataset](https://huggingface.co/datasets/SWE-bench/SWE-bench_Multimodal)) into this repository! Unlike SWE-bench, we've kept evaluation for the test split *private*. Submit to the leaderboard using [sb-cli](https://github.com/swe-bench/sb-cli/tree/main), our new cloud-based evaluation tool.
* **[Jan. 11, 2025]**: Thanks to [Modal](https://modal.com/), you can now run evaluations entirely on the cloud! See [here](https://github.com/swe-bench/SWE-bench/blob/main/docs/assets/evaluation.md#%EF%B8%8F-evaluation-with-modal) for more details.
* **[Aug. 13, 2024]**: Introducing *SWE-bench Verified*! Part 2 of our collaboration with [OpenAI Preparedness](https://openai.com/preparedness/). A subset of 500 problems that real software engineers have confirmed are solvable. Check out more in the [report](https://openai.com/index/introducing-swe-bench-verified/)!
* **[Jun. 27, 2024]**: We have an exciting update for SWE-bench - with support from [OpenAI's Preparedness](https://openai.com/preparedness/) team: We're moving to a fully containerized evaluation harness using Docker for more reproducible evaluations! Read more in our [report](https://github.com/swe-bench/SWE-bench/blob/main/docs/20240627_docker/README.md).
* **[Apr. 2, 2024]**: We have released [SWE-agent](https://github.com/SWE-agent/SWE-agent), which sets the state-of-the-art on the full SWE-bench test set! ([Tweet 🔗](https://twitter.com/jyangballin/status/1775114444370051582))
@@ -40,7 +42,7 @@ Code and data for the following works:
SWE-bench is a benchmark for evaluating large language models on real world software issues collected from GitHub.
Given a *codebase* and an *issue*, a language model is tasked with generating a *patch* that resolves the described problem.

<img src="assets/figures/teaser.png">
<img src="docs/assets/figures/teaser.png">

To access SWE-bench, copy and run the following code:
```python
@@ -68,6 +70,10 @@ python -m swebench.harness.run_evaluation \
--instance_ids sympy__sympy-20590 \
--run_id validate-gold
```
> [!NOTE]
> If you are using a macOS M-series machine or another ARM-based system, add `--namespace ''` to the above script.
> By default, the evaluation script pulls images (built for Linux) from [DockerHub](https://hub.docker.com/u/swebench).
> Adding `--namespace ''` will cause evaluation images to be built locally instead.

## 💽 Usage
Evaluate patch predictions on SWE-bench Lite with the following command:
@@ -79,6 +85,7 @@ python -m swebench.harness.run_evaluation \
--run_id <run_id>
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
# use --modal true to run on Modal
```
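For reference, the file passed via `--predictions_path` holds one prediction per task instance. A minimal sketch of writing one as JSONL, with the field names used by the SWE-bench prediction format (the model name and patch text below are placeholders, not real predictions):

```python
import json

# One record per task instance; field names follow the SWE-bench prediction format.
predictions = [
    {
        "instance_id": "sympy__sympy-20590",
        "model_name_or_path": "my-model",               # placeholder model name
        "model_patch": "diff --git a/a.py b/a.py\n",     # placeholder patch text
    }
]

# Write JSONL: one JSON object per line.
with open("predictions.jsonl", "w") as fh:
    for record in predictions:
        fh.write(json.dumps(record) + "\n")
```

Passing `--predictions_path 'gold'` skips this file entirely and evaluates the reference patches instead.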

This command will generate docker build logs (`logs/build_images`) and evaluation logs (`logs/run_evaluation`) in the current directory.
@@ -99,36 +106,38 @@ To see the full list of arguments for the evaluation harness, run:
python -m swebench.harness.run_evaluation --help
```

See the [evaluation tutorial](assets/evaluation.md) for the full rundown on datasets you can evaluate.
See the [evaluation tutorial](docs/guides/evaluation.md) for the full rundown on datasets you can evaluate.
If you're looking for non-local, cloud-based evaluations, check out...
* [sb-cli](https://github.com/swe-bench/sb-cli), our tool for running evaluations automatically on AWS, or...
* Running SWE-bench evaluation on [Modal](https://modal.com/). Details [here](https://github.com/swe-bench/SWE-bench/blob/main/assets/evaluation.md#%EF%B8%8F-evaluation-with-modal)
* Running SWE-bench evaluation on [Modal](https://modal.com/). Details [here](docs/guides/evaluation.md#Cloud-Based-Evaluation)

Additionally, you can:
* [Train](https://github.com/swe-bench/SWE-bench/tree/main/swebench/inference/make_datasets) your own models on our pre-processed datasets.
* Run [inference](https://github.com/swe-bench/SWE-bench/blob/main/swebench/inference/README.md) on existing models (both local and API models). The inference step is where you give the model a repo + issue and have it generate a fix.
* Run SWE-bench's [data collection procedure](https://github.com/swe-bench/SWE-bench/blob/main/swebench/collect/) ([tutorial](assets/collection.md)) on your own repositories, to make new SWE-Bench tasks.
* [Train](https://github.com/swe-bench/SWE-bench/tree/main/swebench/inference/make_datasets) your own models on our pre-processed datasets. (🆕 Check out [SWE-smith](https://swesmith.com/), a dedicated toolkit for creating SWE training data.)
* Run [inference](docs/reference/inference.md) on existing models (both local and API models). The inference step is where you give the model a repo + issue and have it generate a fix.
* Run SWE-bench's [data collection procedure](https://github.com/swe-bench/SWE-bench/blob/main/swebench/collect/) ([tutorial](docs/guides/collection.md)) on your own repositories, to make new SWE-Bench tasks.
* ⚠️ We are temporarily pausing support for queries around creating SWE-bench instances. Please see the note in the tutorial.

## ⬇️ Downloads
| Datasets | Models | RAG |
| - | - | - |
| [💿 SWE-bench](https://huggingface.co/datasets/princeton-nlp/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) | [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/princeton-nlp/SWE-bench_oracle) |
| [💿 SWE-bench Lite](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Lite) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) | [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_13K) |
| [💿 SWE-bench Verified](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Verified) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) | [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_27K) |
| [💿 SWE-bench Multimodal](https://huggingface.co/datasets/princeton-nlp/SWE-bench_Multimodal) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) | [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_40K) |
| | | [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/princeton-nlp/SWE-bench_bm25_50k_llama) |
| [💿 SWE-bench](https://huggingface.co/datasets/SWE-bench/SWE-bench) | [🦙 SWE-Llama 13b](https://huggingface.co/princeton-nlp/SWE-Llama-13b) | [🤗 "Oracle" Retrieval](https://huggingface.co/datasets/SWE-bench/SWE-bench_oracle) |
| [💿 SWE-bench Lite](https://huggingface.co/datasets/SWE-bench/SWE-bench_Lite) | [🦙 SWE-Llama 13b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-13b-peft) | [🤗 BM25 Retrieval 13K](https://huggingface.co/datasets/SWE-bench/SWE-bench_bm25_13K) |
| [💿 SWE-bench Verified](https://huggingface.co/datasets/SWE-bench/SWE-bench_Verified) | [🦙 SWE-Llama 7b](https://huggingface.co/princeton-nlp/SWE-Llama-7b) | [🤗 BM25 Retrieval 27K](https://huggingface.co/datasets/SWE-bench/SWE-bench_bm25_27K) |
| [💿 SWE-bench Multimodal](https://huggingface.co/datasets/SWE-bench/SWE-bench_Multimodal) | [🦙 SWE-Llama 7b (PEFT)](https://huggingface.co/princeton-nlp/SWE-Llama-7b-peft) | [🤗 BM25 Retrieval 40K](https://huggingface.co/datasets/SWE-bench/SWE-bench_bm25_40K) |
| | | [🤗 BM25 Retrieval 50K (Llama tokens)](https://huggingface.co/datasets/SWE-bench/SWE-bench_bm25_50k_llama) |

## 💫 Contributions
We would love to hear from the broader NLP, Machine Learning, and Software Engineering research communities, and we welcome any contributions, pull requests, or issues!
To do so, please file a new pull request or issue and fill in the corresponding template; we'll be sure to follow up shortly!

Contact person: [Carlos E. Jimenez](http://www.carlosejimenez.com/) and [John Yang](https://john-b-yang.github.io/) (Email: [email protected], [email protected]).

## ✍️ Citation
## ✍️ Citation & license
MIT license. Check `LICENSE.md`.

If you find our work helpful, please use the following citations.

```
```bibtex
@inproceedings{
jimenez2024swebench,
title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
@@ -148,5 +157,16 @@ If you find our work helpful, please use the following citations.
}
```

## 🪪 License
MIT. Check `LICENSE.md`.
## Our Other Projects

<div align="center">
<a href="https://github.com/SWE-bench/sb-cli"><img src="docs/assets/sbcli_logo_text_below.svg" alt="sb-cli" height="120px"></a>
&nbsp;&nbsp;
<a href="https://github.com/SWE-bench/SWE-smith"><img src="docs/assets/swesmith_logo_text_below.svg" alt="SWE-smith" height="120px"></a>
&nbsp;&nbsp;
<a href="https://github.com/SWE-agent/SWE-agent"><img src="docs/assets/sweagent_logo_text_below.svg" alt="SWE-agent" height="120px"></a>
&nbsp;&nbsp;
<a href="https://github.com/SWE-agent/SWE-ReX"><img src="docs/assets/swerex_logo_text_below.svg" alt="SWE-ReX" height="120px"></a>
&nbsp;&nbsp;
<!-- <a href="https://github.com/SWE-bench/SWE-bench"><img src="docs/assets/swebench_logo_text_below.svg" alt="SWE-bench" height="120px"></a> -->
</div>
145 changes: 0 additions & 145 deletions docs/20240406_devin_validate/get_devin_preds.ipynb

This file was deleted.

24 changes: 0 additions & 24 deletions docs/20240406_devin_validate/report.md

This file was deleted.
