Discrepancies in Agent Evaluation: Unit Test Coverage, Asymmetric Feedback, Test Leakage, and Leaderboard Ambiguity #3
Description
Hi PatchEval Authors,
Thank you for releasing PatchEval. It is a highly valuable benchmark for automated vulnerability repair (AVR), with a welcome focus on real-world, multi-language CVEs and dynamic sandbox testing.
While diving into the evaluation framework and the container setups to understand how agents are evaluated (specifically in settings S1.4 and S2.2), we noticed a few critical discrepancies between the paper's claims and the actual codebase implementation. We would appreciate your clarifications on the following points:
1. Strict metric interpretability under missing unit tests
The paper frames successful repair under strict criteria and reports headline results over 230 cases (e.g., 23.0% (53/230) in arXiv-2511.11019v1/src/Introdution.tex:59 and arXiv-2511.11019v1/src/Evaluation.tex:119; benchmark overview also states 230 sandboxed CVEs in README.md:32 and docs/HUGGINGFACE_README.md:4).
However, based on patcheval_dataset.json metadata and evaluation logic:
- We counted from dataset fields: among the 230 sandboxed cases (`poc_test_cmd != null`), 176 have `unit_test_cmd` and 54 do not (derived from `poc_test_cmd`/`unit_test_cmd`; the schema explicitly allows `unit_test_cmd` to be null in docs/HUGGINGFACE_README.md:19).
- In the evaluation code, a missing `unit_test.sh` is treated as a pass:
  - existence check: patcheval/evaluation/run_evaluation.py:170
  - if missing: `return True, "No unit-test"` at patcheval/evaluation/run_evaluation.py:173
- Strict success is then assigned when both PoC and `unittest_result` are truthy (patcheval/evaluation/run_evaluation.py:206 to :207).
- The final strict pass-rate denominator is `total_patches = len(patchs)` (patcheval/evaluation/run_evaluation.py:393), passed into the strict summary (patcheval/evaluation/run_evaluation.py:395 to :396), i.e., typically 230 in artifact runs.
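For concreteness, the logic above and the additional 176-case metric we are suggesting can be sketched as follows. This is our own illustration, not code from the artifact; only the field names (`poc_test_cmd`, `unit_test_cmd`) come from the dataset schema, while `poc_pass`/`unit_pass` stand in for per-case evaluation outcomes:

```python
# Sketch of the current strict metric vs. a unit-test-only variant.
# Field names follow the dataset schema; poc_pass/unit_pass are
# placeholders for the per-case evaluation results.
def strict_summaries(cases):
    sandboxed = [c for c in cases if c["poc_test_cmd"] is not None]
    with_unit = [c for c in sandboxed if c["unit_test_cmd"] is not None]
    # Current behaviour: a missing unit_test.sh counts as a unit pass
    # (run_evaluation.py:170-173), so those cases are effectively PoC-only.
    strict = sum(
        c["poc_pass"] and (c["unit_pass"] if c["unit_test_cmd"] else True)
        for c in sandboxed
    )
    # Proposed extra report: strict "PoC & Unit" over only the cases
    # that actually ship a regression suite.
    strict_unit = sum(c["poc_pass"] and c["unit_pass"] for c in with_unit)
    return (strict, len(sandboxed)), (strict_unit, len(with_unit))
```

In artifact runs the two denominators would be 230 and 176, respectively.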
Question: Since 54/230 cases have no unit-test script and are effectively treated as unit-pass in strict mode, are there plans to additionally report strict “PoC & Unit” metrics over only the 176 cases that actually support regression testing?
2. Asymmetric Feedback Loop in Agent Settings (S1.4 & S2.2)
The paper emphasizes feedback-driven iterative refinement, but in agent wrappers, feedback appears PoC-only:
- The SWE-Agent feedback template instructs repeated `check_vulnerability` runs (patcheval/exp_agent/sweagent/configs/template_with_feedback.yaml:25 and :27).
- The SWE-Agent `check_vulnerability` tool executes `bash /workspace/vul-run.sh` (patcheval/exp_agent/sweagent/sweagent_diff.patch:422), not `unit_test.sh`.
- The OpenHands feedback settings are explicitly marked "w. feedback & w.o. test func" (patcheval/exp_agent/openhands/shells/run_all_exp2.sh:28, patcheval/exp_agent/openhands/shells/run_all_exp3.sh:28).
- The OpenHands check action also runs `bash /workspace/vul-run.sh` (patcheval/exp_agent/openhands/openhands_diff.patch:6738 and :8739).
We understand that the final evaluation still includes unit tests when available (as discussed in Point 1); our concern here is that the iterative feedback signal appears to be PoC-only, i.e., unit-test feedback is not exposed to the agent even when such tests are available.
Question: Given that iterative repair in practice relies on both security and regression signals, does PoC-only feedback in S1.4/S2.2 deviate from the intended “realistic development workflow” positioning?
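If useful, here is a minimal sketch of what a combined-signal check could look like. The wrapper itself is purely our own illustration; only the script names (`vul-run.sh`, `unit_test.sh`) and the workspace layout come from the artifact:

```python
import os
import subprocess

def check_with_both_signals(workspace="/workspace"):
    """Hypothetical check tool returning PoC *and* regression feedback.
    This is an illustration of the suggested behaviour, not existing code."""
    poc = subprocess.run(
        ["bash", os.path.join(workspace, "vul-run.sh")], capture_output=True
    ).returncode
    lines = [f"PoC check: {'pass' if poc == 0 else 'fail'}"]
    unit = os.path.join(workspace, "unit_test.sh")
    if os.path.exists(unit):  # present in 176 of the 230 sandboxed cases
        rc = subprocess.run(["bash", unit], capture_output=True).returncode
        lines.append(f"Unit tests: {'pass' if rc == 0 else 'fail'}")
    else:
        lines.append("Unit tests: not available for this CVE")
    return "\n".join(lines)
```

With something like this, the agent would receive the same strict signal it is ultimately graded on, whenever the regression suite exists.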
3. Potential Test Leakage
We appreciate the current protections around PoC artifacts. In particular, test.patch is moved out of the agent-visible workspace (e.g., to /tmp/secret) in the feedback setups (e.g., patcheval/exp_agent/sweagent/configs/template_with_feedback.yaml:43, and similarly in OpenHands feedback scripts). We also understand that workspace state is restored between checks, which prevents persistent workspace contamination from temporary git apply operations.
That said, this does not fully remove target leakage: even if agents cannot read the injected test code from test.patch after rollback, they can still read the exact PoC target selector from visible scripts.
- In CVE-2015-1326, `vul-run.sh:6` includes `tests/test_api.py::TestTemplates::test_local`.
- In CVE-2015-8213, `vul-run.sh:6` includes `./runtests.py i18n.tests.FormattingTests.test_format_arbitrary_settings`.
By contrast, broader regression checks are in unit_test.sh:
- In CVE-2015-1326, `unit_test.sh:6` includes `tests/test_api.py`.
- In CVE-2015-8213, `unit_test.sh:6` includes `./runtests.py i18n.tests`.
So, while rollback/hiding helps with code-level contamination, the PoC target name itself is still visible and can act as optimization guidance for agents.
You already hide patch artifacts (e.g., moving test.patch to /tmp/secret), but do not apply the same visibility control to runner scripts such as vul-run.sh. As a result, PoC test selectors remain plaintext-accessible to agents.
A straightforward mitigation is to treat runner scripts as protected artifacts as well (hide/move them and expose only a black-box check_vulnerability interface), with (optional) output redaction to reduce leakage in logs.
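As a concrete (purely illustrative) sketch of that mitigation, assuming the `/tmp/secret` location already used for `test.patch` — the function names, verdict strings, and redaction pattern below are ours, not from the artifact:

```python
import re
import shutil
import subprocess

# Illustrative pattern for pytest-style node IDs and runtests.py selectors;
# real runners would likely need per-ecosystem patterns.
SELECTOR = re.compile(r"\S+::\S+|runtests\.py\s+\S+")

def hide_runner(workspace="/workspace", secret="/tmp/secret"):
    """Move vul-run.sh out of the agent-visible workspace,
    mirroring what is already done for test.patch."""
    shutil.move(f"{workspace}/vul-run.sh", f"{secret}/vul-run.sh")

def check_vulnerability(secret="/tmp/secret"):
    """Black-box check: the agent sees only a verdict plus redacted logs,
    never the PoC target selector itself."""
    proc = subprocess.run(
        ["bash", f"{secret}/vul-run.sh"], capture_output=True, text=True
    )
    verdict = "check passed" if proc.returncode == 0 else "check failed"
    redacted = SELECTOR.sub("<redacted-test-id>", proc.stdout + proc.stderr)
    return verdict, redacted
```

This keeps the feedback loop intact while removing the test selector as an optimization target.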
In addition, CVE-specific path naming (e.g., /workspace/PoC_env/CVE-2015-8213/...) may unintentionally provide extra retrieval anchors for agents.
4. Insufficient Documentation Regarding Agent Environment and Feedback
The current documentation appears fragmented: some agent-specific files (e.g., under exp_agent) include hints such as "Test feedback" / "w_test_wo_func", but neither the top-level README nor the paper explicitly states that feedback in S1.4/S2.2 is PoC-oriented and does not provide unit-test feedback to agents.
Question: Could you clarify this in the top-level README (and optionally the paper artifact notes) by explicitly documenting the execution constraints, file visibility rules, and the feedback scope used in agent runs?
5. Ambiguous Leaderboard Categorization (Missing the "Feedback" Dimension)
(Leaderboard captured 2026/03/07, around 1 PM.)
The current public leaderboard categorizes results strictly under "Location Oracle" and "End-to-End". However, the paper's ablation study shows massive performance gaps based on the presence of feedback (e.g., SWE-Agent jumping from 32 in S2.1 to 87 in S2.2).
Question: Without specifying whether the leaderboard scores reflect the single-turn setting (S1.1/S2.1) or the multi-turn feedback setting (S1.4/S2.2), it is difficult for the community to accurately interpret the rankings. Could you please add explicit columns or tags on the leaderboard to distinguish between "Single-turn (No Feedback)" and "Multi-turn (With Feedback)"? Furthermore, it would be helpful to explicitly state on the leaderboard that "Feedback" currently implies PoC-only feedback.
6. Missing Claude-on-Claude Baseline for the Claude Code Compatibility Claim
The current interpretation of Claude Code's poor cross-model performance seems under-supported, because the paper does not report the most important control: a Claude Code run backed by a Claude-family model.
In arXiv v1, Section 4 explicitly says the agent study only pairs each agent with the top three standalone models (GPT-4.1, Gemini 2.5, and DeepSeek-R1), and Table 6 correspondingly includes only those three columns, yet the paper then attributes Claude Code’s degradation to possible cross-model incompatibility and says it may be optimized for Claude-series models.
The released artifact is consistent with that omission: the patcheval/log/agent/claudecode logs only contain Gemini/GPT/DeepSeek/Doubao runs, with no Claude-backed Claude Code result.
Moreover, the LaTeX source still contains commented-out Claude4 rows and even a commented sentence referring to Claude-Code-Claude4, which suggests this baseline may have existed at some point but was not included in the public version.
As of March 8, 2026, the live PatchEval leaderboard still shows Claude Code paired with Gemini 2.5, GPT-4.1, DeepSeek R1, and later GPT-5, but no Claude-family model.
Without an in-family Claude Code + Claude baseline, the claim that the observed degradation is mainly due to cross-model mismatch is difficult to validate; please either publish that baseline (if it exists) or soften this interpretation and present it more clearly as an untested hypothesis.
Thank you for your time and for contributing this massive dataset to the community. We look forward to your insights on these evaluation details!
Best regards,
Pser