
Discrepancies in Agent Evaluation: Unit Test Coverage, Asymmetric Feedback, Test Leakage, and Leaderboard Ambiguity #3

@xuanhao44


Hi PatchEval Authors,

Thank you for releasing PatchEval. It is a highly valuable benchmark for automated vulnerability repair (AVR) with a great focus on real-world, multi-language CVEs and dynamic sandbox testing.

While diving into the evaluation framework and the container setups to understand how agents are evaluated (specifically in settings S1.4 and S2.2), we noticed a few critical discrepancies between the paper's claims and the actual codebase implementation. We would appreciate your clarifications on the following points:

1. Strict metric interpretability under missing unit tests

The paper frames successful repair under strict criteria and reports headline results over 230 cases (e.g., 23.0% (53/230) in arXiv-2511.11019v1/src/Introdution.tex:59 and arXiv-2511.11019v1/src/Evaluation.tex:119; benchmark overview also states 230 sandboxed CVEs in README.md:32 and docs/HUGGINGFACE_README.md:4).

However, based on patcheval_dataset.json metadata and evaluation logic:

  • We counted from dataset fields: In the 230 sandboxed cases (poc_test_cmd != null), 176 have unit_test_cmd and 54 do not (derived from poc_test_cmd/unit_test_cmd; schema explicitly allows unit_test_cmd to be null in docs/HUGGINGFACE_README.md:19).
  • In evaluation code, missing unit_test.sh is treated as pass:
    • existence check: patcheval/evaluation/run_evaluation.py:170
    • if missing: return True, "No unit-test" at patcheval/evaluation/run_evaluation.py:173
  • Strict success is then assigned when both PoC and unittest_result are truthy (patcheval/evaluation/run_evaluation.py:206 to patcheval/evaluation/run_evaluation.py:207).
  • Final strict pass rate denominator is total_patches = len(patchs) (patcheval/evaluation/run_evaluation.py:393) passed into strict summary (patcheval/evaluation/run_evaluation.py:395 to patcheval/evaluation/run_evaluation.py:396), i.e., typically 230 in artifact runs.

Question: Since 54/230 cases have no unit-test script and are effectively treated as unit-pass in strict mode, are there plans to additionally report strict “PoC & Unit” metrics over only the 176 cases that actually support regression testing?
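To make the suggestion concrete, here is a minimal sketch of the strict-pass logic described above, extended with a companion metric over only the cases that have unit tests. The field names (`poc_passed`, `unit_passed`, `unit_test_cmd`) are hypothetical stand-ins; the actual identifiers in `run_evaluation.py` differ.

```python
def unit_test_result(case, unit_passed):
    # Mirrors the behavior cited above (run_evaluation.py:170-173):
    # a missing unit_test.sh is treated as a pass ("No unit-test").
    if case.get("unit_test_cmd") is None:
        return True
    return unit_passed

def strict_summary(cases):
    # cases: dicts with hypothetical keys poc_passed, unit_passed, unit_test_cmd.
    strict = [c["poc_passed"] and unit_test_result(c, c["unit_passed"])
              for c in cases]
    has_unit = [c.get("unit_test_cmd") is not None for c in cases]
    with_unit = [s for s, h in zip(strict, has_unit) if h]
    return {
        # Current headline metric: denominator is all sandboxed cases (230).
        "strict_all": sum(strict) / len(cases),
        # Proposed companion metric: only cases with real unit tests (176).
        "strict_with_unit": sum(with_unit) / len(with_unit) if with_unit else None,
    }
```

Reporting both numbers side by side would let readers see how much of the strict score rests on cases where the unit-test half of the criterion is vacuously true.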


2. Asymmetric Feedback Loop in Agent Settings (S1.4 & S2.2)

The paper emphasizes feedback-driven iterative refinement, but in agent wrappers, feedback appears PoC-only:

  • SWE-Agent feedback template instructs repeated check_vulnerability runs (patcheval/exp_agent/sweagent/configs/template_with_feedback.yaml:25, patcheval/exp_agent/sweagent/configs/template_with_feedback.yaml:27).
  • SWE-Agent check_vulnerability tool executes bash /workspace/vul-run.sh (patcheval/exp_agent/sweagent/sweagent_diff.patch:422), not unit_test.sh.
  • OpenHands feedback settings are explicitly marked “w. feedback & w.o. test func” (patcheval/exp_agent/openhands/shells/run_all_exp2.sh:28, patcheval/exp_agent/openhands/shells/run_all_exp3.sh:28).
  • OpenHands check action also runs bash /workspace/vul-run.sh (patcheval/exp_agent/openhands/openhands_diff.patch:6738, patcheval/exp_agent/openhands/openhands_diff.patch:8739).

We understand that the final evaluation still includes unit tests when available (as discussed in Point 1); our concern here is that the iterative feedback signal appears to be PoC-only, i.e., unit-test feedback is not exposed to the agent even when such tests are available.

Question: Given that iterative repair in practice relies on both security and regression signals, does PoC-only feedback in S1.4/S2.2 deviate from the intended “realistic development workflow” positioning?
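As an illustration of what a symmetric signal could look like, here is a hypothetical wrapper (not part of the released agent configs; the commands are parameters rather than the actual `/workspace/vul-run.sh` and `unit_test.sh` invocations, and the exit-code convention is an assumption for this sketch):

```python
import subprocess

def check_patch(poc_cmd, unit_cmd=None):
    """Run the PoC check and, when a regression suite exists, run it too,
    returning one combined feedback string for the agent.

    Assumes exit code 0 means "PoC no longer reproduces" / "tests pass".
    """
    poc = subprocess.run(poc_cmd, shell=True, capture_output=True)
    lines = ["PoC check: " + ("fixed" if poc.returncode == 0 else "still vulnerable")]
    if unit_cmd is not None:
        unit = subprocess.run(unit_cmd, shell=True, capture_output=True)
        lines.append("Regression tests: " + ("pass" if unit.returncode == 0 else "FAIL"))
    return "\n".join(lines)
```

An agent loop built on such a check would be discouraged from converging on patches that silence the PoC while breaking functionality, which seems closer to the workflow the paper describes.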


3. Potential Test Leakage

We appreciate the current protections around PoC artifacts. In particular, test.patch is moved out of the agent-visible workspace (e.g., to /tmp/secret) in the feedback setups (e.g., patcheval/exp_agent/sweagent/configs/template_with_feedback.yaml:43, and similarly in OpenHands feedback scripts). We also understand that workspace state is restored between checks, which prevents persistent workspace contamination from temporary git apply operations.

That said, this does not fully remove target leakage: even if agents cannot read the injected test code from test.patch after rollback, they can still read the exact PoC target selector from visible scripts.

  • In CVE-2015-1326: vul-run.sh:6 includes tests/test_api.py::TestTemplates::test_local.
  • In CVE-2015-8213: vul-run.sh:6 includes ./runtests.py i18n.tests.FormattingTests.test_format_arbitrary_settings.

By contrast, broader regression checks are in unit_test.sh:

  • In CVE-2015-1326: unit_test.sh:6 includes tests/test_api.py.
  • In CVE-2015-8213: unit_test.sh:6 includes ./runtests.py i18n.tests.

So, while rollback/hiding helps with code-level contamination, the PoC target name itself is still visible and can act as optimization guidance for agents.

You already hide patch artifacts (e.g., moving test.patch to /tmp/secret), but do not apply the same visibility control to runner scripts such as vul-run.sh. As a result, PoC test selectors remain plaintext-accessible to agents.

A straightforward mitigation is to treat runner scripts as protected artifacts as well (hide/move them and expose only a black-box check_vulnerability interface), with (optional) output redaction to reduce leakage in logs.
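To sketch that mitigation: a black-box wrapper could execute the hidden runner and redact selector-like tokens from its output before the agent sees it. The `/tmp/secret/vul-run.sh` path (mirroring the existing treatment of `test.patch`), the regex, and the function shape are all illustrative assumptions, not the released harness.

```python
import re
import subprocess

# Matches pytest node IDs (file::Class::test) and dotted unittest selectors
# containing ".tests." - an illustrative heuristic, not an exhaustive pattern.
SELECTOR_RE = re.compile(r"\S*(?:::|\.tests?\.)\S*")

def redact(text):
    # Mask anything that looks like a test selector before it reaches the agent.
    return SELECTOR_RE.sub("<redacted-test>", text)

def check_vulnerability():
    # The runner script lives outside the agent-visible workspace.
    result = subprocess.run(["bash", "/tmp/secret/vul-run.sh"],
                            capture_output=True, text=True)
    verdict = "fixed" if result.returncode == 0 else "still vulnerable"
    return f"PoC check: {verdict}\n" + redact(result.stdout)
```

With this shape, the agent only learns pass/fail plus redacted logs, not which exact test function encodes the vulnerability check.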

One additional note: CVE-specific path naming (e.g., /workspace/PoC_env/CVE-2015-8213/...) may also give agents unintended retrieval anchors.


4. Insufficient Documentation Regarding Agent Environment and Feedback

The current documentation appears fragmented: while some agent-specific files (e.g., under exp_agent) include hints such as "Test feedback" / "w_test_wo_func", the top-level paper/README do not explicitly state that feedback in S1.4/S2.2 is PoC-oriented and does not provide unit-test feedback to agents.

Question: Could you clarify this in the top-level README (and optionally the paper artifact notes) by explicitly documenting the execution constraints, file visibility rules, and the feedback scope used in agent runs?


5. Ambiguous Leaderboard Categorization (Missing the "Feedback" Dimension)

[Screenshot of the PatchEval leaderboard; captured 2026/03/07, around 1 PM]

https://patcheval.github.io/

The current public leaderboard categorizes results strictly under "Location Oracle" and "End-to-End". However, the paper's ablation study shows massive performance gaps based on the presence of feedback (e.g., SWE-Agent jumping from 32 in S2.1 to 87 in S2.2).

Question: Without specifying whether the leaderboard scores reflect the single-turn setting (S1.1/S2.1) or the multi-turn feedback setting (S1.4/S2.2), it is difficult for the community to accurately interpret the rankings. Could you please add explicit columns or tags on the leaderboard to distinguish between "Single-turn (No Feedback)" and "Multi-turn (With Feedback)"? Furthermore, it would be helpful to explicitly state on the leaderboard that "Feedback" currently implies PoC-only feedback.


6. Missing Claude-on-Claude Baseline for the Claude Code Compatibility Claim

The current interpretation of Claude Code’s poor cross-model performance seems under-supported because the paper does not report the most important control: a Claude Code run with a Claude-family model.

In arXiv v1, Section 4 explicitly says the agent study only pairs each agent with the top three standalone models (GPT-4.1, Gemini 2.5, and DeepSeek-R1), and Table 6 correspondingly includes only those three columns. Yet the paper then attributes Claude Code's degradation to possible cross-model incompatibility and suggests it may be optimized for Claude-series models.

The released artifact is consistent with that omission: the patcheval/log/agent/claudecode logs only contain Gemini/GPT/DeepSeek/Doubao runs, with no Claude-backed Claude Code result.

Moreover, the LaTeX source still contains commented-out Claude4 rows and even a commented sentence referring to Claude-Code-Claude4, which suggests this baseline may have existed at some point but was not included in the public version.

As of March 8, 2026, the live PatchEval leaderboard still shows Claude Code paired with Gemini 2.5, GPT-4.1, DeepSeek R1, and later GPT-5, but no Claude-family model.

Without an in-family Claude Code + Claude baseline, the claim that the observed degradation is mainly due to cross-model mismatch is difficult to validate. Please either publish that baseline (if it exists) or soften this interpretation and present it explicitly as an untested hypothesis.


Thank you for your time and for contributing this massive dataset to the community. We look forward to your insights on these evaluation details!

Best regards,
Pser
