Discrepancies in Agent Evaluation: Unit Test Coverage, Asymmetric Feedback, Test Leakage, and Leaderboard Ambiguity #3
Description
Hi PatchEval Authors,
Thank you for releasing PatchEval. It is a highly valuable benchmark for automated vulnerability repair (AVR), with a welcome focus on real-world, multi-language CVEs and dynamic sandbox testing.
While diving into the evaluation framework and the container setups to understand how agents are evaluated (specifically in settings S1.4 and S2.2), we noticed a few critical discrepancies between the paper's claims and the actual codebase implementation. We would appreciate your clarifications on the following points:
1. Strict metric interpretability under missing unit tests
The paper frames successful repair under strict criteria and reports headline results over 230 cases (e.g., 23.0% (53/230) in arXiv-2511.11019v1/src/Introdution.tex:59 and arXiv-2511.11019v1/src/Evaluation.tex:119; benchmark overview also states 230 sandboxed CVEs in README.md:32 and docs/HUGGINGFACE_README.md:4).
However, based on patcheval_dataset.json metadata and evaluation logic:
- We counted from dataset fields: among the 230 sandboxed cases (`poc_test_cmd != null`), 176 have `unit_test_cmd` and 54 do not (derived from `poc_test_cmd`/`unit_test_cmd`; the schema explicitly allows `unit_test_cmd` to be null in docs/HUGGINGFACE_README.md:19).
- In the evaluation code, a missing `unit_test.sh` is treated as a pass:
  - existence check: patcheval/evaluation/run_evaluation.py:170
  - if missing: `return True, "No unit-test"` at patcheval/evaluation/run_evaluation.py:173
- Strict success is then assigned when both PoC and `unittest_result` are truthy (patcheval/evaluation/run_evaluation.py:206 to :207).
- The final strict pass-rate denominator is `total_patches = len(patchs)` (patcheval/evaluation/run_evaluation.py:393), passed into the strict summary (patcheval/evaluation/run_evaluation.py:395 to :396), i.e., typically 230 in artifact runs.
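For concreteness, the logic above and the additional 176-case metric we are suggesting can be sketched as follows. This is our own illustration, not code from the artifact; only the field names (`poc_test_cmd`, `unit_test_cmd`) come from the dataset schema, while `poc_pass`/`unit_pass` stand in for per-case evaluation outcomes:

```python
# Sketch of the current strict metric vs. a unit-test-only variant.
# Field names follow the dataset schema; poc_pass/unit_pass are
# placeholders for the per-case evaluation results.
def strict_summaries(cases):
    sandboxed = [c for c in cases if c["poc_test_cmd"] is not None]
    with_unit = [c for c in sandboxed if c["unit_test_cmd"] is not None]
    # Current behaviour: a missing unit_test.sh counts as a unit pass
    # (run_evaluation.py:170-173), so those cases are effectively PoC-only.
    strict = sum(
        c["poc_pass"] and (c["unit_pass"] if c["unit_test_cmd"] else True)
        for c in sandboxed
    )
    # Proposed extra report: strict "PoC & Unit" over only the cases
    # that actually ship a regression suite.
    strict_unit = sum(c["poc_pass"] and c["unit_pass"] for c in with_unit)
    return (strict, len(sandboxed)), (strict_unit, len(with_unit))
```

In artifact runs the two denominators would be 230 and 176, respectively.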
Question: Since 54/230 cases have no unit-test script and are effectively treated as unit-pass in strict mode, are there plans to additionally report strict “PoC & Unit” metrics over only the 176 cases that actually support regression testing?
2. Asymmetric Feedback Loop in Agent Settings (S1.4 & S2.2)
The paper emphasizes feedback-driven iterative refinement, but in agent wrappers, feedback appears PoC-only:
- The SWE-Agent feedback template instructs repeated `check_vulnerability` runs (patcheval/exp_agent/sweagent/configs/template_with_feedback.yaml:25 and :27).
- The SWE-Agent `check_vulnerability` tool executes `bash /workspace/vul-run.sh` (patcheval/exp_agent/sweagent/sweagent_diff.patch:422), not `unit_test.sh`.
- The OpenHands feedback settings are explicitly marked "w. feedback & w.o. test func" (patcheval/exp_agent/openhands/shells/run_all_exp2.sh:28, patcheval/exp_agent/openhands/shells/run_all_exp3.sh:28).
- The OpenHands check action also runs `bash /workspace/vul-run.sh` (patcheval/exp_agent/openhands/openhands_diff.patch:6738 and :8739).
We understand that the final evaluation still includes unit tests when available (as discussed in Point 1); our concern here is that the iterative feedback signal appears to be PoC-only, i.e., unit-test feedback is not exposed to the agent even when such tests are available.
Question: Given that iterative repair in practice relies on both security and regression signals, does PoC-only feedback in S1.4/S2.2 deviate from the intended “realistic development workflow” positioning?
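If useful, here is a minimal sketch of what a combined-signal check could look like. The wrapper itself is purely our own illustration; only the script names (`vul-run.sh`, `unit_test.sh`) and the workspace layout come from the artifact:

```python
import os
import subprocess

def check_with_both_signals(workspace="/workspace"):
    """Hypothetical check tool returning PoC *and* regression feedback.
    This is an illustration of the suggested behaviour, not existing code."""
    poc = subprocess.run(
        ["bash", os.path.join(workspace, "vul-run.sh")], capture_output=True
    ).returncode
    lines = [f"PoC check: {'pass' if poc == 0 else 'fail'}"]
    unit = os.path.join(workspace, "unit_test.sh")
    if os.path.exists(unit):  # present in 176 of the 230 sandboxed cases
        rc = subprocess.run(["bash", unit], capture_output=True).returncode
        lines.append(f"Unit tests: {'pass' if rc == 0 else 'fail'}")
    else:
        lines.append("Unit tests: not available for this CVE")
    return "\n".join(lines)
```

With something like this, the agent would receive the same strict signal it is ultimately graded on, whenever the regression suite exists.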
3. Potential Test Leakage
We appreciate the current protections around PoC artifacts. In particular, test.patch is moved out of the agent-visible workspace (e.g., to /tmp/secret) in the feedback setups (e.g., patcheval/exp_agent/sweagent/configs/template_with_feedback.yaml:43, and similarly in OpenHands feedback scripts). We also understand that workspace state is restored between checks, which prevents persistent workspace contamination from temporary git apply operations.
That said, this does not fully remove target leakage: even if agents cannot read the injected test code from test.patch after rollback, they can still read the exact PoC target selector from visible scripts.
- In CVE-2015-1326, `vul-run.sh:6` includes `tests/test_api.py::TestTemplates::test_local`.
- In CVE-2015-8213, `vul-run.sh:6` includes `./runtests.py i18n.tests.FormattingTests.test_format_arbitrary_settings`.
By contrast, broader regression checks are in unit_test.sh:
- In CVE-2015-1326, `unit_test.sh:6` includes `tests/test_api.py`.
- In CVE-2015-8213, `unit_test.sh:6` includes `./runtests.py i18n.tests`.
So, while rollback/hiding helps with code-level contamination, the PoC target name itself is still visible and can act as optimization guidance for agents.
You already hide patch artifacts (e.g., moving test.patch to /tmp/secret), but do not apply the same visibility control to runner scripts such as vul-run.sh. As a result, PoC test selectors remain plaintext-accessible to agents.
A straightforward mitigation is to treat runner scripts as protected artifacts as well (hide/move them and expose only a black-box check_vulnerability interface), with (optional) output redaction to reduce leakage in logs.
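As a concrete (purely illustrative) sketch of that mitigation, assuming the `/tmp/secret` location already used for `test.patch` — the function names, verdict strings, and redaction pattern below are ours, not from the artifact:

```python
import re
import shutil
import subprocess

# Illustrative pattern for pytest-style node IDs and runtests.py selectors;
# real runners would likely need per-ecosystem patterns.
SELECTOR = re.compile(r"\S+::\S+|runtests\.py\s+\S+")

def hide_runner(workspace="/workspace", secret="/tmp/secret"):
    """Move vul-run.sh out of the agent-visible workspace,
    mirroring what is already done for test.patch."""
    shutil.move(f"{workspace}/vul-run.sh", f"{secret}/vul-run.sh")

def check_vulnerability(secret="/tmp/secret"):
    """Black-box check: the agent sees only a verdict plus redacted logs,
    never the PoC target selector itself."""
    proc = subprocess.run(
        ["bash", f"{secret}/vul-run.sh"], capture_output=True, text=True
    )
    verdict = "check passed" if proc.returncode == 0 else "check failed"
    redacted = SELECTOR.sub("<redacted-test-id>", proc.stdout + proc.stderr)
    return verdict, redacted
```

This keeps the feedback loop intact while removing the test selector as an optimization target.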
In addition, CVE-specific path naming (e.g., /workspace/PoC_env/CVE-2015-8213/...) may unintentionally provide extra retrieval anchors for agents.
4. Insufficient Documentation Regarding Agent Environment and Feedback
The current documentation appears fragmented: some agent-specific files (e.g., under exp_agent) include hints such as "Test feedback" / "w_test_wo_func", but neither the top-level README nor the paper explicitly states that feedback in S1.4/S2.2 is PoC-oriented and does not provide unit-test feedback to agents.
Question: Could you clarify this in the top-level README (and optionally the paper artifact notes) by explicitly documenting the execution constraints, file visibility rules, and the feedback scope used in agent runs?
5. Ambiguous Leaderboard Categorization (Missing the "Feedback" Dimension)
(Leaderboard captured 2026/03/07, around 1 PM.)
The current public leaderboard categorizes results strictly under "Location Oracle" and "End-to-End". However, the paper's ablation study shows massive performance gaps based on the presence of feedback (e.g., SWE-Agent jumping from 32 in S2.1 to 87 in S2.2).
Question: Without specifying whether the leaderboard scores reflect the single-turn setting (S1.1/S2.1) or the multi-turn feedback setting (S1.4/S2.2), it is difficult for the community to accurately interpret the rankings. Could you please add explicit columns or tags on the leaderboard to distinguish between "Single-turn (No Feedback)" and "Multi-turn (With Feedback)"? Furthermore, it would be helpful to explicitly state on the leaderboard that "Feedback" currently implies PoC-only feedback.
6. Missing Claude-on-Claude Baseline for the Claude Code Compatibility Claim
The current interpretation of Claude Code's poor cross-model performance seems under-supported, because the paper does not report the most important control: a Claude Code run backed by a Claude-family model.
In arXiv v1, Section 4 explicitly says the agent study only pairs each agent with the top three standalone models (GPT-4.1, Gemini 2.5, and DeepSeek-R1), and Table 6 correspondingly includes only those three columns, yet the paper then attributes Claude Code’s degradation to possible cross-model incompatibility and says it may be optimized for Claude-series models.
The released artifact is consistent with that omission: the patcheval/log/agent/claudecode logs only contain Gemini/GPT/DeepSeek/Doubao runs, with no Claude-backed Claude Code result.
Moreover, the LaTeX source still contains commented-out Claude4 rows and even a commented sentence referring to Claude-Code-Claude4, which suggests this baseline may have existed at some point but was not included in the public version.
As of March 8, 2026, the live PatchEval leaderboard still shows Claude Code paired with Gemini 2.5, GPT-4.1, DeepSeek R1, and later GPT-5, but no Claude-family model.
Without an in-family Claude Code + Claude baseline, the claim that the observed degradation is mainly due to cross-model mismatch is difficult to validate; please either publish that baseline (if it exists) or soften this interpretation and present it more clearly as an untested hypothesis.
Thank you for your time and for contributing this massive dataset to the community. We look forward to your insights on these evaluation details!
Best regards,
Pser