
[data] feat: add support for swe bench multilingual #55

Merged
yyDing1 merged 5 commits into main from swe-bench-multilingual
Jun 7, 2026


yyDing1 commented Jun 7, 2026

Collaborator

Add SWE-bench Multilingual support

Adds end-to-end support for the SWE-bench/SWE-bench_Multilingual dataset (300 instances, 7 languages: c/go/java/js/php/ruby/rust), graded via the official swebench harness.

Changes

  • Reward spec: new swe_bench_multilingual reward spec + registration in registry.py / reward/__init__.py.
  • Data preprocessing: examples/data_preprocess/swe_bench_multilingual.py — emits the SWE-agent format using the public swebench/sweb.eval.* images, and caps each Modal sandbox at 4 cores / 8 GiB via modal_sandbox_kwargs.
  • Modal deployment (deployment.py):
    • pass DEBIAN_FRONTEND=noninteractive to apt_install so non-Python eval images don't hang on a tzdata prompt during image build;
    • coerce cpu/memory list→tuple, since a parquet round-trip turns the (request, limit) tuples into lists that Modal would otherwise silently ignore;
    • raise default max-starting-per-worker 8→64.
  • swe_bench.py: reset only the modified test files to base_commit instead of git checkout master/main, fixing repos whose default branch isn't master/main.
  • Verify script (parallel_verify_swe.py): tqdm progress + summary, per-instance filtering, and per-case deployment config merged over defaults.
  • Docs: add a SWE-bench Multilingual results table.
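The list→tuple coercion described for deployment.py can be illustrated with a minimal sketch. The helper name `coerce_resource_kwargs`, the example values, and the use of JSON as a stand-in for the parquet round-trip are assumptions made for illustration; this is not the code from the PR.

```python
import json


def coerce_resource_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with list-valued cpu/memory turned back into tuples."""
    fixed = dict(kwargs)
    for key in ("cpu", "memory"):
        value = fixed.get(key)
        if isinstance(value, list):
            fixed[key] = tuple(value)
    return fixed


# Hypothetical (request, limit) pairs as they might be written at preprocessing time.
original = {"cpu": (4.0, 4.0), "memory": (8192, 8192)}

# JSON stands in for the parquet round-trip here: both lose the tuple type.
restored = json.loads(json.dumps(original))
fixed = coerce_resource_kwargs(restored)

print(type(restored["cpu"]).__name__, type(fixed["cpu"]).__name__)  # list tuple
```

Coercing at the point of use, rather than at write time, keeps the stored rows unchanged and guards every consumer that reads them back.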

Test plan

  • DEPLOYMENT=modal python examples/data_preprocess/swe_bench_multilingual.py produces a 300-row parquet.
  • Gold-patch verification via parallel_verify_swe.py resolves the dataset; remaining failures are upstream data/environment issues (e.g. unpinned Rust deps, network-dependent gradle bootstrap), not harness/spec bugs.

gemini-code-assist (Bot) left a comment

Contributor


Code Review

This pull request introduces support for the SWE-bench Multilingual dataset, including a preprocessing script, a new reward specification, and registration of the multilingual reward. It also refactors the parallel verification script to support progress streaming, improves test file resetting in the standard SWE-bench reward, and tunes Modal deployment configurations. The review feedback highlights critical issues: class-level instantiation of 'asyncio.Semaphore' in the Ray actor will raise a 'RuntimeError' in Python 3.11+, using check='ignore' during patch application silently masks failures, and direct dictionary access on 'MAP_REPO_TO_EXT' risks a 'KeyError'.
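The 'KeyError' concern about direct dictionary access can be sketched as follows. The mapping contents and the helper `ext_for_repo` are hypothetical; the real 'MAP_REPO_TO_EXT' is defined elsewhere and its entries are not shown in this conversation.

```python
# Hypothetical mapping for illustration only; not the real MAP_REPO_TO_EXT contents.
MAP_REPO_TO_EXT = {"example-org/example-go-repo": "go"}


def ext_for_repo(repo: str) -> str:
    """Look up a repo's file extension, failing with a descriptive error."""
    try:
        return MAP_REPO_TO_EXT[repo]
    except KeyError:
        raise ValueError(f"no file extension registered for repo {repo!r}") from None


print(ext_for_repo("example-org/example-go-repo"))  # go
```

Raising a descriptive error, rather than returning a default extension, is one possible choice here; it surfaces unsupported repos early instead of grading them with the wrong file type.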


last_error: Exception | None = None
for cmd in commands:
    try:
        await self.env.communicate(cmd, check="ignore")

Contributor


critical

Using check="ignore" here causes self.env.communicate to silently succeed even if the patch application command fails (returns a non-zero exit code). As a result, the loop will immediately log "Applied patch successfully!" and return, without ever trying the fallback patch application strategies or raising an error if all strategies fail.

Change check="ignore" to check="raise" so that failed patch attempts raise a RuntimeError and trigger the fallback commands.

Suggested change
- await self.env.communicate(cmd, check="ignore")
+ await self.env.communicate(cmd, check="raise")
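The effect of the two check modes on a fallback loop can be reproduced with a small self-contained sketch. `FakeEnv`, `apply_patch`, and the command strings are hypothetical stand-ins, assuming only that communicate raises a RuntimeError on failure when check="raise" and stays silent when check="ignore".

```python
import asyncio


class FakeEnv:
    """Hypothetical stand-in for the sandbox environment; not the real API."""

    def __init__(self, failing: set):
        self.failing = failing
        self.calls = []

    async def communicate(self, cmd: str, check: str = "ignore") -> str:
        self.calls.append(cmd)
        failed = cmd in self.failing
        if failed and check == "raise":
            raise RuntimeError(f"command failed: {cmd}")
        return "error" if failed else "ok"


async def apply_patch(env: FakeEnv, commands: list, check: str) -> str:
    """Try each command in order and return the first one treated as a success."""
    last_error = None
    for cmd in commands:
        try:
            await env.communicate(cmd, check=check)
            return cmd
        except RuntimeError as exc:
            last_error = exc
    raise RuntimeError("all patch strategies failed") from last_error


commands = ["git apply patch.diff", "patch -p1 -i patch.diff"]  # placeholder commands
failing = {"git apply patch.diff"}

ignored = asyncio.run(apply_patch(FakeEnv(failing), commands, check="ignore"))
raised = asyncio.run(apply_patch(FakeEnv(failing), commands, check="raise"))
print(ignored)  # git apply patch.diff (the failure is reported as a success)
print(raised)   # patch -p1 -i patch.diff (the fallback is reached)
```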

Comment on lines +98 to 102
    _semaphore = asyncio.Semaphore(max(1, GLOBAL_CONCURRENCY // NUM_WORKERS))

    async def run_single(self, sample):
        async with self._semaphore:
            return await run_sample(sample)

Contributor


high

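A class-level asyncio.Semaphore is created at class-definition time, outside any running event loop, and is shared by every instance of the class in the process. On Python 3.10+ the constructor itself succeeds, because the semaphore binds to an event loop only when a caller first has to wait on it; if it is later awaited from a different loop, it raises a RuntimeError reporting that it is bound to a different event loop. On Python 3.9 and earlier the semaphore binds to the loop returned by get_event_loop() at construction, so using it from the Ray actor's own event loop can fail with RuntimeError: Task got Future attached to a different loop.

To avoid both cases, lazily initialize the semaphore inside the actor's methods or inside the actor's initialization context.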

    def __init__(self):
        self._semaphore = None

    async def run_single(self, sample):
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(max(1, GLOBAL_CONCURRENCY // NUM_WORKERS))
        async with self._semaphore:
            return await run_sample(sample)
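The lazy-initialization pattern suggested above can be exercised without Ray in a minimal sketch. The `Worker` class, the concurrency values, and the sleep that stands in for run_sample are assumptions for illustration; the `active`/`peak` counters exist only to show that the per-worker limit holds.

```python
import asyncio

GLOBAL_CONCURRENCY = 8  # illustrative values, not the script's real settings
NUM_WORKERS = 4


class Worker:
    """Plain-Python stand-in for the Ray actor; Ray itself is not used here."""

    def __init__(self):
        self._semaphore = None
        self.active = 0
        self.peak = 0

    async def run_single(self, sample):
        # Created on first use, so it is always built inside the running loop.
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(max(1, GLOBAL_CONCURRENCY // NUM_WORKERS))
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            await asyncio.sleep(0.01)  # placeholder for run_sample(sample)
            self.active -= 1
            return sample * 2


async def main():
    worker = Worker()
    results = await asyncio.gather(*(worker.run_single(i) for i in range(10)))
    return results, worker.peak


results, peak = asyncio.run(main())
print(peak)  # 2
```

There is no await between the None check and the assignment, so concurrent calls on the same loop cannot create two semaphores.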

Comment thread examples/data_preprocess/swe_bench_multilingual.py
yyDing1 merged commit 01d6c43 into main on Jun 7, 2026
4 checks passed
yyDing1 deleted the swe-bench-multilingual branch on June 7, 2026, 07:28