[data] feat: add support for swe bench multilingual #55
Code Review
This pull request introduces support for the SWE-bench Multilingual dataset, including a preprocessing script, a new reward specification, and registration of the multilingual reward. It also refactors the parallel verification script to support progress streaming, improves test file resetting in the standard SWE-bench reward, and tunes Modal deployment configurations. The review feedback highlights three issues: class-level instantiation of 'asyncio.Semaphore' in the Ray actor creates the semaphore outside the actor's event loop, which can surface as a 'RuntimeError' when it is later used from a different loop; using check='ignore' during patch application silently masks failures; and direct dictionary access on 'MAP_REPO_TO_EXT' risks a 'KeyError' for repositories missing from the mapping.
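The 'KeyError' concern can be illustrated with a small sketch. The mapping contents and the helper name ext_for_repo below are hypothetical and only stand in for the MAP_REPO_TO_EXT lookup mentioned in the review; the point is to turn a bare 'KeyError' into an explicit, descriptive failure.

```python
# Hypothetical subset of a repo -> file-extension mapping; the real
# MAP_REPO_TO_EXT referenced in the review may contain different entries.
MAP_REPO_TO_EXT = {
    "example/go-project": "go",
    "example/rust-project": "rs",
}


def ext_for_repo(repo: str) -> str:
    """Look up the extension for a repo, failing with a clear message."""
    ext = MAP_REPO_TO_EXT.get(repo)  # .get() avoids a bare KeyError
    if ext is None:
        known = ", ".join(sorted(MAP_REPO_TO_EXT))
        raise ValueError(f"no extension registered for {repo!r}; known repos: {known}")
    return ext
```

Whether to raise or fall back to a default extension is a design choice the PR author would have to make; raising keeps unsupported repositories visible rather than grading them with the wrong file filter.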
last_error: Exception | None = None
for cmd in commands:
    try:
        await self.env.communicate(cmd, check="ignore")
Using check="ignore" here causes self.env.communicate to silently succeed even if the patch application command fails (returns a non-zero exit code). As a result, the loop will immediately log "Applied patch successfully!" and return, without ever trying the fallback patch application strategies or raising an error if all strategies fail.

Change check="ignore" to check="raise" so that failed patch attempts raise a RuntimeError and trigger the fallback commands.
Suggested change:
- await self.env.communicate(cmd, check="ignore")
+ await self.env.communicate(cmd, check="raise")
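The fallback behaviour the comment asks for can be sketched as follows. FakeEnv, apply_patch, and the example commands are hypothetical stand-ins, not the PR's actual classes; the sketch only assumes that communicate(cmd, check="raise") raises RuntimeError when a command fails, as the review states.

```python
import asyncio
from typing import List, Optional, Set


class FakeEnv:
    """Hypothetical stand-in for the sandbox environment in the PR."""

    def __init__(self, failing: Set[str]) -> None:
        self.failing = failing
        self.seen: List[str] = []

    async def communicate(self, cmd: str, check: str = "raise") -> str:
        self.seen.append(cmd)
        # With check="ignore" a failing command would return normally,
        # which is exactly what hides the failure in the reviewed code.
        if cmd in self.failing and check == "raise":
            raise RuntimeError(f"command failed: {cmd}")
        return ""


async def apply_patch(env: FakeEnv, commands: List[str]) -> str:
    """Try each strategy in order and return the first one that succeeds."""
    last_error: Optional[Exception] = None
    for cmd in commands:
        try:
            await env.communicate(cmd, check="raise")
            return cmd
        except RuntimeError as exc:
            last_error = exc  # remember the failure, then try the next strategy
    raise RuntimeError("all patch strategies failed") from last_error
```

With check="raise", a failing first command falls through to the next one, and the caller gets an exception only when every strategy has failed.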
_semaphore = asyncio.Semaphore(max(1, GLOBAL_CONCURRENCY // NUM_WORKERS))

async def run_single(self, sample):
    async with self._semaphore:
        return await run_sample(sample)
Instantiating asyncio.Semaphore at the class level creates it at class-definition time, outside of any running event loop. On Python 3.9 and earlier the semaphore binds to whichever loop asyncio.get_event_loop() returns at construction, and because Ray actors run in their own event loops, that can be the wrong loop, leading to RuntimeError: Task got Future attached to a different loop. On Python 3.10+ the semaphore binds to a loop lazily on first contended use, so construction itself does not fail, but a class-level instance is still shared state that can only ever be used from one loop.

To fix this, lazily initialize the semaphore inside the actor's methods or inside the actor's initialization context.
def __init__(self):
    self._semaphore = None

async def run_single(self, sample):
    if self._semaphore is None:
        self._semaphore = asyncio.Semaphore(max(1, GLOBAL_CONCURRENCY // NUM_WORKERS))
    async with self._semaphore:
        return await run_sample(sample)
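A self-contained sketch of the lazy-initialization pattern, without Ray, shows that the per-worker cap still holds. The Worker class, the constant values, and the sleep that stands in for run_sample are assumptions made for illustration.

```python
import asyncio
from typing import List, Optional, Tuple

GLOBAL_CONCURRENCY = 4  # assumed value for illustration
NUM_WORKERS = 2         # assumed value for illustration


class Worker:
    """Hypothetical plain-Python stand-in for the Ray actor."""

    def __init__(self) -> None:
        self._semaphore: Optional[asyncio.Semaphore] = None
        self.active = 0  # tasks currently inside the semaphore
        self.peak = 0    # highest concurrency observed

    async def run_single(self, sample: int) -> int:
        # Created on first use, i.e. inside the loop that runs this worker.
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(max(1, GLOBAL_CONCURRENCY // NUM_WORKERS))
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            await asyncio.sleep(0.01)  # placeholder for the real run_sample
            self.active -= 1
            return sample * 2


async def main() -> Tuple[List[int], int]:
    worker = Worker()
    results = await asyncio.gather(*(worker.run_single(i) for i in range(6)))
    return list(results), worker.peak
```

Because there is no await between the None check and the assignment, two coroutines on the same loop cannot both create a semaphore, so the lazy check needs no extra lock.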
Add SWE-bench Multilingual support
Adds end-to-end support for the SWE-bench/SWE-bench_Multilingual dataset (300 instances, 7 languages: c/go/java/js/php/ruby/rust), graded via the official swebench harness.

Changes
- swe_bench_multilingual reward spec + registration in registry.py / reward/__init__.py.
- examples/data_preprocess/swe_bench_multilingual.py: emits the SWE-agent format using the public swebench/sweb.eval.* images, and caps each Modal sandbox at 4 cores / 8 GiB via modal_sandbox_kwargs.
- Modal deployment (deployment.py):
  - add DEBIAN_FRONTEND=noninteractive to apt_install so non-Python eval images don't hang on a tzdata prompt during image build;
  - coerce cpu/memory list→tuple, since a parquet round-trip turns the (request, limit) tuples into lists that Modal would otherwise silently ignore;
  - use base_commit instead of git checkout master/main, fixing repos whose default branch isn't master/main.
- Parallel verification (parallel_verify_swe.py): tqdm progress + summary, per-instance filtering, and per-case deployment config merged over defaults.

Test plan
- DEPLOYMENT=modal python examples/data_preprocess/swe_bench_multilingual.py produces a 300-row parquet.
- parallel_verify_swe.py resolves the dataset; remaining failures are upstream data/environment issues (e.g. unpinned Rust deps, network-dependent gradle bootstrap), not harness/spec bugs.
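The cpu/memory list→tuple coercion and the "per-case config merged over defaults" behaviour described in the changes could look roughly like the sketch below. The function names, the DEFAULTS values, and the assumption that memory is given in MiB are illustrative only and are not taken from the PR.

```python
from typing import Any, Dict, Optional

# Hypothetical defaults: (request, limit) pairs, memory assumed to be in MiB.
DEFAULTS: Dict[str, Any] = {"cpu": (1.0, 4.0), "memory": (2048, 8192)}


def normalize_resource(value: Any) -> Any:
    """Turn a list left over from a parquet round-trip back into a tuple."""
    return tuple(value) if isinstance(value, list) else value


def merge_deployment_config(
    defaults: Dict[str, Any], per_case: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Overlay per-case settings on the defaults, then fix resource types."""
    merged = {**defaults, **(per_case or {})}
    for key in ("cpu", "memory"):
        if key in merged:
            merged[key] = normalize_resource(merged[key])
    return merged
```

For example, merge_deployment_config(DEFAULTS, {"cpu": [2.0, 4.0]}) keeps the default memory pair and returns the cpu override as a tuple, while scalar values pass through unchanged.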