[deployment] fix: harden Modal sandbox lifecycle at high concurrency#47
Three interlocked changes to make ModalDeployment survive a multi-day
RL training run at agent_concurrency >= 128. All three are responses to
distinct failure modes we hit on a Qwen3-235B SWE-bench campaign across
~1.4M trajectories on Modal.
1) Cap concurrent sandboxes in the STARTING state.
The dominant failure mode at high concurrency is "Runtime did not
start within 299s" — too many sandboxes simultaneously in Modal's
runtime-start pipeline, not the sandbox CREATE rate (Modal confirmed
30/s + 2k burst, vastly above our ~1-2/s). Add an asyncio.Semaphore
keyed on a fleet-wide MODAL_MAX_STARTING (default 128), divided by
UNIAGENT_NUM_WORKERS to derive each worker's share — same pattern
already used for the per-worker agent_loop semaphore. The permit is
held from sandbox.create through runtime alive, then released so
the long tool-call body does not occupy a permit.
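A minimal sketch of the permit scope described in 1), not the code in this PR: the semaphore is built lazily inside the running loop and held only around the start phase. The names get_starting_semaphore, run_trajectory and demo, the sleep durations, and the 4-permit/2-worker sizing are all illustrative assumptions.

```python
import asyncio
from typing import Optional

_sema: Optional[asyncio.Semaphore] = None


def get_starting_semaphore(max_starting: int = 128, num_workers: int = 8) -> asyncio.Semaphore:
    """Lazily build this worker's share of the fleet-wide STARTING cap."""
    global _sema
    if _sema is None:
        # Only the first call sizes the semaphore; later calls reuse it.
        _sema = asyncio.Semaphore(max(1, max_starting // max(1, num_workers)))
    return _sema


async def run_trajectory(stats: dict, boot_s: float = 0.01, work_s: float = 0.03) -> None:
    # Hypothetical sizing: 4 fleet-wide permits over 2 workers = 2 per worker.
    async with get_starting_semaphore(max_starting=4, num_workers=2):
        stats["starting"] += 1
        stats["peak_starting"] = max(stats["peak_starting"], stats["starting"])
        await asyncio.sleep(boot_s)  # stand-in for sandbox create + runtime start
        stats["starting"] -= 1
    # Permit released here: the long tool-call phase is not throttled.
    stats["running"] += 1
    stats["peak_running"] = max(stats["peak_running"], stats["running"])
    await asyncio.sleep(work_s)  # stand-in for the tool-call body
    stats["running"] -= 1


async def demo(n: int = 8) -> dict:
    stats = {"starting": 0, "peak_starting": 0, "running": 0, "peak_running": 0}
    await asyncio.gather(*(run_trajectory(stats) for _ in range(n)))
    return stats
```

Running asyncio.run(demo()) should show peak_starting capped at the permit count while peak_running is free to climb higher, which is the point of releasing the permit before the tool-call body.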
2) Bound a single trajectory's init wall-clock.
Pre-patch: max_retries=5 * startup_timeout=300s = up to 25 minutes
of one trajectory hogging the global Modal cold-start budget (with
the limiter in place, it would hoard a STARTING permit for that
long). Reduce max_retries to 2 and add
MODAL_INIT_WALL_BUDGET (default 900s = 15 min) as a hard cap.
asyncio.wait_for around each _start() attempt prevents a hung
modal.Sandbox.create from blocking past the deadline. When the
budget is exhausted the trajectory raises; the outer agent_loop
converts it into a reward=0 masked sample.
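The retry bound in 2) can be sketched roughly as follows. This is an assumption-laden illustration, not the PR's start(): start_with_budget and its parameters are hypothetical names, backoff between attempts is omitted, and the max(floor, remaining) timeout mirrors the expression shown in the review diff.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


async def start_with_budget(
    start_once: Callable[[], Awaitable[None]],
    *,
    max_retries: int = 2,
    wall_budget: float = 900.0,
    min_attempt_timeout: float = 60.0,
) -> int:
    """Run start_once until it succeeds; return the number of attempts used.

    Raises RuntimeError once the retries or the wall-clock budget run out.
    """
    deadline = time.monotonic() + wall_budget
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget exhausted: skip any further attempts
        try:
            # wait_for cancels a hung start_once when the timeout fires.
            # Note the floor: the timeout can exceed the remaining budget.
            await asyncio.wait_for(
                start_once(), timeout=max(min_attempt_timeout, remaining)
            )
            return attempt
        except Exception as exc:  # includes asyncio.TimeoutError
            last_error = exc
    raise RuntimeError(
        f"sandbox init did not finish within the {wall_budget:.0f}s budget"
    ) from last_error
```

Because the per-attempt timeout is floored, an attempt that begins just before the deadline can still run for the full floor, so under these assumptions the budget bounds when attempts may start rather than when the last one ends.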
3) Guarantee Modal sandbox termination on teardown.
Observed (round 12, 2026-05-18): self._runtime.close() raised
aiohttp.ServerDisconnectedError when the in-sandbox agent server
had already torn down its socket. Without try/except, stop()
returned early and self._sandbox.terminate.aio() never ran. After
thousands of trajectories ~847 sandboxes were leaked on the Modal
side, hitting the account's concurrent-sandbox cap and 100%
failing new sandbox creates on subsequent runs. Wrap each step;
guarantee terminate is attempted (and retried once) even when
runtime.close fails.
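A rough sketch of the teardown ordering in 3), assuming duck-typed objects with an async close() and an async terminate(). stop_sandbox, FlakyRuntime and FlakySandbox are hypothetical stand-ins and do not use the Modal client API.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def stop_sandbox(runtime, sandbox, *, terminate_attempts: int = 2) -> bool:
    """Close the runtime, then always try to terminate the sandbox.

    Returns True if a terminate attempt succeeded.
    """
    try:
        await runtime.close()
    except Exception:
        # A dead in-sandbox server must not prevent the terminate below.
        logger.warning("runtime.close failed; continuing to terminate", exc_info=True)

    for attempt in range(1, terminate_attempts + 1):
        try:
            await sandbox.terminate()
            return True
        except Exception:
            logger.warning("terminate attempt %d failed", attempt, exc_info=True)
    return False


class FlakyRuntime:
    """Fake runtime whose close() always fails, like a torn-down socket."""

    async def close(self) -> None:
        raise ConnectionError("server disconnected")


class FlakySandbox:
    """Fake sandbox whose terminate() fails the first `failures` calls."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    async def terminate(self) -> None:
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("transient terminate failure")
```

With FlakySandbox(1), asyncio.run(stop_sandbox(FlakyRuntime(), sandbox)) reaches terminate despite the failed close and succeeds on the single retry.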
Env-var knobs (all optional, defaults pre-tuned for our 8 trays *
agent_concurrency=360 campaign):
UNIAGENT_NUM_WORKERS=8 (rollouter worker count)
MODAL_MAX_STARTING=128 (fleet-wide STARTING cap)
MODAL_INIT_WALL_BUDGET=900 (per-trajectory init seconds)
Test plan
pytest tests/deployment/test_modal_starting_limiter.py -v
12 tests covering:
- per-worker share math (incl. clamp-to-1)
- singleton + lazy init
- max_retries=2 (not 5)
- wall-budget short-circuits subsequent attempts
- wait_for cancels a hung _start
- serialization across concurrent _start callers
- permit released on failure path
All 12 passed in 65s.
Production validation: ran 6+ consecutive 8-hour rollouter sessions
without the "299s runtime not start" cluster-collapse pattern we hit
pre-patch; sandbox leak count returned to ~0 (was steady at 50-100/hr).
This PR includes AI assistance (Claude Code).
Signed-off-by: aoshen02 <aoshen@inferact.ai>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request implements a Modal cold-start fleet limiter to prevent sandbox startup timeouts under high concurrency. It introduces a lazy-initialized semaphore to cap concurrent starting sandboxes per worker, reduces start retries from 5 to 2, enforces a wall-clock budget, and makes sandbox teardown more robust to prevent resource leaks. A comprehensive test suite is also added. The review feedback recommends extracting the hardcoded 60-second timeout into a module-level constant to speed up unit tests, and defensively guarding against a potential division-by-zero error if the number of workers is misconfigured to zero.
  _NUM_WORKERS = int(os.getenv("UNIAGENT_NUM_WORKERS", "8"))
  _MAX_STARTING_GLOBAL = int(os.getenv("MODAL_MAX_STARTING", "128"))
  _INIT_WALL_BUDGET = float(os.getenv("MODAL_INIT_WALL_BUDGET", "900"))
  _STARTING_SEMA: asyncio.Semaphore | None = None
Introduce a module-level constant _MIN_ATTEMPT_TIMEOUT instead of hardcoding 60.0 seconds inside the start method. This allows unit tests to monkeypatch the timeout to a very small value, drastically reducing the test suite execution time from over 60 seconds to under a second.
  _NUM_WORKERS = int(os.getenv("UNIAGENT_NUM_WORKERS", "8"))
  _MAX_STARTING_GLOBAL = int(os.getenv("MODAL_MAX_STARTING", "128"))
  _INIT_WALL_BUDGET = float(os.getenv("MODAL_INIT_WALL_BUDGET", "900"))
+ _MIN_ATTEMPT_TIMEOUT = 60.0
  _STARTING_SEMA: asyncio.Semaphore | None = None
  if _STARTING_SEMA is None:
      per_worker = max(1, _MAX_STARTING_GLOBAL // _NUM_WORKERS)
      _STARTING_SEMA = asyncio.Semaphore(per_worker)
Defensively guard against ZeroDivisionError or unexpected behavior if _NUM_WORKERS is misconfigured to 0 or a negative value.
  if _STARTING_SEMA is None:
-     per_worker = max(1, _MAX_STARTING_GLOBAL // _NUM_WORKERS)
+     num_workers = max(1, _NUM_WORKERS)
+     per_worker = max(1, _MAX_STARTING_GLOBAL // num_workers)
      _STARTING_SEMA = asyncio.Semaphore(per_worker)
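For reference, the guarded share math can be pulled into a pure function and checked in isolation. This is only a sketch: per_worker_share and share_from_env are hypothetical helper names, and the defaults are the ones quoted in the PR text.

```python
import os
from typing import Mapping


def per_worker_share(max_starting: int, num_workers: int) -> int:
    """Split a fleet-wide cap across workers, never returning less than 1."""
    num_workers = max(1, num_workers)  # guards 0 and negative worker counts
    return max(1, max_starting // num_workers)


def share_from_env(environ: Mapping[str, str] = os.environ) -> int:
    """Read the two knobs from the environment, using the PR's defaults."""
    return per_worker_share(
        int(environ.get("MODAL_MAX_STARTING", "128")),
        int(environ.get("UNIAGENT_NUM_WORKERS", "8")),
    )
```

With the defaults this gives 128 // 8 = 16 permits per worker; a cap smaller than the worker count clamps to 1, and a worker count of 0 is treated as 1 instead of raising ZeroDivisionError. Non-integer env values would still raise ValueError here.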
      break
  try:
-     await self._start()
+     await asyncio.wait_for(self._start(), timeout=max(60.0, remaining))
  def _reset_limiter_state(monkeypatch, *, num_workers=8, max_starting=128, wall_budget=900.0):
      """Pin module constants to known values and force semaphore re-init.

      Module-level constants are read at import; we mutate them with
      monkeypatch.setattr so they revert after the test. `_STARTING_SEMA`
      is the singleton cache -- clear it so the next call rebuilds with
      the patched values.
      """
      monkeypatch.setattr(mod, "_NUM_WORKERS", num_workers, raising=True)
      monkeypatch.setattr(mod, "_MAX_STARTING_GLOBAL", max_starting, raising=True)
      monkeypatch.setattr(mod, "_INIT_WALL_BUDGET", wall_budget, raising=True)
      monkeypatch.setattr(mod, "_STARTING_SEMA", None, raising=True)
Update _reset_limiter_state to monkeypatch _MIN_ATTEMPT_TIMEOUT to a small default value (e.g., 0.1 seconds) so that tests involving timeouts do not block for 60 seconds.
- def _reset_limiter_state(monkeypatch, *, num_workers=8, max_starting=128, wall_budget=900.0):
+ def _reset_limiter_state(monkeypatch, *, num_workers=8, max_starting=128, wall_budget=900.0, min_attempt_timeout=0.1):
      """Pin module constants to known values and force semaphore re-init.
      Module-level constants are read at import; we mutate them with
      monkeypatch.setattr so they revert after the test. `_STARTING_SEMA`
      is the singleton cache -- clear it so the next call rebuilds with
      the patched values.
      """
      monkeypatch.setattr(mod, "_NUM_WORKERS", num_workers, raising=True)
      monkeypatch.setattr(mod, "_MAX_STARTING_GLOBAL", max_starting, raising=True)
      monkeypatch.setattr(mod, "_INIT_WALL_BUDGET", wall_budget, raising=True)
+     monkeypatch.setattr(mod, "_MIN_ATTEMPT_TIMEOUT", min_attempt_timeout, raising=True)
      monkeypatch.setattr(mod, "_STARTING_SEMA", None, raising=True)
  # Verify we don't actually wait the full 60s wait_for floor: that
  # depends on Python's asyncio.wait_for behavior; with budget < 60 we
  # rely on the OUTER loop's deadline check after attempt 1 to bail.
  # NOTE: this test mainly proves no infinite hang.
  assert elapsed < 90.0, f"start() must not hang forever, elapsed={elapsed:.1f}s"
With the shortened _MIN_ATTEMPT_TIMEOUT, we can tighten the assertion to ensure that the timeout is triggered quickly (e.g., within 2 seconds) rather than waiting up to 90 seconds.
  # Verify we don't actually wait the full 60s wait_for floor: that
  # depends on Python's asyncio.wait_for behavior; with budget < 60 we
  # rely on the OUTER loop's deadline check after attempt 1 to bail.
  # NOTE: this test mainly proves no infinite hang.
- assert elapsed < 90.0, f"start() must not hang forever, elapsed={elapsed:.1f}s"
+ assert elapsed < 2.0, f"start() must not hang forever, elapsed={elapsed:.1f}s"
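The arithmetic behind the 90s and 2s bounds can be worked through with a small sketch. It assumes the loop re-checks the deadline only between attempts and that the first attempt hangs forever; attempt_timeout and hung_start_elapsed are hypothetical helpers, and the budgets used below are illustrative because the excerpt does not show the value the test actually sets.

```python
def attempt_timeout(remaining: float, floor: float = 60.0) -> float:
    """Per-attempt timeout as written in the diff: never below the floor."""
    return max(floor, remaining)


def hung_start_elapsed(wall_budget: float, floor: float = 60.0) -> float:
    """Seconds until start() gives up when the first attempt hangs.

    The first attempt begins with the whole budget remaining, and the
    deadline check that follows it ends the loop.
    """
    return attempt_timeout(wall_budget, floor)
```

Under these assumptions a 5s budget with the 60s floor still takes 60s, which is consistent with the suite's roughly 65s runtime, while a 0.1s floor drops the same case to 5s. The tightened 2s assertion therefore also needs the test's own wall budget to be under 2s.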
What does this PR do?
Three interlocked changes to ModalDeployment to survive multi-day RL
training at agent_concurrency >= 128. All three respond to distinct
failure modes hit on a Qwen3-235B SWE-bench campaign across ~1.4M
trajectories on Modal.
1) Cap concurrent sandboxes in the STARTING state.
Dominant failure at high concurrency is "Runtime did not start within
299s": too many sandboxes simultaneously in Modal's runtime-start
pipeline, not the sandbox CREATE rate (Modal Support confirmed 30/s +
2k burst, far above our ~1-2/s). Adds an asyncio.Semaphore keyed on a
fleet-wide MODAL_MAX_STARTING (default 128), divided by
UNIAGENT_NUM_WORKERS to derive each worker's share; same pattern
already used for the per-worker agent_loop semaphore. The permit is
held from sandbox.create through runtime.create_session, then
released so the long tool-call body does not occupy a permit.
2) Bound a single trajectory's init wall-clock.
Pre-patch: max_retries=5 × startup_timeout=300s = up to 25 min of one
trajectory hogging a STARTING permit. Reduce max_retries to 2 and add
MODAL_INIT_WALL_BUDGET (default 900s = 15 min) as a hard cap.
asyncio.wait_for around each _start() prevents a hung
modal.Sandbox.create from blocking past the deadline. When the budget
is exhausted the trajectory raises; the outer agent_loop converts it
into a reward=0 masked sample.
3) Guarantee Modal sandbox termination on teardown.
Observed (round 12, 2026-05-18): self._runtime.close() raised
aiohttp.ServerDisconnectedError when the in-sandbox agent server had
already torn down its socket. Without try/except, stop() returned
early and self._sandbox.terminate.aio() never ran. After thousands of
trajectories ~847 sandboxes were leaked, hitting the account's
concurrent-sandbox cap and 100% failing new creates on subsequent runs.
Wraps each step; guarantees terminate is attempted (and retried once)
even when runtime.close fails.
Checklist Before Starting
- gh pr list --repo verl-project/uni-agent --state open --search "modal" → none touching modal/deployment.py (PR "[deployment,docs] fix: select local OCI runtime endpoint by network path" #44 touches local/deployment.py only)
- gh issue list --search "modal sandbox" → only RFC-level ("[RFC] Add OpenYuanrong & AKernel sandbox system as deployment backend" #31 OpenYuanrong, "Support on local deployment for SWE-Bench." #43 local SWE-Bench), nothing on STARTING-state limiter
- [deployment] fix: ... (single module, single type)
Test
Unit tests added in tests/deployment/test_modal_starting_limiter.py.
    pytest tests/deployment/test_modal_starting_limiter.py -v   # 12 passed in 64.76s
Test coverage:
- per-worker share math (MAX_STARTING / NUM_WORKERS, incl. clamp-to-1)
- singleton + lazy init of _STARTING_SEMA
- max_retries=2 (regression check against the old value of 5)
- wall-budget short-circuits subsequent attempts
- asyncio.wait_for cancels a hung _start
- serialization across concurrent _start callers (peak ≤ permits)
- permit released on failure path
Production validation: 6+ consecutive 8-hour rollouter sessions
without the "299s runtime not start" cluster-collapse we hit pre-patch;
sandbox leak count returned to ~0 (was steady at 50-100/hr).
API and Usage Example
Three new env-var knobs (all optional; defaults pre-tuned for 8 trays ×
agent_concurrency=360):
UNIAGENT_NUM_WORKERS=8 (rollouter worker count)
MODAL_MAX_STARTING=128 (fleet-wide STARTING cap)
MODAL_INIT_WALL_BUDGET=900 (per-trajectory init seconds)
No public API changes. Existing callers of ModalDeployment.start() /
.stop() work unchanged.
Design & Code Changes
- _get_starting_semaphore(): lazy singleton, constructed inside the event loop on first use; per-worker share computed as max(1, MAX_STARTING // NUM_WORKERS).
- _start() now wraps the entire create-tunnel-runtime block in async with _get_starting_semaphore(), releasing immediately after runtime.create_session returns.
- start() retry loop: max_retries=5→2, exp-backoff cap 30s→10s, outer MODAL_INIT_WALL_BUDGET deadline, per-attempt asyncio.wait_for(_start(), timeout=max(60, remaining)).
- stop(): each cleanup step is try/except-wrapped so a transient failure in runtime.close does not skip sandbox.terminate; terminate.aio() retried once if first attempt fails.
Checklist Before Submitting
- pre-commit run --files uni_agent/deployment/modal/deployment.py tests/deployment/test_modal_starting_limiter.py: all hooks passed (ruff, ruff format, mypy, compile-all)
- tests/deployment/test_modal_starting_limiter.py, 12 tests, runtime-agnostic, no Modal network calls
- local/deployment.py, not modal/