[deployment] fix: harden Modal sandbox lifecycle at high concurrency by aoshen02 · Pull Request #47 · verl-project/uni-agent

aoshen02 · 2026-05-28T13:50:11Z

What does this PR do?

Three interlocked changes to ModalDeployment to survive multi-day RL
training at agent_concurrency >= 128. All three respond to distinct
failure modes hit on a Qwen3-235B SWE-bench campaign across ~1.4M
trajectories on Modal.

1) Cap concurrent sandboxes in the STARTING state.
Dominant failure at high concurrency is Runtime did not start within 299s — too many sandboxes simultaneously in Modal's runtime-start
pipeline, not the sandbox CREATE rate (Modal Support confirmed 30/s +
2k burst, far above our ~1-2/s). Adds an asyncio.Semaphore keyed on a
fleet-wide MODAL_MAX_STARTING (default 128), divided by
UNIAGENT_NUM_WORKERS to derive each worker's share — same pattern
already used for the per-worker agent_loop semaphore. The permit is
held from sandbox.create through runtime.create_session, then
released so the long tool-call body does not occupy a permit.

2) Bound a single trajectory's init wall-clock.
Pre-patch: max_retries=5 × startup_timeout=300s = up to 25 min of one
trajectory hogging a STARTING permit. Reduce max_retries to 2 and add
MODAL_INIT_WALL_BUDGET (default 900s = 15 min) as a hard cap.
asyncio.wait_for around each _start() prevents a hung
modal.Sandbox.create from blocking past the deadline. When the budget
is exhausted the trajectory raises; the outer agent_loop converts it
into a reward=0 masked sample.

3) Guarantee Modal sandbox termination on teardown.
Observed (round 12, 2026-05-18): self._runtime.close() raised
aiohttp.ServerDisconnectedError when the in-sandbox agent server had
already torn down its socket. Without try/except, stop() returned
early and self._sandbox.terminate.aio() never ran. After thousands of
trajectories ~847 sandboxes were leaked, hitting the account's
concurrent-sandbox cap and 100% failing new creates on subsequent runs.
Wraps each step; guarantees terminate is attempted (and retried once)
even when runtime.close fails.

Checklist Before Starting

Search for similar PRs/issues:
- gh pr list --repo verl-project/uni-agent --state open --search "modal" → none touching modal/deployment.py (PR [deployment,docs] fix: select local OCI runtime endpoint by network path #44 touches local/deployment.py only)
- gh issue list --search "modal sandbox" → only RFC-level ([RFC] Add OpenYuanrong & AKernel sandbox system as deployment backend #31 OpenYuanrong, Support on local deployment for SWE-Bench. #43 local SWE-Bench), nothing on STARTING-state limiter
Format the PR title as [deployment] fix: ... (single module, single type)

Test

Unit tests added in tests/deployment/test_modal_starting_limiter.py.

pytest tests/deployment/test_modal_starting_limiter.py -v
# 12 passed in 64.76s

Test coverage:

Per-worker share math (MAX_STARTING / NUM_WORKERS, incl. clamp-to-1)
Singleton + lazy init of _STARTING_SEMA
max_retries=2 (proves regression from old =5)
Wall-budget short-circuits subsequent attempts when exhausted
asyncio.wait_for cancels a hung _start
Serialization across concurrent _start callers (peak ≤ permits)
Permit released on failure path (no leak → no deadlock)

Production validation: 6+ consecutive 8-hour rollouter sessions
without the "299s runtime not start" cluster-collapse we hit pre-patch;
sandbox leak count returned to ~0 (was steady at 50-100/hr).

API and Usage Example

Three new env-var knobs (all optional; defaults pre-tuned for 8 trays ×
agent_concurrency=360):

export UNIAGENT_NUM_WORKERS=8         # rollouter worker count
export MODAL_MAX_STARTING=128         # fleet-wide STARTING cap
export MODAL_INIT_WALL_BUDGET=900     # per-trajectory init seconds

No public API changes. Existing callers of ModalDeployment.start() /
.stop() work unchanged.

Design & Code Changes

_get_starting_semaphore() — lazy singleton, constructed inside the
event loop on first use; per-worker share computed as
max(1, MAX_STARTING // NUM_WORKERS).
_start() now wraps the entire create-tunnel-runtime block in
async with _get_starting_semaphore(), releasing immediately after
runtime.create_session returns.
start() retry loop: max_retries=5→2, exp-backoff cap 30s→10s,
outer MODAL_INIT_WALL_BUDGET deadline, per-attempt
asyncio.wait_for(_start(), timeout=max(60, remaining)).
stop(): each cleanup step is try/except-wrapped so a transient
failure in runtime.close does not skip sandbox.terminate;
terminate.aio() retried once if first attempt fails.

Checklist Before Submitting

Read the Contribute Guide
pre-commit run --files uni_agent/deployment/modal/deployment.py tests/deployment/test_modal_starting_limiter.py — all hooks passed (ruff, ruff format, mypy, compile-all)
Tests added (tests/deployment/test_modal_starting_limiter.py, 12 tests, runtime-agnostic — no Modal network calls)
AI assistance was used (Claude Code); the submitting human (@aoshen02) reviewed every changed line and ran the test suite.
Not duplicating any open PR (PR [deployment,docs] fix: select local OCI runtime endpoint by network path #44 changes local/deployment.py, not modal/)

Three interlocked changes to make ModalDeployment survive a multi-day RL training run at agent_concurrency >= 128. All three are responses to distinct failure modes we hit on a Qwen3-235B SWE-bench campaign across ~1.4M trajectories on Modal. 1) Cap concurrent sandboxes in the STARTING state. The dominant failure mode at high concurrency is "Runtime did not start within 299s" — too many sandboxes simultaneously in Modal's runtime-start pipeline, not the sandbox CREATE rate (Modal confirmed 30/s + 2k burst, vastly above our ~1-2/s). Add an asyncio.Semaphore keyed on a fleet-wide MODAL_MAX_STARTING (default 128), divided by UNIAGENT_NUM_WORKERS to derive each worker's share — same pattern already used for the per-worker agent_loop semaphore. The permit is held from sandbox.create through runtime alive, then released so the long tool-call body does not occupy a permit. 2) Bound a single trajectory's init wall-clock. Pre-patch: max_retries=5 * startup_timeout=300s = up to 25 minutes of one trajectory hoarding a STARTING permit (or pre-patch, hogging the global Modal cold-start budget). Reduce max_retries to 2 and add MODAL_INIT_WALL_BUDGET (default 900s = 15 min) as a hard cap. asyncio.wait_for around each _start() attempt prevents a hung modal.Sandbox.create from blocking past the deadline. When the budget is exhausted the trajectory raises; the outer agent_loop converts it into a reward=0 masked sample. 3) Guarantee Modal sandbox termination on teardown. Observed (round 12, 2026-05-18): self._runtime.close() raised aiohttp.ServerDisconnectedError when the in-sandbox agent server had already torn down its socket. Without try/except, stop() returned early and self._sandbox.terminate.aio() never ran. After thousands of trajectories ~847 sandboxes were leaked on the Modal side, hitting the account's concurrent-sandbox cap and 100% failing new sandbox creates on subsequent runs. Wrap each step; guarantee terminate is attempted (and retried once) even when runtime.close fails. Env-var knobs (all optional, defaults pre-tuned for our 8 trays * agent_concurrency=360 campaign): UNIAGENT_NUM_WORKERS=8 (rollouter worker count) MODAL_MAX_STARTING=128 (fleet-wide STARTING cap) MODAL_INIT_WALL_BUDGET=900 (per-trajectory init seconds) Test plan pytest tests/deployment/test_modal_starting_limiter.py -v 12 tests covering: - per-worker share math (incl. clamp-to-1) - singleton + lazy init - max_retries=2 (not 5) - wall-budget short-circuits subsequent attempts - wait_for cancels a hung _start - serialization across concurrent _start callers - permit released on failure path All 12 passed in 65s. Production validation: ran 6+ consecutive 8-hour rollouter sessions without the "299s runtime not start" cluster-collapse pattern we hit pre-patch; sandbox leak count returned to ~0 (was steady at 50-100/hr). This PR includes AI assistance (Claude Code). Signed-off-by: aoshen02 <aoshen@inferact.ai> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements a Modal cold-start fleet limiter to prevent sandbox startup timeouts under high concurrency. It introduces a lazy-initialized semaphore to cap concurrent starting sandboxes per worker, reduces start retries from 5 to 2, enforces a wall-clock budget, and makes sandbox teardown more robust to prevent resource leaks. A comprehensive test suite is also added. The review feedback recommends extracting the hardcoded 60-second timeout into a module-level constant to speed up unit tests, and defensively guarding against a potential division-by-zero error if the number of workers is misconfigured to zero.

gemini-code-assist · 2026-05-28T13:51:50Z

+_NUM_WORKERS = int(os.getenv("UNIAGENT_NUM_WORKERS", "8"))
+_MAX_STARTING_GLOBAL = int(os.getenv("MODAL_MAX_STARTING", "128"))
+_INIT_WALL_BUDGET = float(os.getenv("MODAL_INIT_WALL_BUDGET", "900"))
+_STARTING_SEMA: asyncio.Semaphore | None = None


Introduce a module-level constant _MIN_ATTEMPT_TIMEOUT instead of hardcoding 60.0 seconds inside the start method. This allows unit tests to monkeypatch the timeout to a very small value, drastically reducing the test suite execution time from over 60 seconds to under a second.

Suggested change

_NUM_WORKERS = int(os.getenv("UNIAGENT_NUM_WORKERS", "8"))

_MAX_STARTING_GLOBAL = int(os.getenv("MODAL_MAX_STARTING", "128"))

_INIT_WALL_BUDGET = float(os.getenv("MODAL_INIT_WALL_BUDGET", "900"))

_STARTING_SEMA: asyncio.Semaphore | None = None

_NUM_WORKERS = int(os.getenv("UNIAGENT_NUM_WORKERS", "8"))

_MAX_STARTING_GLOBAL = int(os.getenv("MODAL_MAX_STARTING", "128"))

_INIT_WALL_BUDGET = float(os.getenv("MODAL_INIT_WALL_BUDGET", "900"))

_MIN_ATTEMPT_TIMEOUT = 60.0

_STARTING_SEMA: asyncio.Semaphore | None = None

gemini-code-assist · 2026-05-28T13:51:50Z

+    if _STARTING_SEMA is None:
+        per_worker = max(1, _MAX_STARTING_GLOBAL // _NUM_WORKERS)
+        _STARTING_SEMA = asyncio.Semaphore(per_worker)


Defensively guard against ZeroDivisionError or unexpected behavior if _NUM_WORKERS is misconfigured to 0 or a negative value.

Suggested change

if _STARTING_SEMA is None:

per_worker = max(1, _MAX_STARTING_GLOBAL // _NUM_WORKERS)

_STARTING_SEMA = asyncio.Semaphore(per_worker)

if _STARTING_SEMA is None:

num_workers = max(1, _NUM_WORKERS)

per_worker = max(1, _MAX_STARTING_GLOBAL // num_workers)

_STARTING_SEMA = asyncio.Semaphore(per_worker)

gemini-code-assist · 2026-05-28T13:51:50Z

+                break
            try:
-                await self._start()
+                await asyncio.wait_for(self._start(), timeout=max(60.0, remaining))


Use the newly introduced _MIN_ATTEMPT_TIMEOUT constant instead of the hardcoded 60.0 value.

Suggested change

await asyncio.wait_for(self._start(), timeout=max(60.0, remaining))

await asyncio.wait_for(self._start(), timeout=max(_MIN_ATTEMPT_TIMEOUT, remaining))

gemini-code-assist · 2026-05-28T13:51:51Z

+def _reset_limiter_state(monkeypatch, *, num_workers=8, max_starting=128, wall_budget=900.0):
+    """Pin module constants to known values and force semaphore re-init.
+
+    Module-level constants are read at import; we mutate them with
+    monkeypatch.setattr so they revert after the test. `_STARTING_SEMA`
+    is the singleton cache -- clear it so the next call rebuilds with
+    the patched values.
+    """
+    monkeypatch.setattr(mod, "_NUM_WORKERS", num_workers, raising=True)
+    monkeypatch.setattr(mod, "_MAX_STARTING_GLOBAL", max_starting, raising=True)
+    monkeypatch.setattr(mod, "_INIT_WALL_BUDGET", wall_budget, raising=True)
+    monkeypatch.setattr(mod, "_STARTING_SEMA", None, raising=True)


Update _reset_limiter_state to monkeypatch _MIN_ATTEMPT_TIMEOUT to a small default value (e.g., 0.1 seconds) so that tests involving timeouts do not block for 60 seconds.

Suggested change

def _reset_limiter_state(monkeypatch, *, num_workers=8, max_starting=128, wall_budget=900.0):

"""Pin module constants to known values and force semaphore re-init.

Module-level constants are read at import; we mutate them with

monkeypatch.setattr so they revert after the test. `_STARTING_SEMA`

is the singleton cache -- clear it so the next call rebuilds with

the patched values.

"""

monkeypatch.setattr(mod, "_NUM_WORKERS", num_workers, raising=True)

monkeypatch.setattr(mod, "_MAX_STARTING_GLOBAL", max_starting, raising=True)

monkeypatch.setattr(mod, "_INIT_WALL_BUDGET", wall_budget, raising=True)

monkeypatch.setattr(mod, "_STARTING_SEMA", None, raising=True)

def _reset_limiter_state(monkeypatch, *, num_workers=8, max_starting=128, wall_budget=900.0, min_attempt_timeout=0.1):

"""Pin module constants to known values and force semaphore re-init.

Module-level constants are read at import; we mutate them with

monkeypatch.setattr so they revert after the test. `_STARTING_SEMA`

is the singleton cache -- clear it so the next call rebuilds with

the patched values.

"""

monkeypatch.setattr(mod, "_NUM_WORKERS", num_workers, raising=True)

monkeypatch.setattr(mod, "_MAX_STARTING_GLOBAL", max_starting, raising=True)

monkeypatch.setattr(mod, "_INIT_WALL_BUDGET", wall_budget, raising=True)

monkeypatch.setattr(mod, "_MIN_ATTEMPT_TIMEOUT", min_attempt_timeout, raising=True)

monkeypatch.setattr(mod, "_STARTING_SEMA", None, raising=True)

gemini-code-assist · 2026-05-28T13:51:51Z

+    # Verify we don't actually wait the full 60s wait_for floor: that
+    # depends on Python's asyncio.wait_for behavior; with budget < 60 we
+    # rely on the OUTER loop's deadline check after attempt 1 to bail.
+    # NOTE: this test mainly proves no infinite hang.
+    assert elapsed < 90.0, f"start() must not hang forever, elapsed={elapsed:.1f}s"


With the shortened _MIN_ATTEMPT_TIMEOUT, we can tighten the assertion to ensure that the timeout is triggered quickly (e.g., within 2 seconds) rather than waiting up to 90 seconds.

Suggested change

# Verify we don't actually wait the full 60s wait_for floor: that

# depends on Python's asyncio.wait_for behavior; with budget < 60 we

# rely on the OUTER loop's deadline check after attempt 1 to bail.

# NOTE: this test mainly proves no infinite hang.

assert elapsed < 90.0, f"start() must not hang forever, elapsed={elapsed:.1f}s"

# Verify we don't actually wait the full 60s wait_for floor: that

# depends on Python's asyncio.wait_for behavior; with budget < 60 we

# rely on the OUTER loop's deadline check after attempt 1 to bail.

# NOTE: this test mainly proves no infinite hang.

assert elapsed < 2.0, f"start() must not hang forever, elapsed={elapsed:.1f}s"

gemini-code-assist Bot reviewed May 28, 2026

View reviewed changes

yyDing1 added 2 commits June 4, 2026 15:38

update

7e7df30

update

ec6134d

yyDing1 merged commit 93f8c12 into verl-project:main Jun 4, 2026
3 checks passed

yyDing1 deleted the feat/modal-sandbox-lifecycle branch June 4, 2026 07:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[deployment] fix: harden Modal sandbox lifecycle at high concurrency#47

[deployment] fix: harden Modal sandbox lifecycle at high concurrency#47
yyDing1 merged 3 commits into
verl-project:mainfrom
aoshen02:feat/modal-sandbox-lifecycle

aoshen02 commented May 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

gemini-code-assist Bot May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	await asyncio.wait_for(self._start(), timeout=max(60.0, remaining))
	await asyncio.wait_for(self._start(), timeout=max(_MIN_ATTEMPT_TIMEOUT, remaining))

Conversation

aoshen02 commented May 28, 2026

What does this PR do?

Checklist Before Starting

Test

API and Usage Example

Design & Code Changes

Checklist Before Submitting

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants