Skip to content

[deployment] fix: harden Modal sandbox lifecycle at high concurrency#47

Merged
yyDing1 merged 3 commits into
verl-project:mainfrom
aoshen02:feat/modal-sandbox-lifecycle
Jun 4, 2026
Merged

[deployment] fix: harden Modal sandbox lifecycle at high concurrency#47
yyDing1 merged 3 commits into
verl-project:mainfrom
aoshen02:feat/modal-sandbox-lifecycle

Conversation

@aoshen02

Copy link
Copy Markdown
Collaborator

What does this PR do?

Three interlocked changes to ModalDeployment to survive multi-day RL
training at agent_concurrency >= 128. All three respond to distinct
failure modes hit on a Qwen3-235B SWE-bench campaign across ~1.4M
trajectories on Modal.

1) Cap concurrent sandboxes in the STARTING state.
Dominant failure at high concurrency is Runtime did not start within 299s — too many sandboxes simultaneously in Modal's runtime-start
pipeline, not the sandbox CREATE rate (Modal Support confirmed 30/s +
2k burst, far above our ~1-2/s). Adds an asyncio.Semaphore keyed on a
fleet-wide MODAL_MAX_STARTING (default 128), divided by
UNIAGENT_NUM_WORKERS to derive each worker's share — same pattern
already used for the per-worker agent_loop semaphore. The permit is
held from sandbox.create through runtime.create_session, then
released so the long tool-call body does not occupy a permit.

2) Bound a single trajectory's init wall-clock.
Pre-patch: max_retries=5 × startup_timeout=300s = up to 25 min of one
trajectory hogging a STARTING permit. Reduce max_retries to 2 and add
MODAL_INIT_WALL_BUDGET (default 900s = 15 min) as a hard cap.
asyncio.wait_for around each _start() prevents a hung
modal.Sandbox.create from blocking past the deadline. When the budget
is exhausted the trajectory raises; the outer agent_loop converts it
into a reward=0 masked sample.

3) Guarantee Modal sandbox termination on teardown.
Observed (round 12, 2026-05-18): self._runtime.close() raised
aiohttp.ServerDisconnectedError when the in-sandbox agent server had
already torn down its socket. Without try/except, stop() returned
early and self._sandbox.terminate.aio() never ran. After thousands of
trajectories ~847 sandboxes were leaked, hitting the account's
concurrent-sandbox cap and 100% failing new creates on subsequent runs.
Wraps each step; guarantees terminate is attempted (and retried once)
even when runtime.close fails.

Checklist Before Starting

Test

Unit tests added in tests/deployment/test_modal_starting_limiter.py.

pytest tests/deployment/test_modal_starting_limiter.py -v
# 12 passed in 64.76s

Test coverage:

  • Per-worker share math (MAX_STARTING / NUM_WORKERS, incl. clamp-to-1)
  • Singleton + lazy init of _STARTING_SEMA
  • max_retries=2 (proves regression from old =5)
  • Wall-budget short-circuits subsequent attempts when exhausted
  • asyncio.wait_for cancels a hung _start
  • Serialization across concurrent _start callers (peak ≤ permits)
  • Permit released on failure path (no leak → no deadlock)

Production validation: 6+ consecutive 8-hour rollouter sessions
without the "299s runtime not start" cluster-collapse we hit pre-patch;
sandbox leak count returned to ~0 (was steady at 50-100/hr).

API and Usage Example

Three new env-var knobs (all optional; defaults pre-tuned for 8 trays ×
agent_concurrency=360):

export UNIAGENT_NUM_WORKERS=8         # rollouter worker count
export MODAL_MAX_STARTING=128         # fleet-wide STARTING cap
export MODAL_INIT_WALL_BUDGET=900     # per-trajectory init seconds

No public API changes. Existing callers of ModalDeployment.start() /
.stop() work unchanged.

Design & Code Changes

  • _get_starting_semaphore() — lazy singleton, constructed inside the
    event loop on first use; per-worker share computed as
    max(1, MAX_STARTING // NUM_WORKERS).
  • _start() now wraps the entire create-tunnel-runtime block in
    async with _get_starting_semaphore(), releasing immediately after
    runtime.create_session returns.
  • start() retry loop: max_retries=5→2, exp-backoff cap 30s→10s,
    outer MODAL_INIT_WALL_BUDGET deadline, per-attempt
    asyncio.wait_for(_start(), timeout=max(60, remaining)).
  • stop(): each cleanup step is try/except-wrapped so a transient
    failure in runtime.close does not skip sandbox.terminate;
    terminate.aio() retried once if first attempt fails.

Checklist Before Submitting

  • Read the Contribute Guide
  • pre-commit run --files uni_agent/deployment/modal/deployment.py tests/deployment/test_modal_starting_limiter.py — all hooks passed (ruff, ruff format, mypy, compile-all)
  • Tests added (tests/deployment/test_modal_starting_limiter.py, 12 tests, runtime-agnostic — no Modal network calls)
  • AI assistance was used (Claude Code); the submitting human (@aoshen02) reviewed every changed line and ran the test suite.
  • Not duplicating any open PR (PR [deployment,docs] fix: select local OCI runtime endpoint by network path #44 changes local/deployment.py, not modal/)

Three interlocked changes to make ModalDeployment survive a multi-day
RL training run at agent_concurrency >= 128. All three are responses to
distinct failure modes we hit on a Qwen3-235B SWE-bench campaign across
~1.4M trajectories on Modal.

1) Cap concurrent sandboxes in the STARTING state.
   The dominant failure mode at high concurrency is "Runtime did not
   start within 299s" — too many sandboxes simultaneously in Modal's
   runtime-start pipeline, not the sandbox CREATE rate (Modal confirmed
   30/s + 2k burst, vastly above our ~1-2/s). Add an asyncio.Semaphore
   keyed on a fleet-wide MODAL_MAX_STARTING (default 128), divided by
   UNIAGENT_NUM_WORKERS to derive each worker's share — same pattern
   already used for the per-worker agent_loop semaphore. The permit is
   held from sandbox.create through runtime alive, then released so
   the long tool-call body does not occupy a permit.

2) Bound a single trajectory's init wall-clock.
   Pre-patch: max_retries=5 * startup_timeout=300s = up to 25 minutes
   of one trajectory hoarding a STARTING permit (or pre-patch, hogging
   the global Modal cold-start budget). Reduce max_retries to 2 and add
   MODAL_INIT_WALL_BUDGET (default 900s = 15 min) as a hard cap.
   asyncio.wait_for around each _start() attempt prevents a hung
   modal.Sandbox.create from blocking past the deadline. When the
   budget is exhausted the trajectory raises; the outer agent_loop
   converts it into a reward=0 masked sample.

3) Guarantee Modal sandbox termination on teardown.
   Observed (round 12, 2026-05-18): self._runtime.close() raised
   aiohttp.ServerDisconnectedError when the in-sandbox agent server
   had already torn down its socket. Without try/except, stop()
   returned early and self._sandbox.terminate.aio() never ran. After
   thousands of trajectories ~847 sandboxes were leaked on the Modal
   side, hitting the account's concurrent-sandbox cap and 100%
   failing new sandbox creates on subsequent runs. Wrap each step;
   guarantee terminate is attempted (and retried once) even when
   runtime.close fails.

Env-var knobs (all optional, defaults pre-tuned for our 8 trays *
agent_concurrency=360 campaign):
  UNIAGENT_NUM_WORKERS=8         (rollouter worker count)
  MODAL_MAX_STARTING=128         (fleet-wide STARTING cap)
  MODAL_INIT_WALL_BUDGET=900     (per-trajectory init seconds)

Test plan
  pytest tests/deployment/test_modal_starting_limiter.py -v
  12 tests covering:
    - per-worker share math (incl. clamp-to-1)
    - singleton + lazy init
    - max_retries=2 (not 5)
    - wall-budget short-circuits subsequent attempts
    - wait_for cancels a hung _start
    - serialization across concurrent _start callers
    - permit released on failure path
  All 12 passed in 65s.

Production validation: ran 6+ consecutive 8-hour rollouter sessions
without the "299s runtime not start" cluster-collapse pattern we hit
pre-patch; sandbox leak count returned to ~0 (was steady at 50-100/hr).

This PR includes AI assistance (Claude Code).

Signed-off-by: aoshen02 <aoshen@inferact.ai>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a Modal cold-start fleet limiter to prevent sandbox startup timeouts under high concurrency. It introduces a lazy-initialized semaphore to cap concurrent starting sandboxes per worker, reduces start retries from 5 to 2, enforces a wall-clock budget, and makes sandbox teardown more robust to prevent resource leaks. A comprehensive test suite is also added. The review feedback recommends extracting the hardcoded 60-second timeout into a module-level constant to speed up unit tests, and defensively guarding against a potential division-by-zero error if the number of workers is misconfigured to zero.

Comment on lines +42 to +45
_NUM_WORKERS = int(os.getenv("UNIAGENT_NUM_WORKERS", "8"))
_MAX_STARTING_GLOBAL = int(os.getenv("MODAL_MAX_STARTING", "128"))
_INIT_WALL_BUDGET = float(os.getenv("MODAL_INIT_WALL_BUDGET", "900"))
_STARTING_SEMA: asyncio.Semaphore | None = None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Introduce a module-level constant _MIN_ATTEMPT_TIMEOUT instead of hardcoding 60.0 seconds inside the start method. This allows unit tests to monkeypatch the timeout to a very small value, drastically reducing the test suite execution time from over 60 seconds to under a second.

Suggested change
_NUM_WORKERS = int(os.getenv("UNIAGENT_NUM_WORKERS", "8"))
_MAX_STARTING_GLOBAL = int(os.getenv("MODAL_MAX_STARTING", "128"))
_INIT_WALL_BUDGET = float(os.getenv("MODAL_INIT_WALL_BUDGET", "900"))
_STARTING_SEMA: asyncio.Semaphore | None = None
_NUM_WORKERS = int(os.getenv("UNIAGENT_NUM_WORKERS", "8"))
_MAX_STARTING_GLOBAL = int(os.getenv("MODAL_MAX_STARTING", "128"))
_INIT_WALL_BUDGET = float(os.getenv("MODAL_INIT_WALL_BUDGET", "900"))
_MIN_ATTEMPT_TIMEOUT = 60.0
_STARTING_SEMA: asyncio.Semaphore | None = None

Comment on lines +53 to +55
if _STARTING_SEMA is None:
per_worker = max(1, _MAX_STARTING_GLOBAL // _NUM_WORKERS)
_STARTING_SEMA = asyncio.Semaphore(per_worker)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Defensively guard against ZeroDivisionError or unexpected behavior if _NUM_WORKERS is misconfigured to 0 or a negative value.

Suggested change
if _STARTING_SEMA is None:
per_worker = max(1, _MAX_STARTING_GLOBAL // _NUM_WORKERS)
_STARTING_SEMA = asyncio.Semaphore(per_worker)
if _STARTING_SEMA is None:
num_workers = max(1, _NUM_WORKERS)
per_worker = max(1, _MAX_STARTING_GLOBAL // num_workers)
_STARTING_SEMA = asyncio.Semaphore(per_worker)

break
try:
await self._start()
await asyncio.wait_for(self._start(), timeout=max(60.0, remaining))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use the newly introduced _MIN_ATTEMPT_TIMEOUT constant instead of the hardcoded 60.0 value.

Suggested change
await asyncio.wait_for(self._start(), timeout=max(60.0, remaining))
await asyncio.wait_for(self._start(), timeout=max(_MIN_ATTEMPT_TIMEOUT, remaining))

Comment on lines +33 to +44
def _reset_limiter_state(monkeypatch, *, num_workers=8, max_starting=128, wall_budget=900.0):
"""Pin module constants to known values and force semaphore re-init.

Module-level constants are read at import; we mutate them with
monkeypatch.setattr so they revert after the test. `_STARTING_SEMA`
is the singleton cache -- clear it so the next call rebuilds with
the patched values.
"""
monkeypatch.setattr(mod, "_NUM_WORKERS", num_workers, raising=True)
monkeypatch.setattr(mod, "_MAX_STARTING_GLOBAL", max_starting, raising=True)
monkeypatch.setattr(mod, "_INIT_WALL_BUDGET", wall_budget, raising=True)
monkeypatch.setattr(mod, "_STARTING_SEMA", None, raising=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update _reset_limiter_state to monkeypatch _MIN_ATTEMPT_TIMEOUT to a small default value (e.g., 0.1 seconds) so that tests involving timeouts do not block for 60 seconds.

Suggested change
def _reset_limiter_state(monkeypatch, *, num_workers=8, max_starting=128, wall_budget=900.0):
"""Pin module constants to known values and force semaphore re-init.
Module-level constants are read at import; we mutate them with
monkeypatch.setattr so they revert after the test. `_STARTING_SEMA`
is the singleton cache -- clear it so the next call rebuilds with
the patched values.
"""
monkeypatch.setattr(mod, "_NUM_WORKERS", num_workers, raising=True)
monkeypatch.setattr(mod, "_MAX_STARTING_GLOBAL", max_starting, raising=True)
monkeypatch.setattr(mod, "_INIT_WALL_BUDGET", wall_budget, raising=True)
monkeypatch.setattr(mod, "_STARTING_SEMA", None, raising=True)
def _reset_limiter_state(monkeypatch, *, num_workers=8, max_starting=128, wall_budget=900.0, min_attempt_timeout=0.1):
"""Pin module constants to known values and force semaphore re-init.
Module-level constants are read at import; we mutate them with
monkeypatch.setattr so they revert after the test. `_STARTING_SEMA`
is the singleton cache -- clear it so the next call rebuilds with
the patched values.
"""
monkeypatch.setattr(mod, "_NUM_WORKERS", num_workers, raising=True)
monkeypatch.setattr(mod, "_MAX_STARTING_GLOBAL", max_starting, raising=True)
monkeypatch.setattr(mod, "_INIT_WALL_BUDGET", wall_budget, raising=True)
monkeypatch.setattr(mod, "_MIN_ATTEMPT_TIMEOUT", min_attempt_timeout, raising=True)
monkeypatch.setattr(mod, "_STARTING_SEMA", None, raising=True)

Comment on lines +245 to +249
# Verify we don't actually wait the full 60s wait_for floor: that
# depends on Python's asyncio.wait_for behavior; with budget < 60 we
# rely on the OUTER loop's deadline check after attempt 1 to bail.
# NOTE: this test mainly proves no infinite hang.
assert elapsed < 90.0, f"start() must not hang forever, elapsed={elapsed:.1f}s"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

With the shortened _MIN_ATTEMPT_TIMEOUT, we can tighten the assertion to ensure that the timeout is triggered quickly (e.g., within 2 seconds) rather than waiting up to 90 seconds.

Suggested change
# Verify we don't actually wait the full 60s wait_for floor: that
# depends on Python's asyncio.wait_for behavior; with budget < 60 we
# rely on the OUTER loop's deadline check after attempt 1 to bail.
# NOTE: this test mainly proves no infinite hang.
assert elapsed < 90.0, f"start() must not hang forever, elapsed={elapsed:.1f}s"
# Verify we don't actually wait the full 60s wait_for floor: that
# depends on Python's asyncio.wait_for behavior; with budget < 60 we
# rely on the OUTER loop's deadline check after attempt 1 to bail.
# NOTE: this test mainly proves no infinite hang.
assert elapsed < 2.0, f"start() must not hang forever, elapsed={elapsed:.1f}s"

@yyDing1 yyDing1 merged commit 93f8c12 into verl-project:main Jun 4, 2026
3 checks passed
@yyDing1 yyDing1 deleted the feat/modal-sandbox-lifecycle branch June 4, 2026 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants