[deployment,docs] fix: select local OCI runtime endpoint by network path#44
[deployment,docs] fix: select local OCI runtime endpoint by network path#44C23333 wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the runtime_port configuration and refactors how the OCI container runtime endpoint is resolved based on the network configuration. The review feedback correctly identifies two issues in the _get_runtime_endpoint resolution logic: a bug when connecting from within a container using a custom network, and a cross-platform compatibility issue when running on a host with a custom network (such as on macOS/Windows where container IPs are not routable). A code suggestion is provided to ensure that container IP and runtime_port are only used when actually running inside a container or using host networking.
| def _get_runtime_endpoint(self, container_name: str, published_port: int, network: str | None) -> tuple[str, int]: | ||
| if self._config.host: | ||
| return self._config.host | ||
| port = self._config.runtime_port if network == _HOST_NETWORK else published_port | ||
| return self._config.host, port | ||
|
|
||
| if network == _HOST_NETWORK: | ||
| return "http://127.0.0.1", self._config.runtime_port | ||
|
|
||
| if _is_running_in_container() or self._config.network: | ||
| if network or _is_running_in_container(): | ||
| container_ip = self._get_container_ip(container_name) | ||
| if container_ip: | ||
| return f"http://{container_ip}" | ||
| return f"http://{container_ip}", self._config.runtime_port | ||
|
|
||
| return "http://127.0.0.1" | ||
| return "http://127.0.0.1", published_port |
There was a problem hiding this comment.
There are two issues in the port and host resolution logic within _get_runtime_endpoint:
-
Bug when
self._config.hostis set inside a container:
If Uni-Agent is running inside a container (_is_running_in_container()isTrue) andnetworkis not"host"(e.g., it isNoneor a custom bridge network),network == _HOST_NETWORKevaluates toFalse. This causes the code to selectpublished_port. However, container-to-container communication bypasses port publishing and must connect directly to the target container'sruntime_port. Attempting to connect topublished_portwill fail. -
Cross-platform compatibility issue when running on the host with a custom network:
If Uni-Agent is running on the host (_is_running_in_container()isFalse) but a customnetworkis specified,network or _is_running_in_container()evaluates toTrue. This forces the use ofcontainer_ipandruntime_port. On macOS and Windows (Docker Desktop), container IPs are not routable from the host, causing the connection to fail. Since the port is published to the host's localhost anyway, we should connect via127.0.0.1:published_portwhen running on the host.
Suggested Fix
We can resolve both issues by ensuring we only use runtime_port and container_ip when Uni-Agent is actually running inside a container (or when using host networking).
Note: If you apply this change, you will need to update the following tests in tests/deployment/test_local_deployment.py to mock _is_running_in_container to return True (since they simulate container-to-container network paths):
test_oci_runtime_uses_container_port_when_connecting_over_container_networktest_oci_runtime_falls_back_to_published_port_when_container_ip_is_unavailable
| def _get_runtime_endpoint(self, container_name: str, published_port: int, network: str | None) -> tuple[str, int]: | |
| if self._config.host: | |
| return self._config.host | |
| port = self._config.runtime_port if network == _HOST_NETWORK else published_port | |
| return self._config.host, port | |
| if network == _HOST_NETWORK: | |
| return "http://127.0.0.1", self._config.runtime_port | |
| if _is_running_in_container() or self._config.network: | |
| if network or _is_running_in_container(): | |
| container_ip = self._get_container_ip(container_name) | |
| if container_ip: | |
| return f"http://{container_ip}" | |
| return f"http://{container_ip}", self._config.runtime_port | |
| return "http://127.0.0.1" | |
| return "http://127.0.0.1", published_port | |
| def _get_runtime_endpoint(self, container_name: str, published_port: int, network: str | None) -> tuple[str, int]: | |
| if self._config.host: | |
| port = self._config.runtime_port if (network == _HOST_NETWORK or _is_running_in_container()) else published_port | |
| return self._config.host, port | |
| if network == _HOST_NETWORK: | |
| return "http://127.0.0.1", self._config.runtime_port | |
| if _is_running_in_container(): | |
| container_ip = self._get_container_ip(container_name) | |
| if container_ip: | |
| return f"http://{container_ip}", self._config.runtime_port | |
| return "http://127.0.0.1", published_port |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the local deployment logic to support dynamic runtime port selection depending on whether the agent is running inside a container, on host networking, or over a custom OCI network. It also adds support for detecting Podman containers via /run/.containerenv and includes extensive unit tests. The reviewer feedback points out that when running inside a container, if the host is explicitly set to a loopback address (such as host.docker.internal or localhost), the connection back to the host should use the published port instead of the internal runtime port. The reviewer provides a code suggestion to fix this and requests an additional test case to verify the behavior.
| if self._config.host: | ||
| return self._config.host | ||
| port = ( | ||
| self._config.runtime_port if network == _HOST_NETWORK or _is_running_in_container() else published_port | ||
| ) | ||
| return self._config.host, port |
There was a problem hiding this comment.
When _is_running_in_container() is True and self._config.host is explicitly set to a host-loopback address (such as host.docker.internal, localhost, or 127.0.0.1), the connection from within the container back to the host must go through the host's published port rather than the container's internal runtime_port. Currently, the code incorrectly selects runtime_port for all cases where _is_running_in_container() is True.
| if self._config.host: | |
| return self._config.host | |
| port = ( | |
| self._config.runtime_port if network == _HOST_NETWORK or _is_running_in_container() else published_port | |
| ) | |
| return self._config.host, port | |
| if self._config.host: | |
| is_host_loopback = any(h in self._config.host for h in ("localhost", "127.0.0.1", "0.0.0.0", "host.docker.internal")) | |
| port = ( | |
| self._config.runtime_port | |
| if network == _HOST_NETWORK or (_is_running_in_container() and not is_host_loopback) | |
| else published_port | |
| ) | |
| return self._config.host, port |
| def test_oci_runtime_uses_container_port_with_explicit_host_from_container(monkeypatch): | ||
| deployment = LocalDeployment( | ||
| run_id="test", | ||
| type="local", | ||
| container_runtime="docker", | ||
| image="python:3.12", | ||
| host="http://sandbox", | ||
| runtime_port=8000, | ||
| network="agent-net", | ||
| ) | ||
| deployment._runtime_exec = lambda args, check=True: _completed_container() | ||
| monkeypatch.setattr(local_deployment, "_is_running_in_container", lambda: True) | ||
| monkeypatch.setattr(local_deployment, "LocalRuntime", _CapturingRuntime) | ||
| _CapturingRuntime.configs = [] | ||
|
|
||
| asyncio.run(deployment._start_oci_container(token="secret", container_name="sandbox-name", published_port=4567)) | ||
|
|
||
| runtime_config, run_id = _CapturingRuntime.configs[-1] | ||
| assert run_id == "test" | ||
| assert runtime_config.host == "http://sandbox" | ||
| assert runtime_config.port == 8000 |
There was a problem hiding this comment.
Add a test case to verify that when connecting from within a container back to the host via a host-loopback address (like host.docker.internal), the runtime correctly uses the published port instead of the container's internal runtime_port.
def test_oci_runtime_uses_container_port_with_explicit_host_from_container(monkeypatch):
deployment = LocalDeployment(
run_id="test",
type="local",
container_runtime="docker",
image="python:3.12",
host="http://sandbox",
runtime_port=8000,
network="agent-net",
)
deployment._runtime_exec = lambda args, check=True: _completed_container()
monkeypatch.setattr(local_deployment, "_is_running_in_container", lambda: True)
monkeypatch.setattr(local_deployment, "LocalRuntime", _CapturingRuntime)
_CapturingRuntime.configs = []
asyncio.run(deployment._start_oci_container(token="secret", container_name="sandbox-name", published_port=4567))
runtime_config, run_id = _CapturingRuntime.configs[-1]
assert run_id == "test"
assert runtime_config.host == "http://sandbox"
assert runtime_config.port == 8000
def test_oci_runtime_uses_published_port_with_host_loopback_from_container(monkeypatch):
deployment = LocalDeployment(
run_id="test",
type="local",
container_runtime="docker",
image="python:3.12",
host="http://host.docker.internal",
runtime_port=8000,
network="agent-net",
)
deployment._runtime_exec = lambda args, check=True: _completed_container()
monkeypatch.setattr(local_deployment, "_is_running_in_container", lambda: True)
monkeypatch.setattr(local_deployment, "LocalRuntime", _CapturingRuntime)
_CapturingRuntime.configs = []
asyncio.run(deployment._start_oci_container(token="secret", container_name="sandbox-name", published_port=4567))
runtime_config, run_id = _CapturingRuntime.configs[-1]
assert run_id == "test"
assert runtime_config.host == "http://host.docker.internal"
assert runtime_config.port == 4567|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request updates the local deployment logic to correctly handle port and host resolution when running Uni-Agent inside or outside of OCI containers (Docker/Podman), particularly with custom or host networks. It introduces a new runtime_port configuration option and refactors the runtime endpoint selection. The review feedback correctly points out that loopback addresses (such as localhost and 127.0.0.1) should be treated as host gateway names when Uni-Agent runs inside a container to prevent incorrect port selection and connection failures, and suggests corresponding test updates.
| _APPTAINER_RUNTIMES = {"apptainer", "singularity"} | ||
| _CONTAINER_RUNTIME_ENV_VARS = ("UNI_AGENT_CONTAINER_RUNTIME", "LOCAL_CONTAINER_RUNTIME") | ||
| _DEFAULT_CONTAINER_RUNTIME_CANDIDATES = ("apptainer", "singularity", "docker", "podman") | ||
| _HOST_GATEWAY_NAMES = {"host.docker.internal", "host.containers.internal"} |
There was a problem hiding this comment.
Loopback addresses like localhost, 127.0.0.1, 0.0.0.0, and ::1 should be treated as host gateway names when Uni-Agent is running inside a container (e.g., with host networking). If they are not included in _HOST_GATEWAY_NAMES, the runtime endpoint will incorrectly select runtime_port instead of published_port when connecting to the host via loopback, causing connection failures.
_HOST_GATEWAY_NAMES = {
"host.docker.internal",
"host.containers.internal",
"localhost",
"127.0.0.1",
"0.0.0.0",
"::1",
}|
|
||
|
|
||
| def test_oci_runtime_uses_published_port_with_host_gateway_from_container(monkeypatch): | ||
| for host in ("http://host.docker.internal", "http://host.containers.internal"): |
There was a problem hiding this comment.
Update the test to also verify that loopback addresses (localhost, 127.0.0.1, etc.) are correctly identified as host gateway names and use the published port when connecting from a container.
| for host in ("http://host.docker.internal", "http://host.containers.internal"): | |
| for host in ("http://host.docker.internal", "http://host.containers.internal", "http://localhost", "http://127.0.0.1"): |
There was a problem hiding this comment.
host.docker.internal and host.containers.internal are host-gateway names, so they should use the published port.
localhost, 127.0.0.1, 0.0.0.0, and ::1 are different. In a normal bridge network, loopback points to the current container, not the host, and 0.0.0.0 is mainly a bind address rather than a client endpoint. So they are intentionally not treated as host-gateway names here.
|
Could you help review this PR? @Jianbing-D |
…47) ### What does this PR do? Three interlocked changes to `ModalDeployment` to survive multi-day RL training at `agent_concurrency >= 128`. All three respond to distinct failure modes hit on a Qwen3-235B SWE-bench campaign across ~1.4M trajectories on Modal. **1) Cap concurrent sandboxes in the STARTING state.** Dominant failure at high concurrency is `Runtime did not start within 299s` — too many sandboxes simultaneously in Modal's runtime-start pipeline, not the sandbox CREATE rate (Modal Support confirmed 30/s + 2k burst, far above our ~1-2/s). Adds an `asyncio.Semaphore` keyed on a fleet-wide `MODAL_MAX_STARTING` (default 128), divided by `UNIAGENT_NUM_WORKERS` to derive each worker's share — same pattern already used for the per-worker `agent_loop` semaphore. The permit is held from `sandbox.create` through `runtime.create_session`, then released so the long tool-call body does not occupy a permit. **2) Bound a single trajectory's init wall-clock.** Pre-patch: `max_retries=5 × startup_timeout=300s = up to 25 min` of one trajectory hogging a STARTING permit. Reduce `max_retries` to 2 and add `MODAL_INIT_WALL_BUDGET` (default 900s = 15 min) as a hard cap. `asyncio.wait_for` around each `_start()` prevents a hung `modal.Sandbox.create` from blocking past the deadline. When the budget is exhausted the trajectory raises; the outer `agent_loop` converts it into a `reward=0` masked sample. **3) Guarantee Modal sandbox termination on teardown.** Observed (round 12, 2026-05-18): `self._runtime.close()` raised `aiohttp.ServerDisconnectedError` when the in-sandbox agent server had already torn down its socket. Without `try/except`, `stop()` returned early and `self._sandbox.terminate.aio()` never ran. After thousands of trajectories ~847 sandboxes were leaked, hitting the account's concurrent-sandbox cap and 100% failing new creates on subsequent runs. Wraps each step; guarantees terminate is attempted (and retried once) even when `runtime.close` fails. ### Checklist Before Starting - [x] Search for similar PRs/issues: - `gh pr list --repo verl-project/uni-agent --state open --search "modal"` → none touching `modal/deployment.py` (PR #44 touches `local/deployment.py` only) - `gh issue list --search "modal sandbox"` → only RFC-level (#31 OpenYuanrong, #43 local SWE-Bench), nothing on STARTING-state limiter - [x] Format the PR title as `[deployment] fix: ...` (single module, single type) ### Test Unit tests added in `tests/deployment/test_modal_starting_limiter.py`. ```bash pytest tests/deployment/test_modal_starting_limiter.py -v # 12 passed in 64.76s ``` Test coverage: - Per-worker share math (`MAX_STARTING / NUM_WORKERS`, incl. clamp-to-1) - Singleton + lazy init of `_STARTING_SEMA` - `max_retries=2` (proves regression from old `=5`) - Wall-budget short-circuits subsequent attempts when exhausted - `asyncio.wait_for` cancels a hung `_start` - Serialization across concurrent `_start` callers (peak ≤ permits) - Permit released on failure path (no leak → no deadlock) **Production validation**: 6+ consecutive 8-hour rollouter sessions without the "299s runtime not start" cluster-collapse we hit pre-patch; sandbox leak count returned to ~0 (was steady at 50-100/hr). ### API and Usage Example Three new env-var knobs (all optional; defaults pre-tuned for 8 trays × `agent_concurrency=360`): ```bash export UNIAGENT_NUM_WORKERS=8 # rollouter worker count export MODAL_MAX_STARTING=128 # fleet-wide STARTING cap export MODAL_INIT_WALL_BUDGET=900 # per-trajectory init seconds ``` No public API changes. Existing callers of `ModalDeployment.start()` / `.stop()` work unchanged. ### Design & Code Changes - `_get_starting_semaphore()` — lazy singleton, constructed inside the event loop on first use; per-worker share computed as `max(1, MAX_STARTING // NUM_WORKERS)`. - `_start()` now wraps the entire create-tunnel-runtime block in `async with _get_starting_semaphore()`, releasing immediately after `runtime.create_session` returns. - `start()` retry loop: `max_retries=5→2`, exp-backoff cap `30s→10s`, outer `MODAL_INIT_WALL_BUDGET` deadline, per-attempt `asyncio.wait_for(_start(), timeout=max(60, remaining))`. - `stop()`: each cleanup step is `try/except`-wrapped so a transient failure in `runtime.close` does not skip `sandbox.terminate`; `terminate.aio()` retried once if first attempt fails. ### Checklist Before Submitting - [x] Read the Contribute Guide - [x] `pre-commit run --files uni_agent/deployment/modal/deployment.py tests/deployment/test_modal_starting_limiter.py` — all hooks passed (ruff, ruff format, mypy, compile-all) - [x] Tests added (`tests/deployment/test_modal_starting_limiter.py`, 12 tests, runtime-agnostic — no Modal network calls) - [x] AI assistance was used (Claude Code); the submitting human (@aoshen02) reviewed every changed line and ran the test suite. - [x] Not duplicating any open PR (PR #44 changes `local/deployment.py`, not `modal/`) --------- Signed-off-by: aoshen02 <aoshen@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: yyDing1 <yyding.me@gmail.com>
What does this PR do?
This PR fixes how
LocalDeploymentbuilds theRemoteRuntimeConfigendpoint for Docker/Podman sandboxes.For Docker/Podman, the sandbox server listens on
runtime_portinside the container. When Uni-Agent connects from the host through port publishing, it must connect topublished_port.Example:
In this case:
runtime_port = 8000: the port whereswerex.serverlistens inside the sandbox containerpublished_port = 4567: the host-side port that Uni-Agent connects toThe previous implementation always used
runtime_portfor the runtime client. That breaks the default host-side Docker/Podman path because Uni-Agent would try127.0.0.1:8000instead of127.0.0.1:4567.This PR resolves the runtime endpoint from the actual Docker/Podman access path:
127.0.0.1:published_portruntime_portruntime_portand skips port publishinghostfollows the same port rule based on whether Uni-Agent is running on the host or inside a containerIt also detects the Podman container marker at
/run/.containerenvwhen checking whether Uni-Agent itself is running inside a container.Why is this needed?
Local Docker/Podman deployment has more than one valid access path.
Host-side port publishing:
Container-to-container access:
Host networking:
Using a single port for all paths makes one of these cases wrong. This PR keeps those paths separate in
LocalDeploymentand adds regression coverage for the Docker/Podman endpoint matrix.Checklist Before Starting
[{modules}] {type}: {description}Test
python -m pytest tests/deployment/test_local_deployment.py -qpython -m ruff check uni_agent/deployment/local/deployment.py tests/deployment/test_local_deployment.pypython -m ruff format --check uni_agent/deployment/local/deployment.py tests/deployment/test_local_deployment.pygit diff --check