Skip to content

[deployment,docs] fix: select local OCI runtime endpoint by network path#44

Open
C23333 wants to merge 3 commits into
verl-project:mainfrom
C23333:codex/fix-local-oci-published-port
Open

[deployment,docs] fix: select local OCI runtime endpoint by network path#44
C23333 wants to merge 3 commits into
verl-project:mainfrom
C23333:codex/fix-local-oci-published-port

Conversation

@C23333
Copy link
Copy Markdown
Contributor

@C23333 C23333 commented May 27, 2026

What does this PR do?

This PR fixes how LocalDeployment builds the RemoteRuntimeConfig endpoint for Docker/Podman sandboxes.

For Docker/Podman, the sandbox server listens on runtime_port inside the container. When Uni-Agent connects from the host through port publishing, it must connect to published_port.

Example:

docker run -p 4567:8000 ...

In this case:

  • runtime_port = 8000: the port where swerex.server listens inside the sandbox container
  • published_port = 4567: the host-side port that Uni-Agent connects to

The previous implementation always used runtime_port for the runtime client. That breaks the default host-side Docker/Podman path because Uni-Agent would try 127.0.0.1:8000 instead of 127.0.0.1:4567.

This PR resolves the runtime endpoint from the actual Docker/Podman access path:

  • host-side port publishing uses 127.0.0.1:published_port
  • Uni-Agent running inside a container can use the sandbox container IP with runtime_port
  • host networking uses runtime_port and skips port publishing
  • explicit host follows the same port rule based on whether Uni-Agent is running on the host or inside a container

It also detects the Podman container marker at /run/.containerenv when checking whether Uni-Agent itself is running inside a container.

Why is this needed?

Local Docker/Podman deployment has more than one valid access path.

Host-side port publishing:

Uni-Agent on host -> http://127.0.0.1:4567 -> sandbox:8000

Container-to-container access:

Uni-Agent container -> http://172.18.0.9:8000

Host networking:

Uni-Agent -> http://127.0.0.1:8000

Using a single port for all paths makes one of these cases wrong. This PR keeps those paths separate in LocalDeployment and adds regression coverage for the Docker/Podman endpoint matrix.

Checklist Before Starting

  • PR title follows [{modules}] {type}: {description}
  • Related issue / PR: N/A

Test

  • python -m pytest tests/deployment/test_local_deployment.py -q
22 passed
  • python -m ruff check uni_agent/deployment/local/deployment.py tests/deployment/test_local_deployment.py
All checks passed!
  • python -m ruff format --check uni_agent/deployment/local/deployment.py tests/deployment/test_local_deployment.py
2 files already formatted
  • git diff --check
passed

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the runtime_port configuration and refactors how the OCI container runtime endpoint is resolved based on the network configuration. The review feedback correctly identifies two issues in the _get_runtime_endpoint resolution logic: a bug when connecting from within a container using a custom network, and a cross-platform compatibility issue when running on a host with a custom network (such as on macOS/Windows where container IPs are not routable). A code suggestion is provided to ensure that container IP and runtime_port are only used when actually running inside a container or using host networking.

Comment on lines +190 to +203
def _get_runtime_endpoint(self, container_name: str, published_port: int, network: str | None) -> tuple[str, int]:
if self._config.host:
return self._config.host
port = self._config.runtime_port if network == _HOST_NETWORK else published_port
return self._config.host, port

if network == _HOST_NETWORK:
return "http://127.0.0.1", self._config.runtime_port

if _is_running_in_container() or self._config.network:
if network or _is_running_in_container():
container_ip = self._get_container_ip(container_name)
if container_ip:
return f"http://{container_ip}"
return f"http://{container_ip}", self._config.runtime_port

return "http://127.0.0.1"
return "http://127.0.0.1", published_port
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two issues in the port and host resolution logic within _get_runtime_endpoint:

  1. Bug when self._config.host is set inside a container:
    If Uni-Agent is running inside a container (_is_running_in_container() is True) and network is not "host" (e.g., it is None or a custom bridge network), network == _HOST_NETWORK evaluates to False. This causes the code to select published_port. However, container-to-container communication bypasses port publishing and must connect directly to the target container's runtime_port. Attempting to connect to published_port will fail.

  2. Cross-platform compatibility issue when running on the host with a custom network:
    If Uni-Agent is running on the host (_is_running_in_container() is False) but a custom network is specified, network or _is_running_in_container() evaluates to True. This forces the use of container_ip and runtime_port. On macOS and Windows (Docker Desktop), container IPs are not routable from the host, causing the connection to fail. Since the port is published to the host's localhost anyway, we should connect via 127.0.0.1:published_port when running on the host.

Suggested Fix

We can resolve both issues by ensuring we only use runtime_port and container_ip when Uni-Agent is actually running inside a container (or when using host networking).

Note: If you apply this change, you will need to update the following tests in tests/deployment/test_local_deployment.py to mock _is_running_in_container to return True (since they simulate container-to-container network paths):

  • test_oci_runtime_uses_container_port_when_connecting_over_container_network
  • test_oci_runtime_falls_back_to_published_port_when_container_ip_is_unavailable
Suggested change
def _get_runtime_endpoint(self, container_name: str, published_port: int, network: str | None) -> tuple[str, int]:
if self._config.host:
return self._config.host
port = self._config.runtime_port if network == _HOST_NETWORK else published_port
return self._config.host, port
if network == _HOST_NETWORK:
return "http://127.0.0.1", self._config.runtime_port
if _is_running_in_container() or self._config.network:
if network or _is_running_in_container():
container_ip = self._get_container_ip(container_name)
if container_ip:
return f"http://{container_ip}"
return f"http://{container_ip}", self._config.runtime_port
return "http://127.0.0.1"
return "http://127.0.0.1", published_port
def _get_runtime_endpoint(self, container_name: str, published_port: int, network: str | None) -> tuple[str, int]:
if self._config.host:
port = self._config.runtime_port if (network == _HOST_NETWORK or _is_running_in_container()) else published_port
return self._config.host, port
if network == _HOST_NETWORK:
return "http://127.0.0.1", self._config.runtime_port
if _is_running_in_container():
container_ip = self._get_container_ip(container_name)
if container_ip:
return f"http://{container_ip}", self._config.runtime_port
return "http://127.0.0.1", published_port

@C23333 C23333 marked this pull request as draft May 27, 2026 07:47
@C23333 C23333 changed the title [deployment,docs] fix: select local OCI runtime port by host [deployment,docs] fix: select local OCI runtime endpoint by network path May 28, 2026
@C23333 C23333 marked this pull request as ready for review May 28, 2026 03:16
@C23333
Copy link
Copy Markdown
Contributor Author

C23333 commented May 28, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the local deployment logic to support dynamic runtime port selection depending on whether the agent is running inside a container, on host networking, or over a custom OCI network. It also adds support for detecting Podman containers via /run/.containerenv and includes extensive unit tests. The reviewer feedback points out that when running inside a container, if the host is explicitly set to a loopback address (such as host.docker.internal or localhost), the connection back to the host should use the published port instead of the internal runtime port. The reviewer provides a code suggestion to fix this and requests an additional test case to verify the behavior.

Comment on lines 191 to +195
if self._config.host:
return self._config.host
port = (
self._config.runtime_port if network == _HOST_NETWORK or _is_running_in_container() else published_port
)
return self._config.host, port
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When _is_running_in_container() is True and self._config.host is explicitly set to a host-loopback address (such as host.docker.internal, localhost, or 127.0.0.1), the connection from within the container back to the host must go through the host's published port rather than the container's internal runtime_port. Currently, the code incorrectly selects runtime_port for all cases where _is_running_in_container() is True.

Suggested change
if self._config.host:
return self._config.host
port = (
self._config.runtime_port if network == _HOST_NETWORK or _is_running_in_container() else published_port
)
return self._config.host, port
if self._config.host:
is_host_loopback = any(h in self._config.host for h in ("localhost", "127.0.0.1", "0.0.0.0", "host.docker.internal"))
port = (
self._config.runtime_port
if network == _HOST_NETWORK or (_is_running_in_container() and not is_host_loopback)
else published_port
)
return self._config.host, port

Comment on lines +235 to +255
def test_oci_runtime_uses_container_port_with_explicit_host_from_container(monkeypatch):
deployment = LocalDeployment(
run_id="test",
type="local",
container_runtime="docker",
image="python:3.12",
host="http://sandbox",
runtime_port=8000,
network="agent-net",
)
deployment._runtime_exec = lambda args, check=True: _completed_container()
monkeypatch.setattr(local_deployment, "_is_running_in_container", lambda: True)
monkeypatch.setattr(local_deployment, "LocalRuntime", _CapturingRuntime)
_CapturingRuntime.configs = []

asyncio.run(deployment._start_oci_container(token="secret", container_name="sandbox-name", published_port=4567))

runtime_config, run_id = _CapturingRuntime.configs[-1]
assert run_id == "test"
assert runtime_config.host == "http://sandbox"
assert runtime_config.port == 8000
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add a test case to verify that when connecting from within a container back to the host via a host-loopback address (like host.docker.internal), the runtime correctly uses the published port instead of the container's internal runtime_port.

def test_oci_runtime_uses_container_port_with_explicit_host_from_container(monkeypatch):
    deployment = LocalDeployment(
        run_id="test",
        type="local",
        container_runtime="docker",
        image="python:3.12",
        host="http://sandbox",
        runtime_port=8000,
        network="agent-net",
    )
    deployment._runtime_exec = lambda args, check=True: _completed_container()
    monkeypatch.setattr(local_deployment, "_is_running_in_container", lambda: True)
    monkeypatch.setattr(local_deployment, "LocalRuntime", _CapturingRuntime)
    _CapturingRuntime.configs = []

    asyncio.run(deployment._start_oci_container(token="secret", container_name="sandbox-name", published_port=4567))

    runtime_config, run_id = _CapturingRuntime.configs[-1]
    assert run_id == "test"
    assert runtime_config.host == "http://sandbox"
    assert runtime_config.port == 8000


def test_oci_runtime_uses_published_port_with_host_loopback_from_container(monkeypatch):
    deployment = LocalDeployment(
        run_id="test",
        type="local",
        container_runtime="docker",
        image="python:3.12",
        host="http://host.docker.internal",
        runtime_port=8000,
        network="agent-net",
    )
    deployment._runtime_exec = lambda args, check=True: _completed_container()
    monkeypatch.setattr(local_deployment, "_is_running_in_container", lambda: True)
    monkeypatch.setattr(local_deployment, "LocalRuntime", _CapturingRuntime)
    _CapturingRuntime.configs = []

    asyncio.run(deployment._start_oci_container(token="secret", container_name="sandbox-name", published_port=4567))

    runtime_config, run_id = _CapturingRuntime.configs[-1]
    assert run_id == "test"
    assert runtime_config.host == "http://host.docker.internal"
    assert runtime_config.port == 4567

@C23333 C23333 marked this pull request as draft May 28, 2026 03:30
@C23333
Copy link
Copy Markdown
Contributor Author

C23333 commented May 28, 2026

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the local deployment logic to correctly handle port and host resolution when running Uni-Agent inside or outside of OCI containers (Docker/Podman), particularly with custom or host networks. It introduces a new runtime_port configuration option and refactors the runtime endpoint selection. The review feedback correctly points out that loopback addresses (such as localhost and 127.0.0.1) should be treated as host gateway names when Uni-Agent runs inside a container to prevent incorrect port selection and connection failures, and suggests corresponding test updates.

_APPTAINER_RUNTIMES = {"apptainer", "singularity"}
_CONTAINER_RUNTIME_ENV_VARS = ("UNI_AGENT_CONTAINER_RUNTIME", "LOCAL_CONTAINER_RUNTIME")
_DEFAULT_CONTAINER_RUNTIME_CANDIDATES = ("apptainer", "singularity", "docker", "podman")
_HOST_GATEWAY_NAMES = {"host.docker.internal", "host.containers.internal"}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Loopback addresses like localhost, 127.0.0.1, 0.0.0.0, and ::1 should be treated as host gateway names when Uni-Agent is running inside a container (e.g., with host networking). If they are not included in _HOST_GATEWAY_NAMES, the runtime endpoint will incorrectly select runtime_port instead of published_port when connecting to the host via loopback, causing connection failures.

_HOST_GATEWAY_NAMES = {
    "host.docker.internal",
    "host.containers.internal",
    "localhost",
    "127.0.0.1",
    "0.0.0.0",
    "::1",
}



def test_oci_runtime_uses_published_port_with_host_gateway_from_container(monkeypatch):
for host in ("http://host.docker.internal", "http://host.containers.internal"):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update the test to also verify that loopback addresses (localhost, 127.0.0.1, etc.) are correctly identified as host gateway names and use the published port when connecting from a container.

Suggested change
for host in ("http://host.docker.internal", "http://host.containers.internal"):
for host in ("http://host.docker.internal", "http://host.containers.internal", "http://localhost", "http://127.0.0.1"):

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host.docker.internal and host.containers.internal are host-gateway names, so they should use the published port.

localhost, 127.0.0.1, 0.0.0.0, and ::1 are different. In a normal bridge network, loopback points to the current container, not the host, and 0.0.0.0 is mainly a bind address rather than a client endpoint. So they are intentionally not treated as host-gateway names here.

@C23333 C23333 marked this pull request as ready for review May 28, 2026 05:11
@yyDing1
Copy link
Copy Markdown
Collaborator

yyDing1 commented May 29, 2026

Could you help review this PR? @Jianbing-D

yyDing1 added a commit that referenced this pull request Jun 4, 2026
…47)

### What does this PR do?

Three interlocked changes to `ModalDeployment` to survive multi-day RL
training at `agent_concurrency >= 128`. All three respond to distinct
failure modes hit on a Qwen3-235B SWE-bench campaign across ~1.4M
trajectories on Modal.

**1) Cap concurrent sandboxes in the STARTING state.**
Dominant failure at high concurrency is `Runtime did not start within
299s` — too many sandboxes simultaneously in Modal's runtime-start
pipeline, not the sandbox CREATE rate (Modal Support confirmed 30/s +
2k burst, far above our ~1-2/s). Adds an `asyncio.Semaphore` keyed on a
fleet-wide `MODAL_MAX_STARTING` (default 128), divided by
`UNIAGENT_NUM_WORKERS` to derive each worker's share — same pattern
already used for the per-worker `agent_loop` semaphore. The permit is
held from `sandbox.create` through `runtime.create_session`, then
released so the long tool-call body does not occupy a permit.

**2) Bound a single trajectory's init wall-clock.**
Pre-patch: `max_retries=5 × startup_timeout=300s = up to 25 min` of one
trajectory hogging a STARTING permit. Reduce `max_retries` to 2 and add
`MODAL_INIT_WALL_BUDGET` (default 900s = 15 min) as a hard cap.
`asyncio.wait_for` around each `_start()` prevents a hung
`modal.Sandbox.create` from blocking past the deadline. When the budget
is exhausted the trajectory raises; the outer `agent_loop` converts it
into a `reward=0` masked sample.

**3) Guarantee Modal sandbox termination on teardown.**
Observed (round 12, 2026-05-18): `self._runtime.close()` raised
`aiohttp.ServerDisconnectedError` when the in-sandbox agent server had
already torn down its socket. Without `try/except`, `stop()` returned
early and `self._sandbox.terminate.aio()` never ran. After thousands of
trajectories ~847 sandboxes were leaked, hitting the account's
concurrent-sandbox cap and 100% failing new creates on subsequent runs.
Wraps each step; guarantees terminate is attempted (and retried once)
even when `runtime.close` fails.

### Checklist Before Starting

- [x] Search for similar PRs/issues:
- `gh pr list --repo verl-project/uni-agent --state open --search
"modal"` → none touching `modal/deployment.py` (PR #44 touches
`local/deployment.py` only)
- `gh issue list --search "modal sandbox"` → only RFC-level (#31
OpenYuanrong, #43 local SWE-Bench), nothing on STARTING-state limiter
- [x] Format the PR title as `[deployment] fix: ...` (single module,
single type)

### Test

Unit tests added in `tests/deployment/test_modal_starting_limiter.py`.

```bash
pytest tests/deployment/test_modal_starting_limiter.py -v
# 12 passed in 64.76s
```

Test coverage:
- Per-worker share math (`MAX_STARTING / NUM_WORKERS`, incl. clamp-to-1)
- Singleton + lazy init of `_STARTING_SEMA`
- `max_retries=2` (proves regression from old `=5`)
- Wall-budget short-circuits subsequent attempts when exhausted
- `asyncio.wait_for` cancels a hung `_start`
- Serialization across concurrent `_start` callers (peak ≤ permits)
- Permit released on failure path (no leak → no deadlock)

**Production validation**: 6+ consecutive 8-hour rollouter sessions
without the "299s runtime not start" cluster-collapse we hit pre-patch;
sandbox leak count returned to ~0 (was steady at 50-100/hr).

### API and Usage Example

Three new env-var knobs (all optional; defaults pre-tuned for 8 trays ×
`agent_concurrency=360`):

```bash
export UNIAGENT_NUM_WORKERS=8         # rollouter worker count
export MODAL_MAX_STARTING=128         # fleet-wide STARTING cap
export MODAL_INIT_WALL_BUDGET=900     # per-trajectory init seconds
```

No public API changes. Existing callers of `ModalDeployment.start()` /
`.stop()` work unchanged.

### Design & Code Changes

- `_get_starting_semaphore()` — lazy singleton, constructed inside the
  event loop on first use; per-worker share computed as
  `max(1, MAX_STARTING // NUM_WORKERS)`.
- `_start()` now wraps the entire create-tunnel-runtime block in
  `async with _get_starting_semaphore()`, releasing immediately after
  `runtime.create_session` returns.
- `start()` retry loop: `max_retries=5→2`, exp-backoff cap `30s→10s`,
  outer `MODAL_INIT_WALL_BUDGET` deadline, per-attempt
  `asyncio.wait_for(_start(), timeout=max(60, remaining))`.
- `stop()`: each cleanup step is `try/except`-wrapped so a transient
  failure in `runtime.close` does not skip `sandbox.terminate`;
  `terminate.aio()` retried once if first attempt fails.

### Checklist Before Submitting

- [x] Read the Contribute Guide
- [x] `pre-commit run --files uni_agent/deployment/modal/deployment.py
tests/deployment/test_modal_starting_limiter.py` — all hooks passed
(ruff, ruff format, mypy, compile-all)
- [x] Tests added (`tests/deployment/test_modal_starting_limiter.py`, 12
tests, runtime-agnostic — no Modal network calls)
- [x] AI assistance was used (Claude Code); the submitting human
(@aoshen02) reviewed every changed line and ran the test suite.
- [x] Not duplicating any open PR (PR #44 changes `local/deployment.py`,
not `modal/`)

---------

Signed-off-by: aoshen02 <aoshen@inferact.ai>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: yyDing1 <yyding.me@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants