[Feature Request] Multi-node support for server, benchmark, and trace collection #30

## Summary

Magpie currently supports single-node vLLM/SGLang server launch, benchmarking, and trace collection. As inference workloads move toward larger models (e.g. DeepSeek-R1, GLM-5, MiniMax-M2.5, Qwen3.5) that no longer fit on a single node, we need first-class support for **multi-node serving and benchmarking** in Magpie.

## Use case

Downstream automation tools that wrap Magpie need to optimize models that require **TP / PP / EP across multiple nodes**. Today this requires either (a) bypassing Magpie and launching multi-node servers manually, or (b) restricting optimization to single-node configurations, which excludes the large-model class entirely.

## Required capabilities

### 1. Multi-node server

- Launch a vLLM / SGLang server that spans multiple nodes when `TP * PP * DP` exceeds the GPU count of a single node
- Coordinate worker discovery (head node + worker nodes), shared NFS / object store for model weights, and inter-node networking (RCCL/NCCL, RDMA where available)
- Reuse the existing `scheduler.ray` execution path where possible: a Ray cluster already supports multi-node execution, but the launch, readiness, and health-check logic for multi-node servers needs explicit coverage
- Configurable in `Magpie/config.yaml` (e.g. `scheduler.nodes`, `scheduler.head_node`, `scheduler.worker_nodes`) and surfaced in the benchmark config
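
To make the sizing rule above concrete, here is a minimal Python sketch of a pre-launch check. `SchedulerConfig`, `required_nodes`, and `validate` are hypothetical names that only mirror the proposed `scheduler.head_node` / `scheduler.worker_nodes` keys; they are not Magpie's actual schema, and the sketch assumes every node exposes the same number of GPUs.

```python
from dataclasses import dataclass, field


@dataclass
class SchedulerConfig:
    """Hypothetical mirror of the proposed `scheduler.*` keys (not Magpie's real schema)."""
    head_node: str
    worker_nodes: list[str] = field(default_factory=list)

    @property
    def nodes(self) -> int:
        # The head node plus every listed worker node.
        return 1 + len(self.worker_nodes)


def required_nodes(tp: int, pp: int, dp: int, gpus_per_node: int) -> int:
    """Smallest node count that can host TP * PP * DP GPU ranks.

    Assumes homogeneous nodes with `gpus_per_node` GPUs each.
    """
    if min(tp, pp, dp, gpus_per_node) < 1:
        raise ValueError("parallel sizes and gpus_per_node must be >= 1")
    world_size = tp * pp * dp
    # Integer ceiling division.
    return (world_size + gpus_per_node - 1) // gpus_per_node


def validate(cfg: SchedulerConfig, tp: int, pp: int, dp: int, gpus_per_node: int) -> None:
    """Fail fast before launch if the configured cluster is too small."""
    needed = required_nodes(tp, pp, dp, gpus_per_node)
    if cfg.nodes < needed:
        raise ValueError(
            f"layout needs {needed} nodes ({tp * pp * dp} GPUs), "
            f"but the config provides {cfg.nodes}"
        )


if __name__ == "__main__":
    cfg = SchedulerConfig(head_node="node0", worker_nodes=["node1"])
    validate(cfg, tp=8, pp=2, dp=1, gpus_per_node=8)  # 16 GPUs fit on 2 nodes
    print(required_nodes(tp=8, pp=2, dp=1, gpus_per_node=8))
```

Running a check like this before the server launch would turn an undersized cluster into an immediate configuration error rather than a hang during worker discovery.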

### 2. Multi-node benchmark

- Benchmark client able to target a multi-node server through a single endpoint (head node) with correct request distribution
- Per-node and aggregated metrics: throughput, TTFT, TPOT, GPU utilization
- Failure handling: if a worker node dies mid-run, surface a clear error instead of reporting silently degraded numbers
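
One possible shape for the aggregation and failure-handling requirements is sketched below. `NodeMetrics`, `WorkerLostError`, and `aggregate` are hypothetical names, and the sketch assumes each node reports its own numbers (for example, one DP replica per node). Summing throughput and pooling latency samples is an illustrative choice, not a decided design.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class NodeMetrics:
    node: str
    throughput_tok_s: float  # output tokens per second attributed to this node
    ttft_ms: list[float]     # per-request time to first token
    tpot_ms: list[float]     # per-request time per output token
    gpu_util_pct: float      # mean GPU utilization over the run


class WorkerLostError(RuntimeError):
    """Raised when an expected node produced no metrics report."""


def aggregate(expected_nodes: list[str], reports: list[NodeMetrics]) -> dict[str, float]:
    """Combine per-node reports, refusing to aggregate a partial cluster."""
    seen = {r.node for r in reports}
    missing = [n for n in expected_nodes if n not in seen]
    if missing:
        # Surface the failure instead of returning numbers from a degraded run.
        raise WorkerLostError(f"no metrics from nodes: {', '.join(missing)}")

    # Pool latency samples across nodes so every request carries equal weight.
    ttft = [x for r in reports for x in r.ttft_ms]
    tpot = [x for r in reports for x in r.tpot_ms]
    return {
        "throughput_tok_s": sum(r.throughput_tok_s for r in reports),
        "mean_ttft_ms": mean(ttft),
        "mean_tpot_ms": mean(tpot),
        "mean_gpu_util_pct": mean(r.gpu_util_pct for r in reports),
    }
```

The per-node `NodeMetrics` records can still be reported as-is alongside the aggregate, which covers the "per-node and aggregated" requirement with a single data structure.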

### 3. Multi-node trace collection

- Coordinate `torch.profiler` / `SGLANG_TORCH_PROFILER_DIR` / vLLM `--profiler-config` activation **simultaneously** across all nodes
- Collect per-rank trace files from every worker node back to the head node (or to shared NFS)
- Naming convention that preserves node + TP rank information, e.g. `node{N}-TP{R}.trace.json.gz`, so downstream tools (TraceLens) can identify per-node behavior
- Optional aggregation / merging step for cross-node analysis (e.g. comm overlap across nodes)
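
The naming convention above can be sketched as a pair of helpers that build and parse trace filenames, plus a completeness check for the collection step. The function names are hypothetical. The sketch also assumes that `N` and `R` are non-negative integers and that the caller knows which (node, rank) pairs to expect.

```python
import re
from typing import Iterable, NamedTuple

# Matches the proposed `node{N}-TP{R}.trace.json.gz` convention.
TRACE_RE = re.compile(r"^node(?P<node>\d+)-TP(?P<tp>\d+)\.trace\.json\.gz$")


class TraceId(NamedTuple):
    node: int
    tp_rank: int


def trace_name(node: int, tp_rank: int) -> str:
    """Build the filename for one rank's trace."""
    return f"node{node}-TP{tp_rank}.trace.json.gz"


def parse_trace_name(filename: str) -> TraceId:
    """Recover node and TP rank from a filename, rejecting anything else."""
    m = TRACE_RE.match(filename)
    if m is None:
        raise ValueError(f"not a node/TP trace filename: {filename!r}")
    return TraceId(int(m["node"]), int(m["tp"]))


def missing_traces(filenames: Iterable[str], expected: set[TraceId]) -> list[TraceId]:
    """Return the expected (node, rank) pairs for which no trace was collected."""
    found = {parse_trace_name(f) for f in filenames}
    return sorted(expected - found)
```

For example, `missing_traces(["node0-TP0.trace.json.gz"], {TraceId(0, 0), TraceId(1, 0)})` reports that the trace for node 1, rank 0 has not yet arrived on the head node or shared NFS.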

## Out of scope (for this issue)

- Auto-scaling the cluster size based on workload
- GUI for multi-node topology visualization
- Failover / restart of a dead worker mid-run

## Suggested approach

- Build on top of the existing `Remote Ray Cluster` execution environment
- Add multi-node launch helpers under `Magpie/scheduler/` (or wherever the Ray integration lives today)
- Update `examples/` with a multi-node benchmark config sample
- Document the new flow in `README.md` and `docs/`
