Add a model-specific OpenAI-compatible serving launcher for Qwen3.5-MoE.
The Python process stays as the control plane for HTTP, chat templating,
request validation, session affinity, and Qwen tool parsing; model
execution stays in the C++ qwen3_5_moe_worker process through the
generic examples/llm_server JSONL protocol.
This keeps Qwen-specific serving glue in examples/models/qwen3_5_moe
while reusing the generic server runtime. It also keeps the existing C++
runner path intact: the serving entrypoint is a wrapper around the
worker/engine path, not a replacement for main.cpp.
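As a rough illustration of the control-plane/worker split described
above, the sketch below frames requests as one JSON object per line. The
field names (id, session, tokens, max_new_tokens) and both helper
functions are hypothetical; the actual examples/llm_server JSONL schema
is not shown in this message and may differ.

```python
import io
import json


def encode_request(request_id, session_id, prompt_tokens, max_new_tokens):
    """Serialize one request as a single JSONL line.

    All field names here are assumptions made for illustration only.
    """
    msg = {
        "id": request_id,
        "session": session_id,
        "tokens": prompt_tokens,
        "max_new_tokens": max_new_tokens,
    }
    # json.dumps never emits raw newlines, so one message stays on one line.
    return json.dumps(msg, separators=(",", ":")) + "\n"


def decode_events(stream):
    """Yield one dict per non-empty line read from a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for the worker's stdout pipe: two framed messages, one blank line.
fake_pipe = io.StringIO(
    encode_request("r1", "s1", [1, 2, 3], 16)
    + "\n"
    + encode_request("r2", "s2", [4, 5], 8)
)
events = list(decode_events(fake_pipe))
```

In a real launcher the stream would be the worker subprocess's stdout
rather than an in-memory buffer; line framing is what lets the Python
side stay free of model-execution code.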
Add CUDA e2e serving smoke coverage for the Qwen artifact job. The
test-model-cuda-e2e job runs in a fresh environment after downloading
exported artifacts, so install ExecuTorch in editable mode before
invoking python -m executorch.examples.models.qwen3_5_moe.serve.
Validation:
- python -m pytest -q examples/models/qwen3_5_moe/test_serve.py:
8 passed
- python -m py_compile examples/models/qwen3_5_moe/serve.py
examples/models/qwen3_5_moe/test_serve.py
- bash -n .ci/scripts/test_model_e2e.sh
- Qwen BFCL serving slice with Pi-style session-affinity headers: 50/56
generated-slice pass rate (89.29%); parallel, parallel_multiple, and
irrelevance categories passed 100%; live_multiple was the weakest slice
at 4/6.
- Pi integration uses the OpenAI-compatible endpoint plus session_id /
x-session-affinity headers. Subagent fanout must run with --max-sessions
set at least as high as the number of concurrent named sessions;
otherwise the expected behavior is a 429 capacity_exhausted response
rather than silently duplicating model weights.
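
The capacity behavior in the last item can be sketched as a bounded
session table. This is a minimal illustration under stated assumptions,
not the code in serve.py: the SessionTable class, its acquire method,
and the response body shape are hypothetical; only the 429 status, the
capacity_exhausted code, and the --max-sessions limit come from the
notes above.

```python
from http import HTTPStatus


class SessionTable:
    """Map named sessions to a fixed number of worker slots (hypothetical)."""

    def __init__(self, max_sessions):
        self.max_sessions = max_sessions
        self._slots = {}  # session_id -> slot index

    def acquire(self, session_id):
        """Return (status, body) for a request carrying this session id."""
        if session_id in self._slots:
            # A known session is pinned to the slot it already holds.
            return HTTPStatus.OK, {"slot": self._slots[session_id]}
        if len(self._slots) >= self.max_sessions:
            # Refuse new sessions instead of loading another weight copy.
            return (
                HTTPStatus.TOO_MANY_REQUESTS,
                {"error": {"code": "capacity_exhausted"}},
            )
        slot = len(self._slots)
        self._slots[session_id] = slot
        return HTTPStatus.OK, {"slot": slot}


table = SessionTable(max_sessions=2)
first = table.acquire("agent-a")
second = table.acquire("agent-b")
third = table.acquire("agent-c")   # over capacity
again = table.acquire("agent-a")   # existing session is still served
```

With two slots, a third named session is rejected while the first two
keep their affinity, which is why fanout of N concurrent subagents needs
--max-sessions of at least N.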