feat: vLLM-Omni backend (native omni: text/audio/vision) + chat UI #2527

Draft
ramkrishna2910 wants to merge 2 commits into main from feat/vllm-omni-backend

@ramkrishna2910

Contributor

Draft — backend fully validated end-to-end on gfx1151; UI typechecks but hasn't been visually run (see Validation). Depends on the vllm-omni* release from lemonade-sdk/vllm-rocm#23 (merged).

What

Adds vLLM-Omni as a first-class backend so omni / any-to-any multimodal models (Qwen2.5-Omni today; Cosmos3 later) run with ROCm acceleration on AMD gfx1151 (Strix Halo) — a single model doing text + audio + vision in, text + native voice out.

vLLM-Omni is a pure-Python layer on top of the same vLLM + PyTorch + Triton base, shipped as a separate vllm-omni* release artifact, so it gets its own recipe and version pin (see vllm-rocm#23).

Backend (54788cc9) — uses the new self-describing descriptor model

Per docs/dev/adding-a-backend.md — one folder + a few appends, no router/CLI/doc edits:

  • src/cpp/{include,server}/backends/vllm_omni/: descriptor + VLLMOmniServer (mirrors vllm). Launches vllm-omni-server serve <model> --omni --deploy-config <yaml> --served-model-name <name> --max-model-len <n>, resolves the per-model deploy config via the model's deploy_config extra key + get_resource_path, and forwards OpenAI-compatible chat requests. Native voice and vision ride through the chat body (audio output returns as a second choice).
  • CMakeLists.txt LEMON_BACKENDS += "vllm-omni|vllm_omni"
  • backend_versions.json pin vllm-omni0.23.0rc1-rocm7.14.0
  • server_models.json Qwen2.5-Omni-3B-vLLM-Omni
  • resources/omni_deploy/qwen2_5_omni_1gpu.yaml — single-GPU stage colocation (upstream defaults are multi-GPU)
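
For reviewers who want to see the launch command laid out, here is a minimal sketch of the argument vector described above. The real VLLMOmniServer is C++; this TypeScript version only illustrates flag order. The `OmniLaunchOptions` interface, the `buildOmniServeArgs` function, the placeholder model path, and the 4096 context length are all hypothetical; only the flags themselves come from this PR.

```typescript
// Hypothetical options type; field names are illustrative, not lemonade's actual types.
interface OmniLaunchOptions {
  model: string;            // model path or checkpoint id passed to `serve`
  deployConfigPath: string; // resolved path to the per-model deploy YAML
  servedModelName: string;  // name the server advertises to clients
  maxModelLen: number;      // context length limit
}

// Builds the arguments for `vllm-omni-server`, using only the flags named in this PR.
function buildOmniServeArgs(opts: OmniLaunchOptions): string[] {
  if (!Number.isInteger(opts.maxModelLen) || opts.maxModelLen <= 0) {
    throw new RangeError("maxModelLen must be a positive integer");
  }
  return [
    "serve", opts.model,
    "--omni",
    "--deploy-config", opts.deployConfigPath,
    "--served-model-name", opts.servedModelName,
    "--max-model-len", String(opts.maxModelLen),
  ];
}

// Example values: the model path and context length are placeholders (assumptions).
const exampleArgs = buildOmniServeArgs({
  model: "path/to/model",
  deployConfigPath: "resources/omni_deploy/qwen2_5_omni_1gpu.yaml",
  servedModelName: "Qwen2.5-Omni-3B-vLLM-Omni",
  maxModelLen: 4096,
});
```

Returning an array rather than a joined string avoids shell-quoting problems when a model path contains spaces.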

Chat UI (a423eab3)

  • ChatWindow derives isAudioOutput from a chat-speech label (mirrors chat-transcription).
  • LLMChatPanel: voice toggle + picker (Chelsie/Ethan, off by default — speech gen is slow); handleAudioChat sends a non-streaming modalities:[text,audio]+voice request and captures message.audio (handles vLLM-Omni's two-choice shape and OpenAI's single-choice) → rendered via the existing MessageAudio player. No new rendering code.
  • Model labels vision/chat-transcription/chat-speech/omni light up vision input, audio input, and native voice output via existing capability flags.
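
As a rough illustration of the request and response handling described for handleAudioChat, the sketch below builds a non-streaming body and pulls text and audio out of either response shape. It is not the PR's code: the type names, `buildAudioChatRequest`, and `extractAudioReply` are hypothetical. The top-level placement of `voice` and the `audio.data` base64 field are assumptions modeled on the description here and on OpenAI-style audio output.

```typescript
// Assumed, simplified response shapes; only the fields used below are modeled.
interface AudioPayload { data: string } // base64 audio; field name is an assumption
interface ChatChoice {
  message?: { content?: string | null; audio?: AudioPayload | null };
}
interface ChatResponse { choices: ChatChoice[] }

// Non-streaming request asking for both text and audio.
// Placing `voice` at the top level is an assumption, not confirmed by the PR text.
function buildAudioChatRequest(model: string, userText: string, voice: string) {
  return {
    model,
    stream: false,
    modalities: ["text", "audio"],
    voice,
    messages: [{ role: "user", content: userText }],
  };
}

// Scans every choice so both shapes work:
//  - two-choice: text in one choice, audio in another
//  - single-choice: text and audio on the same message
function extractAudioReply(resp: ChatResponse): { text: string; audioBase64: string | null } {
  let text = "";
  let audioBase64: string | null = null;
  for (const choice of resp.choices) {
    const msg = choice.message;
    if (!msg) continue;
    if (text === "" && typeof msg.content === "string" && msg.content.length > 0) {
      text = msg.content;
    }
    if (audioBase64 === null && msg.audio && msg.audio.data) {
      audioBase64 = msg.audio.data;
    }
  }
  return { text, audioBase64 };
}
```

Scanning all choices instead of indexing `choices[0]` or `choices[1]` keeps the handler independent of which backend produced the response.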

Validation

  • Backend: end-to-end on real gfx1151. lemond builds, registers the recipe, lists the model, loads it (launches the subprocess), and all four modalities pass: text, native voice out (24 kHz WAV), audio in, vision in — with the audio choice passed through lemonade's handler intact.
  • UI: tsc --noEmit clean. Reuses proven audio-player/artifact path. Not yet run in the app (needs a webview) — reviewers should eyeball rendering/UX.
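
One way to spot-check the 24 kHz WAV claim is to read the sample rate from the RIFF header of the returned audio. The sketch below assumes the audio bytes have already been decoded (for example from a base64 payload); `wavSampleRate` is a hypothetical helper, not part of this PR.

```typescript
// Reads the sample rate from a RIFF/WAVE byte buffer by locating the "fmt " chunk.
function wavSampleRate(bytes: Uint8Array): number {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  // Returns the 4-character ASCII tag starting at the given offset.
  const tag = (off: number): string =>
    String.fromCharCode(
      view.getUint8(off), view.getUint8(off + 1),
      view.getUint8(off + 2), view.getUint8(off + 3),
    );

  if (bytes.length < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  let off = 12; // first chunk follows the 12-byte RIFF header
  while (off + 8 <= bytes.length) {
    const size = view.getUint32(off + 4, true); // chunk sizes are little-endian
    if (tag(off) === "fmt ") {
      if (off + 16 > bytes.length) throw new Error("truncated fmt chunk");
      // fmt data layout: format (2 bytes), channels (2 bytes), sample rate (4 bytes)
      return view.getUint32(off + 12, true);
    }
    off += 8 + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error("fmt chunk not found");
}
```

Walking the chunk list rather than reading a fixed offset tolerates files that place other chunks before `fmt `.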

Notes for reviewers

  • UI uses emoji icons (🔊/🔈) + one inline style on the voice-control wrapper — swap for house icon components / CSS.
  • Omni currently generates longer audio than needed; this is cosmetic, and the likely fix is tuning the talker max_tokens in the deploy YAML.
  • gfx1151-only (the qualified/hardware-validated omni target).

🤖 Generated with Claude Code

github-actions[bot] and others added 2 commits July 1, 2026 17:23
Adds the vllm-omni recipe using the descriptor backend model: a folder
(vllm_omni/) with the descriptor + WrappedServer subclass, one CMake
LEMON_BACKENDS line, a backend_versions.json pin, a server_models.json entry,
and a bundled single-GPU deploy config.

The server launches the bundle's vllm-omni-server with
'serve <model> --omni --deploy-config <yaml>', resolves the per-model deploy
config via the model's extra 'deploy_config' key + get_resource_path, and
forwards OpenAI-compatible chat requests. Native voice / vision ride through the
chat body (audio output returns as a second choice).

Validated end-to-end on gfx1151: lemond builds + registers the recipe, the model
lists, loading through lemonade launches the subprocess, and all four modalities
work (text, native voice out, audio in, vision in) with the audio choice passed
through lemonade's handler intact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Surface vLLM-Omni's native audio output in the chat UI:
- ChatWindow derives isAudioOutput from a 'chat-speech' label (mirrors the
  'chat-transcription' audio-input flag) and passes it to LLMChatPanel.
- LLMChatPanel gains a voice toggle + voice picker (Chelsie/Ethan), off by
  default (speech gen is slow). When on, handleAudioChat sends a non-streaming
  request with modalities:[text,audio] + voice, and captures message.audio from
  the response (handles vLLM-Omni's two-choice shape and OpenAI's single-choice)
  into an audio artifact — rendered via the existing MessageAudio player.
- Qwen2.5-Omni model labels updated to vision/chat-transcription/chat-speech/omni
  so vision input, audio input, and native voice output all light up.

Reuses existing primitives (MessageAudio, buildFinalContent, input_audio/
image_url content types); no player/rendering changes. tsc --noEmit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
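
The label-to-capability mapping in the commit above can be pictured with a short sketch. Only `isAudioOutput` and the label strings come from the commit text; `ChatCapabilities`, `deriveCapabilities`, `isVision`, and `isAudioInput` are hypothetical names used for illustration.

```typescript
// Hypothetical capability flags; only isAudioOutput is named in the commit message.
interface ChatCapabilities {
  isVision: boolean;      // image input enabled
  isAudioInput: boolean;  // audio input enabled
  isAudioOutput: boolean; // native voice output enabled
}

// Derives UI capability flags from a model's labels.
function deriveCapabilities(labels: readonly string[]): ChatCapabilities {
  const has = (label: string): boolean => labels.indexOf(label) !== -1;
  return {
    isVision: has("vision"),
    isAudioInput: has("chat-transcription"),
    isAudioOutput: has("chat-speech"),
  };
}

// Labels as listed for the Qwen2.5-Omni entry in this PR.
const omniCapabilities = deriveCapabilities([
  "vision", "chat-transcription", "chat-speech", "omni",
]);
```

Keeping the mapping in one pure function makes it easy to unit-test without rendering the chat window.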
github-actions bot added the engine::vllm, area::api, and enhancement labels on Jul 2, 2026