feat: vLLM-Omni backend (native omni: text/audio/vision) + chat UI by ramkrishna2910 · Pull Request #2527 · lemonade-sdk/lemonade

ramkrishna2910 · 2026-07-02T03:22:49Z

Draft — backend fully validated end-to-end on gfx1151; UI typechecks but hasn't been visually run (see Validation). Depends on the vllm-omni* release from lemonade-sdk/vllm-rocm#23 (merged).

What

Adds vLLM-Omni as a first-class backend so omni / any-to-any multimodal models (Qwen2.5-Omni today; Cosmos3 later) run with ROCm acceleration on AMD gfx1151 (Strix Halo) — a single model doing text + audio + vision in, text + native voice out.

vLLM-Omni is a pure-Python layer on the same base vLLM+PyTorch+Triton, shipped as a separate vllm-omni* release artifact, so it gets its own recipe + pin (see vllm-rocm#23).

Backend (`54788cc9`) — uses the new self-describing descriptor model

Per docs/dev/adding-a-backend.md — one folder + a few appends, no router/CLI/doc edits:

src/cpp/{include,server}/backends/vllm_omni/: descriptor + VLLMOmniServer (mirrors vllm). Launches `vllm-omni-server serve <model> --omni --deploy-config <yaml> --served-model-name --max-model-len`, resolves the per-model deploy config via the model's deploy_config extra key + get_resource_path, and forwards chat. Native voice/vision ride through the chat body.
CMakeLists.txt: LEMON_BACKENDS += "vllm-omni|vllm_omni"
backend_versions.json: pin vllm-omni0.23.0rc1-rocm7.14.0
server_models.json: Qwen2.5-Omni-3B-vLLM-Omni
resources/omni_deploy/qwen2_5_omni_1gpu.yaml: single-GPU stage colocation (upstream defaults are multi-GPU)
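
For reviewers who want the launch line spelled out, here is a minimal sketch of how the argument vector could be assembled. It is written in TypeScript for illustration only; the real logic lives in the C++ VLLMOmniServer. The `OmniLaunchOptions` type and `buildOmniLaunchArgs` function are hypothetical names, and the values passed for `--served-model-name` and `--max-model-len` are assumptions. Only the flag names come from this PR.

```typescript
// Hypothetical options bag; not the actual C++ descriptor type.
interface OmniLaunchOptions {
  model: string;            // checkpoint path or id passed to `serve`
  deployConfigPath: string; // resolved path of the per-model deploy YAML
  servedModelName: string;  // name exposed over the OpenAI-compatible API
  maxModelLen: number;      // context length limit
}

// Builds the argv for `vllm-omni-server` using only the flags named in the PR.
function buildOmniLaunchArgs(opts: OmniLaunchOptions): string[] {
  if (Math.floor(opts.maxModelLen) !== opts.maxModelLen || opts.maxModelLen <= 0) {
    throw new Error("maxModelLen must be a positive integer");
  }
  return [
    "serve", opts.model,
    "--omni",
    "--deploy-config", opts.deployConfigPath,
    "--served-model-name", opts.servedModelName,
    "--max-model-len", String(opts.maxModelLen),
  ];
}
```

Keeping the arguments as an array rather than a single string avoids shell-quoting problems when the model path or the YAML path contains spaces.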

Chat UI (`a423eab3`)

ChatWindow derives isAudioOutput from a chat-speech label (mirrors chat-transcription).
LLMChatPanel: voice toggle + picker (Chelsie/Ethan, off by default — speech gen is slow); handleAudioChat sends a non-streaming modalities:[text,audio]+voice request and captures message.audio (handles vLLM-Omni's two-choice shape and OpenAI's single-choice) → rendered via the existing MessageAudio player. No new rendering code.
Model labels vision/chat-transcription/chat-speech/omni light up vision input, audio input, and native voice output via existing capability flags.
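
The UI flow above can be sketched roughly as follows. This is an illustrative approximation, not the code in the PR. The function names (`deriveCapabilities`, `buildAudioChatRequest`, `extractTextAndAudio`), the `isVision`/`isAudioInput` flag names, and the placement of `voice` as a top-level request field are assumptions. The `message.audio.data` field follows OpenAI's documented audio response shape.

```typescript
type Voice = "Chelsie" | "Ethan";

interface ChatMessage { role: "system" | "user" | "assistant"; content: string; }

interface AudioChatRequest {
  model: string;
  messages: ChatMessage[];
  stream: false;                 // audio requests are non-streaming
  modalities: ["text", "audio"];
  voice: Voice;                  // assumed top-level field
}

interface AudioPayload { data: string; } // base64-encoded audio
interface ChatChoice { message: { content?: string | null; audio?: AudioPayload | null }; }
interface ChatResponse { choices: ChatChoice[]; }

// Maps model labels to capability flags (flag names other than isAudioOutput are hypothetical).
function deriveCapabilities(labels: string[]) {
  return {
    isVision: labels.indexOf("vision") !== -1,
    isAudioInput: labels.indexOf("chat-transcription") !== -1,
    isAudioOutput: labels.indexOf("chat-speech") !== -1,
  };
}

function buildAudioChatRequest(model: string, messages: ChatMessage[], voice: Voice): AudioChatRequest {
  return { model, messages, stream: false, modalities: ["text", "audio"], voice };
}

// Scans every choice so that both response shapes work: text and audio in
// separate choices (two-choice) or both on one message (single-choice).
function extractTextAndAudio(resp: ChatResponse): { text: string; audio: AudioPayload | null } {
  let text = "";
  let audio: AudioPayload | null = null;
  for (const choice of resp.choices) {
    const msg = choice.message;
    if (!text && typeof msg.content === "string" && msg.content.length > 0) text = msg.content;
    if (!audio && msg.audio) audio = msg.audio;
  }
  return { text, audio };
}
```

Scanning all choices instead of hard-coding `choices[1]` means the same handler accepts either response shape without needing to know which server produced it.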

Validation

Backend: end-to-end on real gfx1151. lemond builds, registers the recipe, lists the model, loads it (launches the subprocess), and all four modalities pass: text, native voice out (24 kHz WAV), audio in, vision in — with the audio choice passed through lemonade's handler intact.
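
One way a reviewer might spot-check the "24 kHz WAV" claim is to read the sample rate out of the returned audio bytes. The sketch below is not part of the PR. `readWavSampleRate` is a hypothetical helper that assumes the audio has already been base64-decoded into a `Uint8Array` and that it is a standard RIFF/WAVE file.

```typescript
// Returns the sample rate stored in the "fmt " chunk of a RIFF/WAVE byte buffer.
function readWavSampleRate(bytes: Uint8Array): number {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const tag = (off: number): string =>
    String.fromCharCode(bytes[off], bytes[off + 1], bytes[off + 2], bytes[off + 3]);

  if (bytes.length < 12 || tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("not a RIFF/WAVE file");
  }
  // Walk the chunk list; "fmt " is usually first but is not guaranteed to be.
  let off = 12;
  while (off + 8 <= bytes.length) {
    const id = tag(off);
    const size = view.getUint32(off + 4, true); // little-endian chunk size
    if (id === "fmt ") {
      if (off + 16 > bytes.length) throw new Error("truncated fmt chunk");
      // Chunk data: format tag (2 bytes), channel count (2 bytes), sample rate (4 bytes).
      return view.getUint32(off + 12, true);
    }
    off += 8 + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error("fmt chunk not found");
}
```

For the voice output described above, the expected return value would be 24000.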
UI: tsc --noEmit clean. Reuses proven audio-player/artifact path. Not yet run in the app (needs a webview) — reviewers should eyeball rendering/UX.

Notes for reviewers

UI uses emoji icons (🔊/🔈) + one inline style on the voice-control wrapper — swap for house icon components / CSS.
Omni over-generates audio length today (talker max_tokens tuning in the deploy YAML) — cosmetic.
gfx1151-only (the qualified/hardware-validated omni target).

🤖 Generated with Claude Code

Adds the vllm-omni recipe using the descriptor backend model: a folder (vllm_omni/) with the descriptor + WrappedServer subclass, one CMake LEMON_BACKENDS line, a backend_versions.json pin, a server_models.json entry, and a bundled single-GPU deploy config.

The server launches the bundle's vllm-omni-server with 'serve <model> --omni --deploy-config <yaml>', resolves the per-model deploy config via the model's extra 'deploy_config' key + get_resource_path, and forwards OpenAI-compatible chat requests. Native voice / vision ride through the chat body (audio output returns as a second choice).

Validated end-to-end on gfx1151: lemond builds + registers the recipe, the model lists, loading through lemonade launches the subprocess, and all four modalities work (text, native voice out, audio in, vision in) with the audio choice passed through lemonade's handler intact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Surface vLLM-Omni's native audio output in the chat UI:

- ChatWindow derives isAudioOutput from a 'chat-speech' label (mirrors the 'chat-transcription' audio-input flag) and passes it to LLMChatPanel.
- LLMChatPanel gains a voice toggle + voice picker (Chelsie/Ethan), off by default (speech gen is slow). When on, handleAudioChat sends a non-streaming request with modalities:[text,audio] + voice, and captures message.audio from the response (handles vLLM-Omni's two-choice shape and OpenAI's single-choice) into an audio artifact, rendered via the existing MessageAudio player.
- Qwen2.5-Omni model labels updated to vision/chat-transcription/chat-speech/omni so vision input, audio input, and native voice output all light up.

Reuses existing primitives (MessageAudio, buildFinalContent, input_audio/image_url content types); no player/rendering changes. tsc --noEmit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions[bot] and others added 2 commits July 1, 2026 17:23

github-actions bot added the labels engine::vllm (vLLM backend; experimental, ROCm Linux, Strix Halo), area::api (HTTP REST API surface and route handlers), and enhancement (New feature or request) on Jul 2, 2026

ramkrishna2910 wants to merge 2 commits into main from feat/vllm-omni-backend
