This file provides guidance to agent driven code reviews when working with this repository.
Lemonade is a local LLM server providing GPU and NPU acceleration for running large language models on consumer hardware. It exposes OpenAI-compatible, Ollama-compatible, and Anthropic-compatible REST APIs, plus a WebSocket Realtime API. It supports multiple backends: llama.cpp, FastFlowLM, RyzenAI, whisper.cpp, stable-diffusion.cpp, and Kokoro TTS.
- lemond — Pure HTTP server. Handles REST API, routes requests to backends, manages model loading/unloading. Configured via
config.jsonin the lemonade cache directory. CLI args:[cache_dir] [--port PORT] [--host HOST]. - lemonade — CLI client (
src/cpp/cli/). Commands:list,pull,delete,run,status,logs,launch,backends,scan, etc. Communicates with router via HTTP. Discovers running server via UDP beacon. - LemonadeServer.exe (Windows) — SUBSYSTEM:WINDOWS GUI app that embeds
lemondand shows a system tray icon. Auto-starts via Windows startup folder. - lemonade-tray (macOS/Linux) — Lightweight tray client that connects to a running
lemond. Platform code insrc/cpp/tray/platform/.
WrappedServer (src/cpp/include/lemon/wrapped_server.h) is the abstract base class. Each backend inherits it and implements load(), unload(), chat_completion(), completion(), responses(), and optionally install() / download_model(). Backends run as subprocesses — Lemonade forwards HTTP requests to them.
| Backend | Class | Capabilities | Device | Purpose |
|---|---|---|---|---|
| llama.cpp | LlamaCppServer |
Completion, Embeddings, Reranking | GPU | LLM inference — CPU/GPU (Vulkan, ROCm, Metal) |
| FastFlowLM | FastFlowLMServer |
Completion, Embeddings, Reranking, Audio | NPU | NPU inference (multi-modal: LLM, ASR, embeddings, reranking) |
| RyzenAI | RyzenAIServer |
Completion | NPU | Hybrid NPU inference |
| vLLM | VLLMServer |
Completion | GPU | LLM inference — ROCm on AMD iGPU/dGPU (Linux). Experimental, validated only on gfx1151 (Strix Halo). |
| whisper.cpp | WhisperServer |
Audio | CPU | Audio transcription |
| stable-diffusion.cpp | SdServer |
Image | CPU | Image generation, editing, variations |
| Kokoro | KokoroServer |
TTS | CPU | Text-to-speech |
Capability interfaces: ICompletionServer, IEmbeddingsServer, IRerankingServer, ITranscriptionServer, IImageServer, ITextToSpeechServer (defined in server_capabilities.h). Use supports_capability<T>(server) template for runtime checks.
Router (src/cpp/server/router.cpp) manages a vector of WrappedServer instances. Routes requests based on model recipe, maintains LRU caches per model type (LLM, embedding, reranking, audio, image, TTS — see model_types.h), and enforces NPU exclusivity. Configurable via --max-loaded-models. On non-file-not-found errors, the router uses a "nuclear option" — evicts all models and retries the load.
ModelManager (src/cpp/server/model_manager.cpp) loads the registry from src/cpp/resources/server_models.json. Each model has "recipes" defining which backend and config to use. Backend versions are pinned in src/cpp/resources/backend_versions.json. Models download from Hugging Face.
All core endpoints are registered under 4 path prefixes:
/api/v0/— Legacy/api/v1/— Current/v0/— Legacy short/v1/— OpenAI SDK / LiteLLM compatibility
Core endpoints: chat/completions, completions, embeddings, reranking, models, models/{id}, health, pull, load, unload, delete, params, install, uninstall, audio/transcriptions, audio/speech, images/generations, images/edits, images/variations, responses, stats, system-info, system-stats, log-level, logs/stream
Ollama-compatible endpoints (under /api/ without version prefix): chat, generate, tags, show, delete, pull, embed, embeddings, ps, version
Anthropic-compatible endpoint: POST /api/messages — supports message completion, tool use, and SSE streaming.
WebSocket Realtime API: OpenAI-compatible Realtime protocol for real-time audio transcription. Binds to an OS-assigned port (9000+), exposed via the websocket_port field in the /health endpoint response.
Internal endpoints: POST /internal/shutdown
Optional API key auth via LEMONADE_API_KEY env var (regular API endpoints) or LEMONADE_ADMIN_API_KEY env var (full access including internal endpoints). Clients prefer LEMONADE_ADMIN_API_KEY if set. CORS enabled on all routes.
- Tauri app — React 19 + TypeScript in
src/app/, Rust host insrc/app/src-tauri/. Uses native OS webview (WebView2 on Windows, WKWebView on macOS, webkit2gtk on Linux). Pure CSS (dark theme), context-based state. Key components:ChatWindow.tsx,ModelManager.tsx,DownloadManager.tsx,BackendManager.tsx. Feature panels: LLMChat, ImageGeneration, Transcription, TTS, Embedding, Reranking. The renderer keeps itswindow.apicontract viasrc/app/src/renderer/tauriShim.ts, which maps each call to a Tauriinvoke()or eventlisten(). - Web app — Browser-only version in
src/web-app/. Reuses the shared renderer fromsrc/app/src/via webpack'sentry/templatepaths (no OS symlinks); theBuildWebApp.cmakescript stages both trees side-by-side underbuild/web-app-staging/for the actual webpack build. Built via CMakeBUILD_WEB_APP=ON. Served at/app. A mockwindow.apiis injected by the C++ server (src/cpp/server/server.cpp) so the shared renderer works unchanged in the browser.
C++ (FetchContent): cpp-httplib, nlohmann/json, CLI11, libcurl, zstd, libwebsockets, brotli (macOS). Platform SSL: Schannel (Windows), SecureTransport (macOS), OpenSSL (Linux).
Desktop app: Tauri v2 (Rust), React 19, TypeScript 5.3, Webpack 5, markdown-it, highlight.js, katex. Rust crates: tauri, tauri-plugin-{opener,clipboard-manager,single-instance,deep-link}, tokio, reqwest, serde.
CMakeLists.txt is at the repository root. Build uses CMake presets — run the setup script first, then build with --preset.
# 1. Setup (configures build directory and installs deps)
./setup.sh # Linux / macOS
./setup.ps1 # Windows (PowerShell)
# 2. Build C++ server
cmake --build --preset default # Linux / macOS (Ninja)
cmake --build --preset windows # Windows (Visual Studio 2022)
cmake --build --preset vs18 # Windows (Visual Studio 2026)
# 3. Tauri desktop app (optional, requires Node.js 20+ and Rust via rustup)
cmake --build --preset default --target tauri-app # Linux / macOS
cmake --build --preset windows --target tauri-app # Windows (VS 2022)
cmake --build --preset vs18 --target tauri-app # Windows (VS 2026)
# 4. Web app (auto-built on all platforms)
cmake --build --preset default --target web-app # Linux / macOS
cmake --build --preset windows --target web-app # Windows
# 5. Windows MSI installer (WiX 5.0+ required)
cmake --build --preset windows --target wix_installer_minimal # server + web-app
cmake --build --preset windows --target wix_installer_full # server + Tauri app + web-app
# 6. macOS signed installer
cmake --build --preset default --target package-macos
# 7. Linux .deb / .rpm
cd build && cpack # .deb
cd build && cpack -G RPM # .rpmCMake presets: default (Ninja, Release), windows (VS 2022), vs18 (VS 2026), debug (Ninja, Debug).
CMake options: BUILD_WEB_APP (ON by default on all platforms), BUILD_TAURI_APP (Linux only, include Tauri desktop app in deb), LEMONADE_SYSTEMD_UNIT_NAME (default: lemond.service).
Integration tests in Python against a live server. Tests auto-discover the lemonade CLI binary from the build directory; use --cli-binary to override.
pip install -r test/requirements.txt
# CLI tests (no inference backend needed)
python test/server_cli2.py
# Endpoint tests (no inference backend needed)
python test/server_endpoints.py
# LLM tests (specify wrapped server and backend)
python test/server_llm.py --wrapped-server llamacpp --backend vulkan
# Audio transcription tests
python test/server_whisper.py
# Image generation tests (slow)
python test/server_sd.pyTest utilities in test/utils/ with server_base.py as the base class. Test dependencies include requests, httpx, openai, huggingface_hub, psutil, numpy, websockets, and ollama.
- C++17,
lemon::namespace snake_casefor functions/variables,CamelCasefor classes/types- 4-space indent,
#pragma oncefor headers - Keep
#includedirectives in alphabetical order within each include block - Platform guards:
#ifdef _WIN32,#ifdef __APPLE__,#ifdef __linux__
- Black formatting (v26.1.0, enforced in CI)
- Pylint with
.pylintrc - Pre-commit hooks: trailing-whitespace, end-of-file-fixer, check-yaml, check-added-large-files
- React 19, pure CSS (dark theme), context-based state
- UI/frontend changes are handled by core maintainers only
| File | Purpose |
|---|---|
CMakeLists.txt |
Root build config (version, deps, targets) |
src/cpp/server/server.cpp |
HTTP route registration and all handlers |
src/cpp/server/router.cpp |
Request routing and multi-model orchestration |
src/cpp/server/model_manager.cpp |
Model registry, downloads, recipe resolution |
src/cpp/include/lemon/wrapped_server.h |
Backend abstract base class |
src/cpp/include/lemon/server_capabilities.h |
Backend capability interfaces |
src/cpp/resources/server_models.json |
Model registry |
src/cpp/resources/backend_versions.json |
Backend version pins |
src/cpp/server/anthropic_api.cpp |
Anthropic API compatibility |
src/cpp/server/ollama_api.cpp |
Ollama API compatibility |
src/cpp/include/lemon/websocket_server.h |
WebSocket Realtime API server |
src/cpp/include/lemon/model_types.h |
Model type and device type enums |
src/cpp/include/lemon/config_file.h |
config.json load/save/migrate |
src/cpp/include/lemon/recipe_options.h |
Per-recipe JSON configuration |
src/cpp/tray/tray_app.cpp |
Tray application UI and logic |
src/app/src/renderer/ModelManager.tsx |
Model management UI |
src/app/src/renderer/ChatWindow.tsx |
Chat interface |
These MUST be maintained in all changes:
- Quad-prefix registration — Every new endpoint MUST be registered under
/api/v0/,/api/v1/,/v0/, AND/v1/. - NPU exclusivity — Exclusive-NPU recipes (
ryzenai-llm,whispercppon NPU) evict ALL other NPU models before loading. FastFlowLM (flm) can coexist with other FLM types (max 1 per FLM type) but not with exclusive-NPU recipes. - WrappedServer contract — New backends MUST implement all core virtual methods:
load(),unload(),chat_completion(),completion(),responses(). - Subprocess model — Backends run as subprocesses (llama-server, whisper-server, sd-server, koko, flm, ryzenai-server). They must NOT run in-process.
- Recipe integrity — Changes to
server_models.jsonmust have valid recipes referencing backends inbackend_versions.json. - Cross-platform — Code must compile on Windows (MSVC), Linux (GCC/Clang), macOS (AppleClang). Platform-specific code must use
#ifdefguards. - No hardcoded paths — Use path utilities. Windows/Linux/macOS paths differ.
- Thread safety — Router serves concurrent HTTP requests. Shared state must be properly guarded.
- Ollama compatibility — Changes to model listing or management must not break
/api/*Ollama endpoints. - API key passthrough — When
LEMONADE_API_KEYis set, all API routes must enforce authentication. - Many-clients-one-server topology — A single
lemondcan be driven by multiple desktop/tray/CLI clients, potentially on different machines. Per-client state (settings, layout, zoom, base URL, API key) MUST live locally in the client, never inlemond. Do not moveapp_settings.jsonbehind an HTTP endpoint. - Web-app dependencies constrained by Debian native packaging —
src/web-app/package.jsonis kept separate fromsrc/app/package.jsonbecause the native Debian package (lemonade-server.deb) must build using only npm modules available in Debian's/usr/share/nodejs(seeUSE_SYSTEM_NODEJS_MODULESinsrc/web-app/webpack.config.js). The old Electron app depended on packages Debian does not ship. Do NOT consolidate the twopackage.jsonfiles — the split is required for reproducible distro packaging. - Desktop app is on-demand;
lemondruns independently — On Windows,LemonadeServer.exe(which embedslemond+ tray icon) is the always-on process, auto-started via the Windows startup folder. The Tauri desktop app (lemonade-app.exe) is opened on demand when the user wants the UI and must not be added to startup. The desktop app must not embed or managelemond's lifecycle — it discovers the already-running server (UDP beacon for local, explicit base URL for remote) and speaks to it over HTTP.
- Open an Issue before submitting major PRs
- UI/frontend changes are handled by core maintainers only
- Python formatting with Black is required
- PRs trigger CI for linting, formatting, and integration tests