AGENTS.md

This file provides guidance to agent driven code reviews when working with this repository.

Project Overview

Lemonade is a local LLM server providing GPU and NPU acceleration for running large language models on consumer hardware. It exposes OpenAI-compatible, Ollama-compatible, and Anthropic-compatible REST APIs, plus a WebSocket Realtime API. It supports multiple backends: llama.cpp, FastFlowLM, RyzenAI, whisper.cpp, stable-diffusion.cpp, and Kokoro TTS.

Architecture

Executables

lemond — Pure HTTP server. Handles REST API, routes requests to backends, manages model loading/unloading. Configured via config.json in the lemonade cache directory. CLI args: [cache_dir] [--port PORT] [--host HOST].
lemonade — CLI client (src/cpp/cli/). Commands: list, pull, delete, run, status, logs, launch, backends, scan, etc. Communicates with router via HTTP. Discovers running server via UDP beacon.
LemonadeServer.exe (Windows) — SUBSYSTEM:WINDOWS GUI app that embeds lemond and shows a system tray icon. Auto-starts via Windows startup folder.
lemonade-tray (macOS/Linux) — Lightweight tray client that connects to a running lemond. Platform code in src/cpp/tray/platform/.

Backend Abstraction

WrappedServer (src/cpp/include/lemon/wrapped_server.h) is the abstract base class. Each backend inherits it and implements load(), unload(), chat_completion(), completion(), responses(), and optionally install() / download_model(). Backends run as subprocesses — Lemonade forwards HTTP requests to them.

Backend	Class	Capabilities	Device	Purpose
llama.cpp	`LlamaCppServer`	Completion, Embeddings, Reranking	GPU	LLM inference — CPU/GPU (Vulkan, ROCm, Metal)
FastFlowLM	`FastFlowLMServer`	Completion, Embeddings, Reranking, Audio	NPU	NPU inference (multi-modal: LLM, ASR, embeddings, reranking)
RyzenAI	`RyzenAIServer`	Completion	NPU	Hybrid NPU inference
vLLM	`VLLMServer`	Completion	GPU	LLM inference — ROCm on AMD iGPU/dGPU (Linux). Experimental, validated only on gfx1151 (Strix Halo).
whisper.cpp	`WhisperServer`	Audio	CPU	Audio transcription
stable-diffusion.cpp	`SdServer`	Image	CPU	Image generation, editing, variations
Kokoro	`KokoroServer`	TTS	CPU	Text-to-speech

Capability interfaces: ICompletionServer, IEmbeddingsServer, IRerankingServer, ITranscriptionServer, IImageServer, ITextToSpeechServer (defined in server_capabilities.h). Use supports_capability<T>(server) template for runtime checks.

Router & Multi-Model Support

Router (src/cpp/server/router.cpp) manages a vector of WrappedServer instances. Routes requests based on model recipe, maintains LRU caches per model type (LLM, embedding, reranking, audio, image, TTS — see model_types.h), and enforces NPU exclusivity. Configurable via --max-loaded-models. On non-file-not-found errors, the router uses a "nuclear option" — evicts all models and retries the load.

Model Manager & Recipe System

ModelManager (src/cpp/server/model_manager.cpp) loads the registry from src/cpp/resources/server_models.json. Each model has "recipes" defining which backend and config to use. Backend versions are pinned in src/cpp/resources/backend_versions.json. Models download from Hugging Face.

API Routes

All core endpoints are registered under 4 path prefixes:

/api/v0/ — Legacy
/api/v1/ — Current
/v0/ — Legacy short
/v1/ — OpenAI SDK / LiteLLM compatibility

Core endpoints: chat/completions, completions, embeddings, reranking, models, models/{id}, health, pull, load, unload, delete, params, install, uninstall, audio/transcriptions, audio/speech, images/generations, images/edits, images/variations, responses, stats, system-info, system-stats, log-level, logs/stream

Ollama-compatible endpoints (under /api/ without version prefix): chat, generate, tags, show, delete, pull, embed, embeddings, ps, version

Anthropic-compatible endpoint: POST /api/messages — supports message completion, tool use, and SSE streaming.

WebSocket Realtime API: OpenAI-compatible Realtime protocol for real-time audio transcription. Binds to an OS-assigned port (9000+), exposed via the websocket_port field in the /health endpoint response.

Internal endpoints: POST /internal/shutdown

Optional API key auth via LEMONADE_API_KEY env var (regular API endpoints) or LEMONADE_ADMIN_API_KEY env var (full access including internal endpoints). Clients prefer LEMONADE_ADMIN_API_KEY if set. CORS enabled on all routes.

Desktop & Web App

Tauri app — React 19 + TypeScript in src/app/, Rust host in src/app/src-tauri/. Uses native OS webview (WebView2 on Windows, WKWebView on macOS, webkit2gtk on Linux). Pure CSS (dark theme), context-based state. Key components: ChatWindow.tsx, ModelManager.tsx, DownloadManager.tsx, BackendManager.tsx. Feature panels: LLMChat, ImageGeneration, Transcription, TTS, Embedding, Reranking. The renderer keeps its window.api contract via src/app/src/renderer/tauriShim.ts, which maps each call to a Tauri invoke() or event listen().
Web app — Browser-only version in src/web-app/. Reuses the shared renderer from src/app/src/ via webpack's entry/template paths (no OS symlinks); the BuildWebApp.cmake script stages both trees side-by-side under build/web-app-staging/ for the actual webpack build. Built via CMake BUILD_WEB_APP=ON. Served at /app. A mock window.api is injected by the C++ server (src/cpp/server/server.cpp) so the shared renderer works unchanged in the browser.

Key Dependencies

C++ (FetchContent): cpp-httplib, nlohmann/json, CLI11, libcurl, zstd, libwebsockets, brotli (macOS). Platform SSL: Schannel (Windows), SecureTransport (macOS), OpenSSL (Linux).

Desktop app: Tauri v2 (Rust), React 19, TypeScript 5.3, Webpack 5, markdown-it, highlight.js, katex. Rust crates: tauri, tauri-plugin-{opener,clipboard-manager,single-instance,deep-link}, tokio, reqwest, serde.

Build Commands

CMakeLists.txt is at the repository root. Build uses CMake presets — run the setup script first, then build with --preset.

# 1. Setup (configures build directory and installs deps)
./setup.sh          # Linux / macOS
./setup.ps1         # Windows (PowerShell)

# 2. Build C++ server
cmake --build --preset default          # Linux / macOS (Ninja)
cmake --build --preset windows          # Windows (Visual Studio 2022)
cmake --build --preset vs18             # Windows (Visual Studio 2026)

# 3. Tauri desktop app (optional, requires Node.js 20+ and Rust via rustup)
cmake --build --preset default --target tauri-app    # Linux / macOS
cmake --build --preset windows --target tauri-app    # Windows (VS 2022)
cmake --build --preset vs18 --target tauri-app       # Windows (VS 2026)

# 4. Web app (auto-built on all platforms)
cmake --build --preset default --target web-app         # Linux / macOS
cmake --build --preset windows --target web-app         # Windows

# 5. Windows MSI installer (WiX 5.0+ required)
cmake --build --preset windows --target wix_installer_minimal  # server + web-app
cmake --build --preset windows --target wix_installer_full     # server + Tauri app + web-app

# 6. macOS signed installer
cmake --build --preset default --target package-macos

# 7. Linux .deb / .rpm
cd build && cpack            # .deb
cd build && cpack -G RPM     # .rpm

CMake presets: default (Ninja, Release), windows (VS 2022), vs18 (VS 2026), debug (Ninja, Debug).

CMake options: BUILD_WEB_APP (ON by default on all platforms), BUILD_TAURI_APP (Linux only, include Tauri desktop app in deb), LEMONADE_SYSTEMD_UNIT_NAME (default: lemond.service).

Testing

Integration tests in Python against a live server. Tests auto-discover the lemonade CLI binary from the build directory; use --cli-binary to override.

pip install -r test/requirements.txt

# CLI tests (no inference backend needed)
python test/server_cli2.py

# Endpoint tests (no inference backend needed)
python test/server_endpoints.py

# LLM tests (specify wrapped server and backend)
python test/server_llm.py --wrapped-server llamacpp --backend vulkan

# Audio transcription tests
python test/server_whisper.py

# Image generation tests (slow)
python test/server_sd.py

Test utilities in test/utils/ with server_base.py as the base class. Test dependencies include requests, httpx, openai, huggingface_hub, psutil, numpy, websockets, and ollama.

Code Style

C++

C++17, lemon:: namespace
snake_case for functions/variables, CamelCase for classes/types
4-space indent, #pragma once for headers
Keep #include directives in alphabetical order within each include block
Platform guards: #ifdef _WIN32, #ifdef __APPLE__, #ifdef __linux__

Python

Black formatting (v26.1.0, enforced in CI)
Pylint with .pylintrc
Pre-commit hooks: trailing-whitespace, end-of-file-fixer, check-yaml, check-added-large-files

TypeScript/React

React 19, pure CSS (dark theme), context-based state
UI/frontend changes are handled by core maintainers only

Key Files

File	Purpose
`CMakeLists.txt`	Root build config (version, deps, targets)
`src/cpp/server/server.cpp`	HTTP route registration and all handlers
`src/cpp/server/router.cpp`	Request routing and multi-model orchestration
`src/cpp/server/model_manager.cpp`	Model registry, downloads, recipe resolution
`src/cpp/include/lemon/wrapped_server.h`	Backend abstract base class
`src/cpp/include/lemon/server_capabilities.h`	Backend capability interfaces
`src/cpp/resources/server_models.json`	Model registry
`src/cpp/resources/backend_versions.json`	Backend version pins
`src/cpp/server/anthropic_api.cpp`	Anthropic API compatibility
`src/cpp/server/ollama_api.cpp`	Ollama API compatibility
`src/cpp/include/lemon/websocket_server.h`	WebSocket Realtime API server
`src/cpp/include/lemon/model_types.h`	Model type and device type enums
`src/cpp/include/lemon/config_file.h`	config.json load/save/migrate
`src/cpp/include/lemon/recipe_options.h`	Per-recipe JSON configuration
`src/cpp/tray/tray_app.cpp`	Tray application UI and logic
`src/app/src/renderer/ModelManager.tsx`	Model management UI
`src/app/src/renderer/ChatWindow.tsx`	Chat interface

Critical Invariants

These MUST be maintained in all changes:

Quad-prefix registration — Every new endpoint MUST be registered under /api/v0/, /api/v1/, /v0/, AND /v1/.
NPU exclusivity — Exclusive-NPU recipes (ryzenai-llm, whispercpp on NPU) evict ALL other NPU models before loading. FastFlowLM (flm) can coexist with other FLM types (max 1 per FLM type) but not with exclusive-NPU recipes.
WrappedServer contract — New backends MUST implement all core virtual methods: load(), unload(), chat_completion(), completion(), responses().
Subprocess model — Backends run as subprocesses (llama-server, whisper-server, sd-server, koko, flm, ryzenai-server). They must NOT run in-process.
Recipe integrity — Changes to server_models.json must have valid recipes referencing backends in backend_versions.json.
Cross-platform — Code must compile on Windows (MSVC), Linux (GCC/Clang), macOS (AppleClang). Platform-specific code must use #ifdef guards.
No hardcoded paths — Use path utilities. Windows/Linux/macOS paths differ.
Thread safety — Router serves concurrent HTTP requests. Shared state must be properly guarded.
Ollama compatibility — Changes to model listing or management must not break /api/* Ollama endpoints.
API key passthrough — When LEMONADE_API_KEY is set, all API routes must enforce authentication.
Many-clients-one-server topology — A single lemond can be driven by multiple desktop/tray/CLI clients, potentially on different machines. Per-client state (settings, layout, zoom, base URL, API key) MUST live locally in the client, never in lemond. Do not move app_settings.json behind an HTTP endpoint.
Web-app dependencies constrained by Debian native packaging — src/web-app/package.json is kept separate from src/app/package.json because the native Debian package (lemonade-server .deb) must build using only npm modules available in Debian's /usr/share/nodejs (see USE_SYSTEM_NODEJS_MODULES in src/web-app/webpack.config.js). The old Electron app depended on packages Debian does not ship. Do NOT consolidate the two package.json files — the split is required for reproducible distro packaging.
Desktop app is on-demand; lemond runs independently — On Windows, LemonadeServer.exe (which embeds lemond + tray icon) is the always-on process, auto-started via the Windows startup folder. The Tauri desktop app (lemonade-app.exe) is opened on demand when the user wants the UI and must not be added to startup. The desktop app must not embed or manage lemond's lifecycle — it discovers the already-running server (UDP beacon for local, explicit base URL for remote) and speaks to it over HTTP.

Contributing

Open an Issue before submitting major PRs
UI/frontend changes are handled by core maintainers only
Python formatting with Black is required
PRs trigger CI for linting, formatting, and integration tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AGENTS.md

Project Overview

Architecture

Executables

Backend Abstraction

Router & Multi-Model Support

Model Manager & Recipe System

API Routes

Desktop & Web App

Key Dependencies

Build Commands

Testing

Code Style

C++

Python

TypeScript/React

Key Files

Critical Invariants

Contributing

FilesExpand file tree

AGENTS.md

Latest commit

History

AGENTS.md

File metadata and controls

AGENTS.md

Project Overview

Architecture

Executables

Backend Abstraction

Router & Multi-Model Support

Model Manager & Recipe System

API Routes

Desktop & Web App

Key Dependencies

Build Commands

Testing

Code Style

C++

Python

TypeScript/React

Key Files

Critical Invariants

Contributing