Skip to content

Latest commit

 

History

History
71 lines (49 loc) · 4.21 KB

File metadata and controls

71 lines (49 loc) · 4.21 KB

LexiPlan 2.0 — LLM Deployment & Usage Guide

This document outlines how the Large Language Model (LLM) inference is structured in LexiPlan 2.0, how it is deployed, and how to configure it when setting up the project on a new device or environment.

1. Current Architecture

LexiPlan 2.0 currently uses Gemma 4 26B MoE (A4B) as its core inference engine.

  • Deployment Method: GCP Model Garden One-Click Deploy.
  • Region: europe-west4 (Netherlands).
  • Inference Engine: vLLM (OpenAI-compatible API).

Why Gemma 4 26B MoE?

We previously experimented with a 31B dense model on a single RTX Pro 6000, but faced memory constraints (OOM errors) and slow inference. The 26B Mixture-of-Experts (MoE) model provides an excellent balance:

  • It fits comfortably within GPU memory limits (native bfloat16 without aggressive quantization).
  • It leaves ample room (~18GB) for the KV Cache, allowing high throughput for long-context legal documents.

2. Backend Configuration

The FastAPI backend uses LangChain (ChatOpenAI) to connect to the vLLM instance. There are two "clients" configured in backend/llm/gemma_client.py:

  1. llm_heavy: Used for heavy reasoning tasks (Information Extraction, Action Plan Generation). It has a larger max_tokens budget (4096).
  2. llm_light: Used for lighter tasks (Context Retrieval, Confidence Scoring). It has a smaller max_tokens budget (2048).

Important: In the current setup, both clients point to the same Gemma 4 26B MoE endpoint. They only differ in their token budgets.

Gemma 4 Specific Quirks

Our configuration (backend/llm/structured_output.py and backend/llm/gemma_client.py) includes specific parameters to handle known behaviors of the Gemma 4 family:

  • repetition_penalty: 1.05: Mitigates the Gemma 4 repetition loop bug.
  • enable_thinking: False: Explicitly disables thinking mode during structured JSON output generation to prevent conflicts with vLLM's reasoning-parser. Thinking is only enabled for specific chain-of-thought steps.

3. Setup Instructions for a New Environment

When pulling this repository to a new machine or setting up a new environment, follow these steps to connect the backend to the LLM:

Step A: Model Deployment

  1. Navigate to GCP Vertex AI Model Garden.
  2. Search for Gemma 4 26B MoE.
  3. Use the One-Click Deploy option to deploy the model via vLLM.
  4. Ensure it is deployed in your target region (e.g., europe-west4).
  5. Once deployed, copy the External IP or Endpoint URL provided by the service.

Step B: Environment Variables

  1. Copy .env.example to .env in the root of the project.
  2. Update the LLM endpoint URLs in your .env file with the URL from Step A.
# ── LLM Endpoints ──────────────────────────────────────────────────────
# Format: http://<EXTERNAL_IP>:8000
GEMMA_31B_URL=http://YOUR_MODEL_GARDEN_IP:8000
GEMMA_MOE_URL=http://YOUR_MODEL_GARDEN_IP:8000

(Note: Keep the /v1 off the base URL in the .env file; the code appends it automatically).

Step C: Local / Fallback Development (Mock Client)

If you are developing locally and do not want to incur cloud GPU costs, the application is designed to gracefully fall back to a MockStructuredClient.

If the FastAPI backend cannot reach the URLs specified in GEMMA_31B_URL, the pipeline nodes will automatically use mock data (defined in backend/llm/_mock_structured_client.py). This allows UI/UX and backend engineers to build and test the LangGraph workflow without a running LLM.


4. Legacy K8s Deployment Note

You may notice a file at k8s/gemma4-31b-deployment.yaml.

  • What is it? This was our previous attempt to run the model manually on a self-managed GKE cluster using an RTX Pro 6000 node pool.
  • Status: It is currently scaled to 0 replicas (kubectl scale deployment gemma4-31b -n lexiplan --replicas=0) to avoid conflicting with the Model Garden deployment and to save costs.
  • Usage: It remains in the repository solely as a backup configuration in case we ever need to move away from the managed Model Garden service back to raw GKE pods.