This document outlines how the Large Language Model (LLM) inference is structured in LexiPlan 2.0, how it is deployed, and how to configure it when setting up the project on a new device or environment.
LexiPlan 2.0 currently uses Gemma 4 26B MoE (A4B) as its core inference engine.
- Deployment Method: GCP Model Garden One-Click Deploy.
- Region:
europe-west4(Netherlands). - Inference Engine: vLLM (OpenAI-compatible API).
We previously experimented with a 31B dense model on a single RTX Pro 6000, but faced memory constraints (OOM errors) and slow inference. The 26B Mixture-of-Experts (MoE) model provides an excellent balance:
- It fits comfortably within GPU memory limits (native
bfloat16without aggressive quantization). - It leaves ample room (~18GB) for the KV Cache, allowing high throughput for long-context legal documents.
The FastAPI backend uses LangChain (ChatOpenAI) to connect to the vLLM instance. There are two "clients" configured in backend/llm/gemma_client.py:
llm_heavy: Used for heavy reasoning tasks (Information Extraction, Action Plan Generation). It has a largermax_tokensbudget (4096).llm_light: Used for lighter tasks (Context Retrieval, Confidence Scoring). It has a smallermax_tokensbudget (2048).
Important: In the current setup, both clients point to the same Gemma 4 26B MoE endpoint. They only differ in their token budgets.
Our configuration (backend/llm/structured_output.py and backend/llm/gemma_client.py) includes specific parameters to handle known behaviors of the Gemma 4 family:
repetition_penalty: 1.05: Mitigates the Gemma 4 repetition loop bug.enable_thinking: False: Explicitly disables thinking mode during structured JSON output generation to prevent conflicts with vLLM'sreasoning-parser. Thinking is only enabled for specific chain-of-thought steps.
When pulling this repository to a new machine or setting up a new environment, follow these steps to connect the backend to the LLM:
- Navigate to GCP Vertex AI Model Garden.
- Search for Gemma 4 26B MoE.
- Use the One-Click Deploy option to deploy the model via vLLM.
- Ensure it is deployed in your target region (e.g.,
europe-west4). - Once deployed, copy the External IP or Endpoint URL provided by the service.
- Copy
.env.exampleto.envin the root of the project. - Update the LLM endpoint URLs in your
.envfile with the URL from Step A.
# ── LLM Endpoints ──────────────────────────────────────────────────────
# Format: http://<EXTERNAL_IP>:8000
GEMMA_31B_URL=http://YOUR_MODEL_GARDEN_IP:8000
GEMMA_MOE_URL=http://YOUR_MODEL_GARDEN_IP:8000(Note: Keep the /v1 off the base URL in the .env file; the code appends it automatically).
If you are developing locally and do not want to incur cloud GPU costs, the application is designed to gracefully fall back to a MockStructuredClient.
If the FastAPI backend cannot reach the URLs specified in GEMMA_31B_URL, the pipeline nodes will automatically use mock data (defined in backend/llm/_mock_structured_client.py). This allows UI/UX and backend engineers to build and test the LangGraph workflow without a running LLM.
You may notice a file at k8s/gemma4-31b-deployment.yaml.
- What is it? This was our previous attempt to run the model manually on a self-managed GKE cluster using an RTX Pro 6000 node pool.
- Status: It is currently scaled to 0 replicas (
kubectl scale deployment gemma4-31b -n lexiplan --replicas=0) to avoid conflicting with the Model Garden deployment and to save costs. - Usage: It remains in the repository solely as a backup configuration in case we ever need to move away from the managed Model Garden service back to raw GKE pods.