Goal Description

You want to remain fully self-hosted using vLLM on your GKE cluster with the RTX Pro 6000 GPU, but you need an option that is comparable in capability to Gemma 4 31B while being significantly faster.

Why Was Gemma 4 31B Slow or Failing?

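This plan assumes the GPU exposes 48GB of VRAM. Confirm the figure with nvidia-smi, because the RTX 6000 Ada has 48GB while the RTX PRO 6000 Blackwell has 96GB. A 31-billion-parameter model loaded with 8-bit quantization needs about one byte per parameter, so the weights alone take ~32GB once quantization metadata is included. That leaves very little room for the KV cache (the memory that stores attention keys and values for every token in context during inference).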
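The arithmetic behind that estimate can be sketched in a few lines. This is a rough back-of-the-envelope helper, not a measurement: the function names are illustrative, 1 GB is taken as 1e9 bytes, and the 0.90 factor is vLLM's default `--gpu-memory-utilization`. Activations, CUDA graphs, and quantization metadata are ignored, so real headroom will be somewhat lower.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes-per-GB
    return params_billion * bytes_per_param


def kv_headroom_gb(total_vram_gb: float, weights_gb: float,
                   gpu_memory_utilization: float = 0.90) -> float:
    """VRAM left for the KV cache after weights, within vLLM's memory budget.

    Assumption: ignores activation memory and other runtime overhead.
    """
    return total_vram_gb * gpu_memory_utilization - weights_gb


if __name__ == "__main__":
    # 31B parameters at 8 bits per parameter on an assumed 48GB card
    weights = weight_memory_gb(31, 8)
    print(f"weights: ~{weights:.0f} GB")
    print(f"KV headroom: ~{kv_headroom_gb(48, weights):.1f} GB")
```

Swapping in `weight_memory_gb(8, 16)` or `weight_memory_gb(14.7, 16)` gives the bfloat16 weight sizes for the smaller models discussed later in this plan.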

When the KV Cache is constrained:

  1. Throughput drops because fewer sequences fit in the cache at once, so the engine batches fewer requests per step.
  2. Long contexts exhaust the cache quickly, causing Out-Of-Memory (OOM) errors or forcing the engine to preempt requests and either recompute them or swap their cache to CPU RAM, which slows inference to a crawl.

Our Next Powerful (and Faster) Self-Hosted Options

To achieve high speed on a single 48GB GPU, we need a model that can run in native bfloat16 (16-bit) without quantization while leaving as much VRAM as possible for the KV cache. A smaller unquantized model generates tokens faster, and the larger cache lets vLLM batch more concurrent requests and hold longer contexts.

Option A: Llama 3.1 8B Instruct — The Industry Standard for Speed

  • Model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Why: Despite having only 8B parameters, Llama 3.1 rivals the performance of older 30B+ models in reasoning and instruction following. Its weights take only ~16GB in bfloat16 (8B parameters × 2 bytes).
  • Speed: Extremely fast (expect 70-100+ tokens/sec on an RTX Pro 6000).
  • VRAM Usage: ~16GB weights + up to ~30GB KV cache (closer to ~27GB at vLLM's default --gpu-memory-utilization of 0.9) = Perfect fit.

Option B: Qwen 2.5 14B Instruct — The Capability Sweet Spot

  • Model: Qwen/Qwen2.5-14B-Instruct
  • Why: If you feel an 8B model isn't quite powerful enough for deep legal extraction, the 14B class is the perfect middle ground. Qwen 2.5 14B often beats 30B+ models on reasoning benchmarks.
  • Speed: Fast (expect 40-60 tokens/sec).
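  • VRAM Usage: ~29GB weights (14.7B parameters × 2 bytes) + roughly 14-18GB KV cache, depending on --gpu-memory-utilization = Workable fit, with less headroom than Option A.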
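To make the KV-cache budgets for Options A and B more concrete, the sketch below converts a cache budget in GB into an approximate token capacity. The layer counts, KV-head counts, and head dimensions are taken from the published model configurations as best recalled and should be verified against each model's `config.json`; the class and function names are illustrative, and the estimate ignores block fragmentation and other engine overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    layers: int    # num_hidden_layers
    kv_heads: int  # num_key_value_heads (grouped-query attention)
    head_dim: int  # hidden_size / num_attention_heads


# Assumed values; check config.json on the model hub before relying on them.
LLAMA_3_1_8B = AttnShape(layers=32, kv_heads=8, head_dim=128)
QWEN_2_5_14B = AttnShape(layers=48, kv_heads=8, head_dim=128)


def kv_bytes_per_token(shape: AttnShape, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token; bfloat16 stores 2 bytes per value."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_value


def tokens_that_fit(kv_budget_gb: float, shape: AttnShape) -> int:
    """Approximate number of cached tokens for a budget (1 GB = 1e9 bytes)."""
    return int(kv_budget_gb * 1e9 // kv_bytes_per_token(shape))


if __name__ == "__main__":
    print("Option A, 30 GB budget:", tokens_that_fit(30, LLAMA_3_1_8B), "tokens")
    print("Option B, 18 GB budget:", tokens_that_fit(18, QWEN_2_5_14B), "tokens")
```

The token capacity is shared across all in-flight requests, so dividing it by the typical prompt-plus-output length gives a rough ceiling on concurrent sequences.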

Option C: Gemma 2 9B IT — The Google Alternative

  • Model: google/gemma-2-9b-it
  • Why: If you specifically want to stay within the Gemma family, Gemma 2 9B is exceptionally powerful for its size and uses sliding-window attention on alternating layers to save KV-cache memory.

Proposed Changes

If we proceed with Option A (Llama 3.1 8B) or Option B (Qwen 2.5 14B):

1. Update Kubernetes Deployment

[MODIFY] k8s/gemma4-31b-deployment.yaml

  • Rename deployment to reflect the new model.
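  • Change --model=google/gemma-4-31B-it to the chosen model: --model=meta-llama/Meta-Llama-3.1-8B-Instruct for Option A, or --model=Qwen/Qwen2.5-14B-Instruct for Option B.
  • Remove --quantization=bitsandbytes so the model runs in native bfloat16.
  • Add --dtype=bfloat16, and add --enable-prefix-caching so requests that share a prompt prefix (system prompt, repeated RAG context) reuse cached KV blocks instead of recomputing them.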
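As a quick way to review the resulting flag set before editing the YAML, the container arguments can be generated with a small helper. This is only a sketch: `build_vllm_args` is a hypothetical function, and the output is meant to be pasted into the container's `args:` list by hand. Only the flags named in this plan are included.

```python
def build_vllm_args(model: str) -> list[str]:
    """Return the vLLM server flags proposed in this plan for the given model.

    Note: --quantization is deliberately absent so the weights load in bfloat16.
    """
    return [
        f"--model={model}",
        "--dtype=bfloat16",
        "--enable-prefix-caching",
    ]


if __name__ == "__main__":
    # Print one YAML list item per flag for the Deployment's container args.
    for arg in build_vllm_args("meta-llama/Meta-Llama-3.1-8B-Instruct"):
        print(f"- {arg}")
```

Calling the helper with `Qwen/Qwen2.5-14B-Instruct` yields the equivalent list for Option B.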

2. Update Backend Configuration

[MODIFY] backend/llm/gemma_client.py

  • Change the model parameter inside llm_heavy and llm_light to match the new Llama/Qwen model name.
  • Remove Gemma-specific configurations if necessary.

[MODIFY] backend/llm/langchain_adapter.py

  • Update the model_name fallback strings.
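One way to avoid editing the model name in several files is to resolve it in a single place. The sketch below is an assumption about how that could look, not the existing code: the `VLLM_MODEL_NAME` environment variable, the `LLMSettings` class, and the token limits are all hypothetical, and only the `llm_heavy` and `llm_light` names come from this plan.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Fallback used when no override is set (Option A in this plan).
DEFAULT_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"


@dataclass(frozen=True)
class LLMSettings:
    model: str
    max_tokens: int  # hypothetical per-profile output limit


def resolve_model_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the model name from the environment, falling back to the default."""
    env = os.environ if env is None else env
    return env.get("VLLM_MODEL_NAME", DEFAULT_MODEL)


def make_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build both client profiles from one resolved model name."""
    model = resolve_model_name(env)
    return {
        "llm_heavy": LLMSettings(model=model, max_tokens=2048),
        "llm_light": LLMSettings(model=model, max_tokens=512),
    }


if __name__ == "__main__":
    for name, settings in make_settings().items():
        print(name, settings)
```

With this shape, switching between Option A and Option B becomes a one-line change to the Deployment's environment rather than a code edit in both backend files.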

User Review Required

Important

Which self-hosted model would you like to switch to?

  1. Llama 3.1 8B (Maximum speed, huge context window)
  2. Qwen 2.5 14B (Maximum reasoning capability while still fitting comfortably on 1 GPU)
  3. Gemma 2 9B (Stay in the Gemma family)

Let me know, and I will update the Kubernetes YAML and Python code immediately so you can kubectl apply and get moving!