
Commit df77f01

Add BBR user guide, yaml for model-aware routing (#1498)
1 parent 0a28208 commit df77f01

File tree: 7 files changed (+314, -91 lines)


README.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -29,7 +29,8 @@ The following specific terms to this project:
   performance, availability and capabilities to optimize routing. Includes
   things like [Prefix Cache] status or [LoRA Adapters] availability.
 - **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
-
+- **Body Based Router(BBR)**: An optional additional [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) server that parses the http body of the inference prompt message and extracts information (currently the model name for OpenAI API style messages) into a format which can then be used by the gateway for routing purposes. Additional info [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/bbr/README.md) and in the documentation [user guides](https://gateway-api-inference-extension.sigs.k8s.io/guides/).
+
 
 The following are key industry terms that are important to understand for
 this project:
```
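
As a quick illustration of what the new BBR bullet describes, here is a sketch of a request (the `/v1/chat/completions` path and `$GATEWAY_IP` are illustrative placeholders; the header name comes from the HTTPRoute manifest added below):

```bash
# An OpenAI API style request. BBR parses the JSON body and copies the
# value of "model" into the X-Gateway-Model-Name request header, which
# the gateway can then use for routing.
curl -X POST http://$GATEWAY_IP/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
# After BBR runs, the routing layer effectively sees an extra header:
#   X-Gateway-Model-Name: meta-llama/Llama-3.1-8B-Instruct
```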
Lines changed: 51 additions & 0 deletions
```yaml
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-llama-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'meta-llama/Llama-3.1-8B-Instruct'
    timeouts:
      request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-phi4-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-phi4-mini-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'microsoft/Phi-4-mini-instruct'
    timeouts:
      request: 300s
---
```
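
A hedged usage sketch for the manifest above (the file name is illustrative; the route names come from the manifest):

```bash
# Apply the two model-aware HTTPRoutes and confirm they were accepted.
kubectl apply -f httproute-model-aware.yaml
kubectl get httproute llm-llama-route llm-phi4-route
```

Note the design: both routes match the same `PathPrefix: /` and differ only in the `X-Gateway-Model-Name` header that BBR injects, so the model name in the request body, not the URL path, selects the backend InferencePool.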
Lines changed: 88 additions & 0 deletions
```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: phi4-mini
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phi4-mini
  namespace: default
  labels:
    app: phi4-mini
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phi4-mini
  template:
    metadata:
      labels:
        app: phi4-mini
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: phi4-mini
      containers:
        - name: phi4-mini
          image: vllm/vllm-openai:latest
          command: ["/bin/sh", "-c"]
          args: [
            "vllm serve microsoft/Phi-4-mini-instruct --trust-remote-code --enable-chunked-prefill"
          ]
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 600
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 600
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: phi4-mini
  namespace: default
spec:
  ports:
    - name: http-phi4-mini
      port: 80
      protocol: TCP
      targetPort: 8000
  # The label selector should match the deployment labels & it is useful for prefix caching feature
  selector:
    app: phi4-mini
  sessionAffinity: None
  type: ClusterIP
```
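Assuming the manifest above is saved as `phi4-mini.yaml` (an illustrative name) and the `hf-token` Secret it references already exists, deploying and waiting for readiness might look like:

```bash
kubectl apply -f phi4-mini.yaml
# Model download plus vLLM startup can be slow; the probes above allow up
# to 600s before the first health check, so give the rollout generous time.
kubectl wait --for=condition=Available deployment/phi4-mini --timeout=20m
```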
pkg/bbr/README.md

Lines changed: 0 additions & 4 deletions
```diff
@@ -8,7 +8,3 @@ body of the HTTP request. However, most implementations do not support routing
 based on the request body. This extension helps bridge that gap for clients.
 This extension works by parsing the request body. If it finds a `model` parameter in the
 request body, it will copy the value of that parameter into a request header.
-
-This extension is intended to be paired with an `ext_proc` capable Gateway. There is not
-a standard way to represent this kind of extension in Gateway API yet, so we recommend
-referring to implementation-specific documentation for how to deploy this extension.
```
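
The retained README text is the core contract: find `model` in the body, copy it into a header. A toy shell approximation of that behavior, assuming `jq` is installed (BBR itself does this in-process as an ext-proc server, not with jq):

```bash
BODY='{"model": "microsoft/Phi-4-mini-instruct", "messages": []}'
# Mimic BBR: read the model parameter out of the JSON body...
MODEL=$(echo "$BODY" | jq -r '.model')
# ...and surface it as the header the gateway routes on.
echo "X-Gateway-Model-Name: $MODEL"
```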

site-src/guides/index.md

Lines changed: 5 additions & 1 deletion
````diff
@@ -101,7 +101,7 @@ Tooling:
 === "Istio"
 
     ```bash
-    export GATEWAY_PROVIDER=none
+    export GATEWAY_PROVIDER=istio
     helm install vllm-llama3-8b-instruct \
       --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
       --set provider.name=$GATEWAY_PROVIDER \
@@ -319,6 +319,10 @@ Tooling:
     kubectl get httproute llm-route -o yaml
     ```
 
+### Deploy the Body Based Router Extension (Optional)
+
+This guide shows how to get started with serving only 1 base model type per L7 URL path. If in addition, you wish to exercise model-aware routing such that more than 1 base model is served at the same L7 url path, that requires use of the (optional) Body Based Routing (BBR) extension which is described in a following section of the guide, namely the [`Serving Multiple GenAI Models`](serve-multiple-genai-models.md) section.
+
 ### Deploy InferenceObjective (Optional)
 
 Deploy the sample InferenceObjective which allows you to specify priority of requests.
````
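
Once BBR and the HTTPRoutes from this commit are deployed, an end-to-end smoke test might look like the sketch below (`$GATEWAY_IP` is a placeholder, and `/v1/completions` assumes an OpenAI-compatible model server):

```bash
# Send the same request with two different model names; BBR lifts each name
# into X-Gateway-Model-Name, and the matching HTTPRoute picks the pool.
for MODEL in "meta-llama/Llama-3.1-8B-Instruct" "microsoft/Phi-4-mini-instruct"; do
  curl -s http://$GATEWAY_IP/v1/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$MODEL\", \"prompt\": \"hi\", \"max_tokens\": 8}"
done
```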

0 commit comments
