
Commit df77f01

Add BBR user guide, yaml for model-aware routing (#1498)
1 parent 0a28208 commit df77f01

File tree: 7 files changed (+314, -91 lines)


README.md

Lines changed: 2 additions & 1 deletion
```diff
@@ -29,7 +29,8 @@ The following specific terms to this project:
   performance, availability and capabilities to optimize routing. Includes
   things like [Prefix Cache] status or [LoRA Adapters] availability.
 - **Endpoint Picker(EPP)**: An implementation of an `Inference Scheduler` with additional Routing, Flow, and Request Control layers to allow for sophisticated routing strategies. Additional info on the architecture of the EPP [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/0683-epp-architecture-proposal).
-
+- **Body Based Router(BBR)**: An optional additional [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) server that parses the http body of the inference prompt message and extracts information (currently the model name for OpenAI API style messages) into a format which can then be used by the gateway for routing purposes. Additional info [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/bbr/README.md) and in the documentation [user guides](https://gateway-api-inference-extension.sigs.k8s.io/guides/).
+
 
 The following are key industry terms that are important to understand for
 this project:
```
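
As a quick illustration of what the new BBR bullet describes, here is a sketch of a request (the `/v1/chat/completions` path and `$GATEWAY_IP` are illustrative placeholders; the header name comes from the HTTPRoute manifest added below):

```bash
# An OpenAI API style request. BBR parses the JSON body and copies the
# value of "model" into the X-Gateway-Model-Name request header, which
# the gateway can then use for routing.
curl -X POST http://$GATEWAY_IP/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
# After BBR runs, the routing layer effectively sees an extra header:
#   X-Gateway-Model-Name: meta-llama/Llama-3.1-8B-Instruct
```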
Lines changed: 51 additions & 0 deletions
```yaml
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-llama-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-llama3-8b-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'meta-llama/Llama-3.1-8B-Instruct'
    timeouts:
      request: 300s
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-phi4-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.k8s.io
      kind: InferencePool
      name: vllm-phi4-mini-instruct
    matches:
    - path:
        type: PathPrefix
        value: /
      headers:
      - type: Exact
        name: X-Gateway-Model-Name
        value: 'microsoft/Phi-4-mini-instruct'
    timeouts:
      request: 300s
---
```
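
A hedged usage sketch for the manifest above (the file name is illustrative; the route names come from the manifest):

```bash
# Apply the two model-aware HTTPRoutes and confirm they were accepted.
kubectl apply -f httproute-model-aware.yaml
kubectl get httproute llm-llama-route llm-phi4-route
```

Note the design: both routes match the same `PathPrefix: /` and differ only in the `X-Gateway-Model-Name` header that BBR injects, so the model name in the request body, not the URL path, selects the backend InferencePool.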
Lines changed: 88 additions & 0 deletions
```yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: phi4-mini
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  volumeMode: Filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: phi4-mini
  namespace: default
  labels:
    app: phi4-mini
spec:
  replicas: 1
  selector:
    matchLabels:
      app: phi4-mini
  template:
    metadata:
      labels:
        app: phi4-mini
    spec:
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: phi4-mini
      containers:
        - name: phi4-mini
          image: vllm/vllm-openai:latest
          command: ["/bin/sh", "-c"]
          args: [
            "vllm serve microsoft/Phi-4-mini-instruct --trust-remote-code --enable-chunked-prefill"
          ]
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 600
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 600
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: phi4-mini
  namespace: default
spec:
  ports:
    - name: http-phi4-mini
      port: 80
      protocol: TCP
      targetPort: 8000
  # The label selector should match the deployment labels & it is useful for prefix caching feature
  selector:
    app: phi4-mini
  sessionAffinity: None
  type: ClusterIP
```
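Assuming the manifest above is saved as `phi4-mini.yaml` (an illustrative name) and the `hf-token` Secret it references already exists, deploying and waiting for readiness might look like:

```bash
kubectl apply -f phi4-mini.yaml
# Model download plus vLLM startup can be slow; the probes above allow up
# to 600s before the first health check, so give the rollout generous time.
kubectl wait --for=condition=Available deployment/phi4-mini --timeout=20m
```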
pkg/bbr/README.md

Lines changed: 0 additions & 4 deletions
```diff
@@ -8,7 +8,3 @@ body of the HTTP request. However, most implementations do not support routing
 based on the request body. This extension helps bridge that gap for clients.
 This extension works by parsing the request body. If it finds a `model` parameter in the
 request body, it will copy the value of that parameter into a request header.
-
-This extension is intended to be paired with an `ext_proc` capable Gateway. There is not
-a standard way to represent this kind of extension in Gateway API yet, so we recommend
-referring to implementation-specific documentation for how to deploy this extension.
```
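
The retained README text is the core contract: find `model` in the body, copy it into a header. A toy shell approximation of that behavior, assuming `jq` is installed (BBR itself does this in-process as an ext-proc server, not with jq):

```bash
BODY='{"model": "microsoft/Phi-4-mini-instruct", "messages": []}'
# Mimic BBR: read the model parameter out of the JSON body...
MODEL=$(echo "$BODY" | jq -r '.model')
# ...and surface it as the header the gateway routes on.
echo "X-Gateway-Model-Name: $MODEL"
```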

site-src/guides/index.md

Lines changed: 5 additions & 1 deletion
````diff
@@ -101,7 +101,7 @@ Tooling:
 === "Istio"
 
     ```bash
-    export GATEWAY_PROVIDER=none
+    export GATEWAY_PROVIDER=istio
     helm install vllm-llama3-8b-instruct \
       --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
       --set provider.name=$GATEWAY_PROVIDER \
@@ -319,6 +319,10 @@ Tooling:
     kubectl get httproute llm-route -o yaml
     ```
 
+### Deploy the Body Based Router Extension (Optional)
+
+This guide shows how to get started with serving only 1 base model type per L7 URL path. If in addition, you wish to exercise model-aware routing such that more than 1 base model is served at the same L7 url path, that requires use of the (optional) Body Based Routing (BBR) extension which is described in a following section of the guide, namely the [`Serving Multiple GenAI Models`](serve-multiple-genai-models.md) section.
+
 ### Deploy InferenceObjective (Optional)
 
 Deploy the sample InferenceObjective which allows you to specify priority of requests.
````
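
Once BBR and the HTTPRoutes from this commit are deployed, an end-to-end smoke test might look like the sketch below (`$GATEWAY_IP` is a placeholder, and `/v1/completions` assumes an OpenAI-compatible model server):

```bash
# Send the same request with two different model names; BBR lifts each name
# into X-Gateway-Model-Name, and the matching HTTPRoute picks the pool.
for MODEL in "meta-llama/Llama-3.1-8B-Instruct" "microsoft/Phi-4-mini-instruct"; do
  curl -s http://$GATEWAY_IP/v1/completions \
    -H 'Content-Type: application/json' \
    -d "{\"model\": \"$MODEL\", \"prompt\": \"hi\", \"max_tokens\": 8}"
done
```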

0 commit comments
