
Commit f9401a6

Add Semantic Router support for Istio-Envoy ExtProc gateway
Signed-off-by: Sanjeev Rampal <[email protected]>
1 parent a0f0581 commit f9401a6

21 files changed: +990 −17 lines

deploy/kubernetes/README.md

Lines changed: 4 additions & 4 deletions
@@ -26,7 +26,7 @@ The deployment consists of:
 ### Standard Kubernetes Deployment
 
 ```bash
-kubectl apply -k deploy/kubernetes/
+kubectl apply -k deploy/kubernetes/base/
 
 # Check deployment status
 kubectl get pods -l app=semantic-router -n semantic-router
@@ -83,7 +83,7 @@ kubectl wait --for=condition=Ready nodes --all --timeout=300s
 **Step 2: Deploy the application**
 
 ```bash
-kubectl apply -k deploy/kubernetes/
+kubectl apply -k deploy/kubernetes/base/
 
 # Wait for deployment to be ready
 kubectl wait --for=condition=Available deployment/semantic-router -n semantic-router --timeout=600s
@@ -189,7 +189,7 @@ Or using kubectl/kind directly:
 
 ```bash
 # Remove deployment
-kubectl delete -k deploy/kubernetes/
+kubectl delete -k deploy/kubernetes/base/
 
 # Delete the kind cluster
 kind delete cluster --name semantic-router-cluster
@@ -305,7 +305,7 @@ Edit the `resources` section in `deployment.yaml` accordingly.
 
 ## Files Overview
 
-### Kubernetes Manifests (`deploy/kubernetes/`)
+### Kubernetes Manifests (`deploy/kubernetes/base`)
 
 - `deployment.yaml` - Main application deployment with optimized resource settings
 - `service.yaml` - Services for gRPC, HTTP API, and metrics

deploy/kubernetes/ai-gateway/README.md

Lines changed: 2 additions & 2 deletions
@@ -40,7 +40,7 @@ Deploy the semantic router service with all required components:
 
 ```bash
 # Deploy semantic router using Kustomize
-kubectl apply -k deploy/kubernetes/
+kubectl apply -k deploy/kubernetes/base/
 
 # Wait for deployment to be ready (this may take several minutes for model downloads)
 kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s
@@ -250,7 +250,7 @@ kubectl delete -f deploy/kubernetes/ai-gateway/inference-pool
 kubectl delete -f deploy/kubernetes/ai-gateway/configuration
 
 # Remove semantic router
-kubectl delete -k deploy/kubernetes/
+kubectl delete -k deploy/kubernetes/base/
 
 # Remove AI gateway
 helm uninstall aieg -n envoy-ai-gateway-system
7 files renamed without changes.

deploy/kubernetes/istio/README.md

Lines changed: 267 additions & 0 deletions
@@ -0,0 +1,267 @@
# Install in Kubernetes

This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vsr) with an Istio Gateway on Kubernetes. Istio Gateway uses Envoy under the covers, so vsr can be used with it. Several topologies can combine Istio Gateway with vsr; this document describes one of the common ones.

## Architecture Overview

The deployment consists of:

- **vLLM Semantic Router**: Provides intelligent request routing and classification
- **Istio Gateway**: An Istio-managed gateway that runs an Envoy proxy under the covers
- **Gateway API Inference Extension**: An additional control and data plane for endpoint picking that can optionally attach to the same Istio gateway as vLLM Semantic Router
- **Two vLLM instances, each serving one model**: Example backend LLMs for illustrating semantic routing in this topology
## Prerequisites

Before starting, ensure you have the following tools installed:

- [Docker](https://docs.docker.com/get-docker/) - Container runtime
- [minikube](https://minikube.sigs.k8s.io/docs/start/) - Local Kubernetes
- [kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) - Kubernetes in Docker
- [kubectl](https://kubernetes.io/docs/tasks/tools/) - Kubernetes CLI

Either minikube or kind can provide the local Kubernetes cluster needed for this exercise, so you only need one of the two. We use minikube in the description below, but the same steps should work with a kind cluster once the cluster is created in Step 1.

We also deploy two different LLMs in this exercise to illustrate the semantic routing and model routing functions more clearly, so ideally you should run this on a machine with GPU support and adequate memory and storage for the two models used here. Equivalent steps also work with smaller LLMs on a CPU-only server without GPUs.

## Step 1: Create Minikube Cluster

Create a local Kubernetes cluster via minikube (or equivalently via kind).

```bash
# Create cluster
minikube start \
  --driver docker \
  --container-runtime docker \
  --gpus all \
  --memory no-limit \
  --cpus no-limit

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```
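
If you prefer kind, a minimal equivalent might look like the sketch below; the cluster name is arbitrary (it matches the one used elsewhere in this repo), and note that GPU passthrough with kind needs extra setup that is not covered here.

```bash
# Create cluster (kind alternative; cluster name is arbitrary)
kind create cluster --name semantic-router-cluster

# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
```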

## Step 2: Deploy LLM model services

As noted earlier, in this exercise we deploy two LLMs: a llama3-8b model (meta-llama/Llama-3.1-8B-Instruct) and a phi4-mini model (microsoft/Phi-4-mini-instruct). We chose to serve these models with two separate instances of the [vLLM inference server](https://docs.vllm.ai/en/latest/) running in the default namespace of the Kubernetes cluster. You may use any inference server to serve these models, but we have provided manifests to run them in vLLM containers as a reference.

```bash
# Create vLLM service running llama3-8b
kubectl apply -f deploy/kubernetes/istio/vLlama3.yaml
```

The first run may take several (10+) minutes to download the model before the vLLM pod running it reaches the READY state. Similarly, deploy the second LLM (phi4-mini) and wait several minutes until its pod is READY.

```bash
# Create vLLM service running phi4-mini
kubectl apply -f deploy/kubernetes/istio/vPhi4.yaml
```
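
Rather than polling manually, you can block until the pods become Ready, as in this sketch (it assumes the vLLM pods run in the default namespace, as the provided manifests do):

```bash
# Block until all pods in the default namespace are Ready;
# the first model download can take 10+ minutes
kubectl wait --for=condition=Ready pod --all --timeout=20m
```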

At the end of this step, both vLLM pods should be READY and serving their LLMs, which you can confirm with the commands below. You should also see Kubernetes services exposing the IP/port on which these models are served. In the example below, the llama3-8b model is served via a Kubernetes service with service IP 10.108.250.109 and port 80.

```bash
# Verify that the vLLM pods running the two LLMs are READY and serving
kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
llama-8b-57b95475bd-ph7s4   1/1     Running   0          9d
phi4-mini-887476b56-74twv   1/1     Running   0          9d

# View the IP/port of the Kubernetes services on which these models are served
kubectl get service
NAME         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1        <none>        443/TCP   36d
llama-8b     ClusterIP   10.108.250.109   <none>        80/TCP    18d
phi4-mini    ClusterIP   10.97.252.33     <none>        80/TCP    9d
```
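
Optionally, before wiring up the gateway, you can sanity-check a model service directly from inside the cluster. This is a quick sketch assuming the vLLM pods expose the standard OpenAI-compatible API on the service port shown above; the temporary pod name is arbitrary.

```bash
# List the models served by the llama-8b service from a throwaway in-cluster pod
kubectl run tmp-curl --rm -it --image=curlimages/curl --restart=Never -- \
  curl -s http://llama-8b.default.svc.cluster.local/v1/models
```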

## Step 3: Update vsr config if needed

The file deploy/kubernetes/istio/config.yaml is used to configure vsr when it is installed in the next step. The example config file provided in this repo should work if you use the same LLMs as in this exercise, but you can adjust it to enable or disable individual vsr features. Ensure that the vllm_endpoints in the file match the IP/port of the LLM services you are running. It is usually best to start with basic vsr features such as prompt classification and model routing before experimenting with the other features described elsewhere in the vsr documentation.
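
One quick way to cross-check the endpoints is sketched below; the exact layout of the vllm_endpoints section may differ in your copy of config.yaml, so adjust the grep context as needed.

```bash
# Show the endpoints configured for vsr ...
grep -n -A 4 "vllm_endpoints" deploy/kubernetes/istio/config.yaml

# ... and compare against the ClusterIP/port of the running model services
kubectl get svc llama-8b phi4-mini \
  -o custom-columns=NAME:.metadata.name,IP:.spec.clusterIP,PORT:.spec.ports[0].port
```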

## Step 4: Deploy vLLM Semantic Router

Deploy the semantic router service with all required components:

```bash
# Deploy semantic router using Kustomize
kubectl apply -k deploy/kubernetes/istio/

# Wait for deployment to be ready (this may take several minutes for model downloads)
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s

# Verify deployment status
kubectl get pods -n vllm-semantic-router-system
```

## Step 5: Install Istio Gateway, Gateway API, Inference Extension

We will use a recent build of Istio for this exercise so that we also have the option of using the v1.0.0 GA version of the Gateway API Inference Extension CRDs and EPP functionality.

Follow the procedures described in the Gateway API [Inference Extensions documentation](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to deploy Istio Gateway 1.28 (or newer), the Kubernetes Gateway API CRDs, and the Gateway API Inference Extension v1.0.0. Do not install any of the HTTPRoute resources from that guide, however; use it only to deploy the Istio gateway and CRDs. If everything installed correctly, the commands below should show the Gateway API and Inference Extension CRDs as well as running pods for the Istio gateway and istiod.

```bash
# Gateway API and Inference Extension CRDs
kubectl get crds | grep gateway
kubectl get crds | grep inference

# Istio gateway and istiod pods
kubectl get pods | grep istio
kubectl get pods -n istio-system
```

## Step 6: Install additional Istio configuration

Install the DestinationRule and EnvoyFilter needed for the Istio gateway to use the ExtProc-based interface to the vLLM Semantic Router.

```bash
kubectl apply -f deploy/kubernetes/istio/destinationrule.yaml
kubectl apply -f deploy/kubernetes/istio/envoyfilter.yaml
```
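
You can confirm that both resources were created with a quick check (the resource names come from the applied manifests):

```bash
# List the DestinationRules and EnvoyFilters across namespaces
kubectl get destinationrules,envoyfilters -A
```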

## Step 7: Install gateway routes

Install the HTTPRoutes in the Istio gateway.

```bash
kubectl apply -f deploy/kubernetes/istio/httproute-llama3-8b.yaml
kubectl apply -f deploy/kubernetes/istio/httproute-phi4-mini.yaml
```
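
A quick way to confirm the routes exist (the route names come from the applied manifests):

```bash
# Verify the HTTPRoutes were created
kubectl get httproute
```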

## Testing the Deployment

To expose the IP on which the Istio gateway listens for client requests from outside the cluster, you can choose any standard Kubernetes option for external load balancing. We tested this feature by [deploying and configuring MetalLB](https://metallb.universe.tf/installation/) in the cluster as the LoadBalancer provider; refer to the MetalLB documentation for installation procedures if needed. Finally, for the minikube case, we get the external URL as shown below.

```bash
minikube service inference-gateway-istio --url
http://192.168.49.2:30913
```
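
If you are not using minikube, the externally reachable address assigned by the LoadBalancer provider (MetalLB in our case) can be read off the gateway service directly; the service name matches the one used in the Troubleshooting section below.

```bash
# Show the EXTERNAL-IP assigned to the Istio gateway service
kubectl get svc inference-gateway-istio
```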

Now we can send LLM prompts via curl to http://192.168.49.2:30913 to reach the Istio gateway, which will use information from the vLLM Semantic Router to dynamically route each request to one of the two backend LLMs we deployed.

### Send Test Requests

Try the following cases, with and without "auto" model selection, to confirm that Istio and vsr together route queries to the appropriate model. The query responses include information about which model served each request.

Example queries to try include the following.

```bash
# Model name llama3-8b provided explicitly, should route to this backend
curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3-8b",
  "messages": [
    {"role": "user", "content": "Linux is said to be an open source kernel because "}
  ],
  "max_tokens": 100,
  "temperature": 0
}'
```

```bash
# Model name set to "auto", should categorize to "computer science" & route to llama3-8b
curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "Linux is said to be an open source kernel because "}
  ],
  "max_tokens": 100,
  "temperature": 0
}'
```

```bash
# Model name phi4-mini provided explicitly, should route to this backend
curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "phi4-mini",
  "messages": [
    {"role": "user", "content": "2+2 is "}
  ],
  "max_tokens": 100,
  "temperature": 0
}'
```

```bash
# Model name set to "auto", should categorize to "math" & route to phi4-mini
curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "auto",
  "messages": [
    {"role": "user", "content": "2+2 is "}
  ],
  "max_tokens": 100,
  "temperature": 0
}'
```
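
To see which backend actually served an "auto" request, you can inspect the model field of the JSON response, as in this sketch (assumes jq is installed; vLLM returns the OpenAI-style chat completion format, which includes the name of the serving model):

```bash
# Extract the name of the model that served the request
curl -s http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "auto",
  "messages": [{"role": "user", "content": "2+2 is "}],
  "max_tokens": 100,
  "temperature": 0
}' | jq -r '.model'
```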

## Troubleshooting

### Common Issues

**Gateway / front end not working:**

```bash
# Check Istio gateway status
kubectl get gateway

# Check Istio gateway service status
kubectl get svc inference-gateway-istio

# Check Istio's Envoy logs
kubectl logs deploy/inference-gateway-istio -c istio-proxy
```
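
If the gateway pods are running but requests still fail, it can help to check whether the gateway's Envoy has synced its configuration from istiod (assumes istioctl is installed):

```bash
# Show config sync status of all Envoy proxies managed by this Istio control plane
istioctl proxy-status
```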

**Semantic router not responding:**

```bash
# Check semantic router pod
kubectl get pods -n vllm-semantic-router-system

# Check semantic router service
kubectl get svc -n vllm-semantic-router-system

# Check semantic router logs
kubectl logs -n vllm-semantic-router-system deployment/semantic-router
```

## Cleanup

```bash
# Remove semantic router
kubectl delete -k deploy/kubernetes/istio/

# Remove Istio
istioctl uninstall --purge

# Remove LLMs
kubectl delete -f deploy/kubernetes/istio/vLlama3.yaml
kubectl delete -f deploy/kubernetes/istio/vPhi4.yaml

# Stop minikube cluster
minikube stop

# Delete minikube cluster
minikube delete
```

## Next Steps

- Test and experiment with different features of vLLM Semantic Router
- Explore additional use cases and topologies with Istio Gateway (including with EPP and LLM-D)
- Set up monitoring and observability
- Implement authentication and authorization
- Scale the semantic router deployment for production workloads
