diff --git a/deploy/kubernetes/istio/README.md b/deploy/kubernetes/istio/README.md new file mode 100644 index 00000000..5742624e --- /dev/null +++ b/deploy/kubernetes/istio/README.md @@ -0,0 +1,259 @@ +# vLLM Semantic Router as ExtProc server for Istio Gateway + +This guide provides step-by-step instructions for deploying the vLLM Semantic Router (vsr) with Istio Gateway on Kubernetes. Istio Gateway uses Envoy under the covers, so vsr can be used with it as an ExtProc server. Several topologies are possible for combining Istio Gateway with vsr; this document describes one of the common ones. + +## Architecture Overview + +The deployment consists of: + +- **vLLM Semantic Router**: Provides intelligent request routing and classification +- **Istio Gateway**: A Kubernetes Gateway implementation that uses an Envoy proxy under the covers +- **Gateway API Inference Extension**: Additional control and data plane for endpoint picking that can optionally attach to the same Istio gateway as the vLLM Semantic Router +- **Two instances of vLLM, each serving one model**: Example backend LLMs for illustrating semantic routing in this topology + +## Prerequisites + +Before starting, ensure you have the following tools installed: + +- [Docker](https://docs.docker.com/get-docker/) - Container runtime +- [minikube](https://minikube.sigs.k8s.io/docs/start/) - Local Kubernetes +- [kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation) - Kubernetes in Docker +- [kubectl](https://kubernetes.io/docs/tasks/tools/) - Kubernetes CLI + +Either minikube or kind can create the local Kubernetes cluster needed for this exercise, so you only need one of the two. We use minikube in the description below, but the same steps should work with a kind cluster once the cluster is created in Step 1. + +We also deploy two different LLMs in this exercise to illustrate the semantic routing and model routing functions more clearly, so ideally you should run this on a machine with GPU support and adequate memory and storage for the two models used here. Alternatively, you can follow equivalent steps on a smaller CPU-only server running smaller LLMs. + +## Step 1: Create Minikube Cluster + +Create a local Kubernetes cluster via minikube (or equivalently via kind). + +```bash +# Create cluster +$ minikube start \ + --driver docker \ + --container-runtime docker \ + --gpus all \ + --memory no-limit \ + --cpus no-limit + +# Verify cluster is ready +$ kubectl wait --for=condition=Ready nodes --all --timeout=300s +``` + +## Step 2: Deploy LLM models service + +As noted earlier, in this exercise we deploy two LLMs: a llama3-8b model (meta-llama/Llama-3.1-8B-Instruct) and a phi4-mini model (microsoft/Phi-4-mini-instruct). We chose to serve these models using two separate instances of the [vLLM inference server](https://docs.vllm.ai/en/latest/) running in the default namespace of the Kubernetes cluster. You may use any inference server to serve these models, but we have provided manifests to run them in vLLM containers as a reference. + +```bash +# Create vLLM service running llama3-8b +kubectl apply -f deploy/kubernetes/istio/vLlama3.yaml +``` + +The first time this is run it may take several (10+) minutes to download the model before the vLLM pod serving it reaches the READY state; a small sketch of how to wait for this is shown below.
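If you prefer to block until the model pod is ready rather than polling manually, you can use `kubectl wait` with the pod label from the provided manifest (a minimal sketch; it assumes the label `app: vllm-llama3-8b-instruct` from vLlama3.yaml and a generous timeout for the model download):

```bash
# Block until the llama3-8b vLLM pod reports READY (model download can take 10+ minutes)
kubectl wait --for=condition=Ready pod \
  -l app=vllm-llama3-8b-instruct \
  --timeout=1800s

# Optionally follow the vLLM startup logs while the model downloads
kubectl logs -f deployment/llama-8b
```

The same approach works for the second model with `-l app=phi4-mini`. Once the llama3-8b pod is READY, deploy the second LLM (phi4-mini) in the same way and wait until its pod is also READY: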
+ +```bash +# Create vLLM service running phi4-mini +kubectl apply -f deploy/kubernetes/istio/vPhi4.yaml +``` + +At the end of this step both vLLM pods should be READY and serving their LLMs, which you can verify with the commands below. You should also see Kubernetes services exposing the IP/port on which these models are served. In the example below, the llama3-8b model is served via a Kubernetes service with a service IP of 10.108.250.109 and port 80. + +```bash +# Verify that vLLM pods running the two LLMs are READY and serving + +kubectl get pods +NAME READY STATUS RESTARTS AGE +llama-8b-57b95475bd-ph7s4 1/1 Running 0 9d +phi4-mini-887476b56-74twv 1/1 Running 0 9d + +# View the IP/port of the Kubernetes services on which these models are being served + +kubectl get service +NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE +kubernetes ClusterIP 10.96.0.1 443/TCP 36d +llama-8b ClusterIP 10.108.250.109 80/TCP 18d +phi4-mini ClusterIP 10.97.252.33 80/TCP 9d +``` + +## Step 3: Update vsr config if needed + +The file deploy/kubernetes/istio/config.yaml is used to configure vsr when it is installed in the next step. The example config file already provided in this repo should work if you use the same LLMs as in this exercise, but you can adjust it to enable or disable individual vsr features. Ensure that the vllm_endpoints in the file match the IP/port of the LLM services you are running. It is usually best to start with basic vsr features such as prompt classification and model routing before experimenting with the other features described elsewhere in the vsr documentation. + +## Step 4: Deploy vLLM Semantic Router + +Deploy the semantic router service with all required components: + +```bash +# Deploy semantic router using Kustomize +kubectl apply -k deploy/kubernetes/istio/ + +# Wait for deployment to be ready (this may take several minutes for model downloads) +kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s + +# Verify deployment status +kubectl get pods -n vllm-semantic-router-system +``` + +## Step 5: Install Istio Gateway, Gateway API, Inference Extension + +We will use a recent build of Istio for this exercise so that we also have the option of using the v1.0.0 GA version of the Gateway API Inference Extension CRDs and EPP functionality. + +Follow the procedures described in the Gateway API [Inference Extensions documentation](https://gateway-api-inference-extension.sigs.k8s.io/guides/) to deploy Istio Gateway 1.28 (or newer), the Kubernetes Gateway API CRDs, and the Gateway API Inference Extension v1.0.0. However, do not install any of the HTTPRoute resources from that guide; just use it to deploy the Istio gateway and the CRDs. If everything is installed correctly, you should see the CRDs for the Gateway API and the Inference Extension, as well as running pods for the Istio gateway and istiod, using the commands shown below.
+ +```bash +kubectl get crds | grep gateway +``` + +```bash +kubectl get crds | grep inference +``` + +```bash +kubectl get pods | grep istio +``` + +```bash +kubectl get pods -n istio-system +``` + +## Step 6: Install additional Istio configuration + +Install the DestinationRule and EnvoyFilter needed for the Istio gateway to use the ExtProc-based interface with the vLLM Semantic Router: + +```bash +kubectl apply -f deploy/kubernetes/istio/destinationrule.yaml +kubectl apply -f deploy/kubernetes/istio/envoyfilter.yaml +``` + +## Step 7: Install gateway routes + +Install the HTTPRoutes in the Istio gateway: + +```bash +kubectl apply -f deploy/kubernetes/istio/httproute-llama3-8b.yaml +kubectl apply -f deploy/kubernetes/istio/httproute-phi4-mini.yaml +``` + +## Testing the Deployment + +To expose the IP on which the Istio gateway listens for client requests from outside the cluster, you can choose any standard Kubernetes option for external load balancing. We tested this feature by [deploying and configuring MetalLB](https://metallb.universe.tf/installation/) in the cluster as the LoadBalancer provider; please refer to the MetalLB documentation for installation procedures if needed. Finally, for the minikube case, we get the external URL as shown below. + +```bash +minikube service inference-gateway-istio --url +http://192.168.49.2:30913 +``` + +Now we can send LLM prompts via curl to http://192.168.49.2:30913 to reach the Istio gateway, which will then use information from the vLLM Semantic Router to dynamically route each request to one of the two backend LLMs. + +### Send Test Requests + +Try the following cases with and without "auto" model selection to confirm that Istio and vsr together route queries to the appropriate model. The responses include information about which model was used to serve each request.
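Because each OpenAI-compatible completion response carries a `model` field, you can confirm the routing decision directly from the reply. A minimal sketch, assuming `jq` is installed and the gateway URL obtained above (the prompt here is just an illustrative math question):

```bash
# Send an "auto" request and print only the model that actually served it
curl -s http://192.168.49.2:30913/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "What is the derivative of x^2?"}],
    "max_tokens": 50
  }' | jq -r '.model'
```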
+ +Example queries to try include the following: + +```bash +# Model name llama3-8b provided explicitly, should route to this backend +curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{ + "model": "llama3-8b", + "messages": [ + {"role": "user", "content": "Linux is said to be an open source kernel because "} + ], + "max_tokens": 100, + "temperature": 0 + }' +``` + +```bash +# Model name set to "auto", should categorize to "computer science" & route to llama3-8b +curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{ + "model": "auto", + "messages": [ + {"role": "user", "content": "Linux is said to be an open source kernel because "} + ], + "max_tokens": 100, + "temperature": 0 + }' +``` + +```bash +# Model name phi4-mini provided explicitly, should route to this backend +curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{ + "model": "phi4-mini", + "messages": [ + {"role": "user", "content": "2+2 is "} + ], + "max_tokens": 100, + "temperature": 0 + }' +``` + +```bash +# Model name set to "auto", should categorize to "math" & route to phi4-mini +curl http://192.168.49.2:30913/v1/chat/completions -H "Content-Type: application/json" -d '{ + "model": "auto", + "messages": [ + {"role": "user", "content": "2+2 is "} + ], + "max_tokens": 100, + "temperature": 0 + }' +``` + +## Troubleshooting + +### Common Issues + +**Gateway/front end not working:** + +```bash +# Check Istio gateway status +kubectl get gateway + +# Check Istio gateway service status +kubectl get svc inference-gateway-istio + +# Check Istio's Envoy logs +kubectl logs deploy/inference-gateway-istio -c istio-proxy +``` + +**Semantic router not responding:** + +```bash +# Check semantic router pod +kubectl get pods -n vllm-semantic-router-system + +# Check semantic router service +kubectl get svc -n vllm-semantic-router-system + +# Check semantic router logs +kubectl logs -n vllm-semantic-router-system deployment/semantic-router +``` + +## Cleanup + +```bash + +# Remove semantic router +kubectl delete -k deploy/kubernetes/istio/ + +# Remove Istio +istioctl uninstall --purge + +# Remove LLMs +kubectl delete -f deploy/kubernetes/istio/vLlama3.yaml +kubectl delete -f deploy/kubernetes/istio/vPhi4.yaml + +# Stop minikube cluster +minikube stop + +# Delete minikube cluster +minikube delete +``` + +## Next Steps + +- Test and experiment with different features of vLLM Semantic Router +- Explore additional use cases/topologies with Istio Gateway (including with EPP and LLM-D) +- Set up monitoring and observability +- Implement authentication and authorization +- Scale the semantic router deployment for production workloads diff --git a/deploy/kubernetes/istio/config.yaml b/deploy/kubernetes/istio/config.yaml new file mode 100644 index 00000000..8ce78ab3 --- /dev/null +++ b/deploy/kubernetes/istio/config.yaml @@ -0,0 +1,188 @@ +bert_model: + model_id: sentence-transformers/all-MiniLM-L12-v2 + threshold: 0.6 + use_cpu: true + +semantic_cache: + enabled: false + backend_type: "memory" # Options: "memory" or "milvus" + similarity_threshold: 0.8 + max_entries: 1000 # Only applies to memory backend + ttl_seconds: 3600 + eviction_policy: "fifo" + +tools: + enabled: false + top_k: 3 + similarity_threshold: 0.2 + tools_db_path: "config/tools_db.json" + fallback_to_empty: true + +prompt_guard: + enabled: false + use_modernbert: true + model_id: "models/jailbreak_classifier_modernbert-base_model" + threshold: 0.7 + use_cpu: true +
jailbreak_mapping_path: "models/jailbreak_classifier_modernbert-base_model/jailbreak_type_mapping.json" + +# vLLM Endpoints Configuration +# IMPORTANT: 'address' field must be a valid IP address (IPv4 or IPv6) +# Supported formats: 127.0.0.1, 192.168.1.1, ::1, 2001:db8::1 +# NOT supported: domain names (example.com), protocol prefixes (http://), paths (/api), ports in address (use 'port' field) +vllm_endpoints: + - name: "endpoint1" + address: "10.104.192.205" # IPv4 address - REQUIRED format + port: 80 + models: + - "llama3-8b" + weight: 1 + - name: "endpoint2" + address: "10.99.27.202" # IPv4 address - REQUIRED format + port: 80 + models: + - "phi4-mini" + weight: 1 + +model_config: + "llama3-8b": + preferred_endpoints: ["endpoint1"] + pii_policy: + allow_by_default: false + "phi4-mini": + preferred_endpoints: ["endpoint2"] + pii_policy: + allow_by_default: false + +# Classifier configuration +classifier: + category_model: + model_id: "models/category_classifier_modernbert-base_model" + use_modernbert: true + threshold: 0.6 + use_cpu: true + category_mapping_path: "models/category_classifier_modernbert-base_model/category_mapping.json" + pii_model: + model_id: "models/pii_classifier_modernbert-base_presidio_token_model" + use_modernbert: true + threshold: 0.7 + use_cpu: true + pii_mapping_path: "models/pii_classifier_modernbert-base_presidio_token_model/pii_type_mapping.json" + +# Categories with new use_reasoning field structure +categories: + - name: business + model_scores: + - model: llama3-8b + score: 0.8 + use_reasoning: false # Business performs better without reasoning + - model: phi4-mini + score: 0.3 + use_reasoning: false # Business performs better without reasoning + - name: law + model_scores: + - model: llama3-8b + score: 0.7 + use_reasoning: false + - name: psychology + model_scores: + - model: llama3-8b + score: 0.7 + use_reasoning: false + - name: biology + model_scores: + - model: llama3-8b + score: 0.9 + use_reasoning: false + - name: chemistry + model_scores: + - model: llama3-8b + score: 0.6 + use_reasoning: false # Enable reasoning for complex chemistry + - name: history + model_scores: + - model: llama3-8b + score: 0.7 + use_reasoning: false + - name: other + model_scores: + - model: llama3-8b + score: 0.7 + use_reasoning: false + - name: health + model_scores: + - model: llama3-8b + score: 0.5 + use_reasoning: false + - name: economics + model_scores: + - model: llama3-8b + score: 0.8 + use_reasoning: false + - name: math + model_scores: + - model: phi4-mini + score: 0.8 + use_reasoning: false + - model: llama3-8b + score: 0.3 + use_reasoning: false + - name: physics + model_scores: + - model: llama3-8b + score: 0.7 + use_reasoning: false + - name: computer science + model_scores: + - model: llama3-8b + score: 0.7 + use_reasoning: false + - name: philosophy + model_scores: + - model: llama3-8b + score: 0.5 + use_reasoning: false + - name: engineering + model_scores: + - model: llama3-8b + score: 0.7 + use_reasoning: false + +default_model: llama3-8b + +# Reasoning family configurations +reasoning_families: + deepseek: + type: "chat_template_kwargs" + parameter: "thinking" + + qwen3: + type: "chat_template_kwargs" + parameter: "enable_thinking" + + gpt-oss: + type: "reasoning_effort" + parameter: "reasoning_effort" + gpt: + type: "reasoning_effort" + parameter: "reasoning_effort" + +# Global default reasoning effort level +default_reasoning_effort: high + +# API Configuration +api: + batch_classification: + max_batch_size: 100 + concurrency_threshold: 5 + 
max_concurrency: 8 + metrics: + enabled: true + detailed_goroutine_tracking: true + high_resolution_timing: false + sample_rate: 1.0 + duration_buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30] + size_buckets: [1, 2, 5, 10, 20, 50, 100, 200] + +# Gateway route cache clearing +clear_route_cache: true # Enable for testing diff --git a/deploy/kubernetes/istio/destinationrule.yaml b/deploy/kubernetes/istio/destinationrule.yaml new file mode 100644 index 00000000..e23c53b9 --- /dev/null +++ b/deploy/kubernetes/istio/destinationrule.yaml @@ -0,0 +1,12 @@ +apiVersion: networking.istio.io/v1 +kind: DestinationRule +metadata: + name: semantic-router + namespace: default +spec: + #host: semantic-router.default.svc.cluster.local + host: semantic-router.vllm-semantic-router-system.svc.cluster.local + trafficPolicy: + tls: + mode: DISABLE # TODO Change this once TLS is fully supported + insecureSkipVerify: true diff --git a/deploy/kubernetes/istio/envoyfilter.yaml b/deploy/kubernetes/istio/envoyfilter.yaml new file mode 100644 index 00000000..334f018a --- /dev/null +++ b/deploy/kubernetes/istio/envoyfilter.yaml @@ -0,0 +1,34 @@ +--- +apiVersion: networking.istio.io/v1alpha3 +kind: EnvoyFilter +metadata: + name: semantic-router + namespace: default +spec: + configPatches: + - applyTo: HTTP_FILTER + match: + # context omitted so that this applies to both sidecars and gateways + listener: + filterChain: + filter: + name: "envoy.filters.network.http_connection_manager" + patch: + operation: INSERT_FIRST + value: + name: envoy.filters.http.ext_proc + typed_config: + "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor + failure_mode_allow: true + allow_mode_override: true + message_timeout: 300s + processing_mode: + request_header_mode: "SEND" + response_header_mode: "SEND" + request_body_mode: "BUFFERED" + response_body_mode: "NONE" + request_trailer_mode: "SKIP" + response_trailer_mode: "SKIP" + grpc_service: + envoy_grpc: + cluster_name: outbound|50051||semantic-router.vllm-semantic-router-system.svc.cluster.local diff --git a/deploy/kubernetes/istio/httproute-llama3-8b.yaml b/deploy/kubernetes/istio/httproute-llama3-8b.yaml new file mode 100644 index 00000000..77fc5b58 --- /dev/null +++ b/deploy/kubernetes/istio/httproute-llama3-8b.yaml @@ -0,0 +1,26 @@ +--- +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: vsr-llama8b +spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway + rules: + - backendRefs: + - name: llama-8b # service llama-8b + namespace: default + port: 80 + matches: + - path: + type: PathPrefix + value: / + headers: + - type: Exact + name: x-selected-model + value: 'llama3-8b' + timeouts: + request: 300s + diff --git a/deploy/kubernetes/istio/httproute-phi4-mini.yaml b/deploy/kubernetes/istio/httproute-phi4-mini.yaml new file mode 100644 index 00000000..9efb2f8d --- /dev/null +++ b/deploy/kubernetes/istio/httproute-phi4-mini.yaml @@ -0,0 +1,26 @@ +--- +apiVersion: gateway.networking.k8s.io/v1 +kind: HTTPRoute +metadata: + name: vsr-phi4-mini +spec: + parentRefs: + - group: gateway.networking.k8s.io + kind: Gateway + name: inference-gateway + rules: + - backendRefs: + - name: phi4-mini # service phi4-mini + namespace: default + port: 80 + matches: + - path: + type: PathPrefix + value: / + headers: + - type: Exact + name: x-selected-model + value: 'phi4-mini' + timeouts: + request: 300s + diff --git a/deploy/kubernetes/istio/kustomization.yaml 
b/deploy/kubernetes/istio/kustomization.yaml new file mode 100644 index 00000000..138c6709 --- /dev/null +++ b/deploy/kubernetes/istio/kustomization.yaml @@ -0,0 +1,34 @@ +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +resources: +- namespace.yaml +- pvc.yaml +- deployment.yaml +- service.yaml + +# Patch deployment to disable secure mode for Istio +patches: +- patch: |- + - op: replace + path: /spec/template/spec/containers/0/args + value: ["--secure=false"] + - op: add + path: /spec/template/spec/containers/0/imagePullPolicy + value: "IfNotPresent" + target: + kind: Deployment + name: semantic-router + +# Override the ConfigMap to use the different config.yaml +configMapGenerator: + - name: semantic-router-config + files: + - config.yaml + - tools_db.json + +namespace: vllm-semantic-router-system + +images: +- name: ghcr.io/vllm-project/semantic-router/extproc + newTag: latest diff --git a/deploy/kubernetes/istio/tools_db.json b/deploy/kubernetes/istio/tools_db.json new file mode 100644 index 00000000..dccbf48a --- /dev/null +++ b/deploy/kubernetes/istio/tools_db.json @@ -0,0 +1,142 @@ +[ + { + "tool": { + "type": "function", + "function": { + "name": "get_weather", + "description": "Get current weather information for a location", + "parameters": { + "type": "object", + "properties": { + "location": { + "type": "string", + "description": "The city and state, e.g. San Francisco, CA" + }, + "unit": { + "type": "string", + "enum": ["celsius", "fahrenheit"], + "description": "Temperature unit" + } + }, + "required": ["location"] + } + } + }, + "description": "Get current weather information, temperature, conditions, forecast for any location, city, or place. Check weather today, now, current conditions, temperature, rain, sun, cloudy, hot, cold, storm, snow", + "category": "weather", + "tags": ["weather", "temperature", "forecast", "climate"] + }, + { + "tool": { + "type": "function", + "function": { + "name": "search_web", + "description": "Search the web for information", + "parameters": { + "type": "object", + "properties": { + "query": { + "type": "string", + "description": "The search query" + }, + "num_results": { + "type": "integer", + "description": "Number of results to return", + "default": 5 + } + }, + "required": ["query"] + } + } + }, + "description": "Search the internet, web search, find information online, browse web content, lookup, research, google, find answers, discover, investigate", + "category": "search", + "tags": ["search", "web", "internet", "information", "browse"] + }, + { + "tool": { + "type": "function", + "function": { + "name": "calculate", + "description": "Perform mathematical calculations", + "parameters": { + "type": "object", + "properties": { + "expression": { + "type": "string", + "description": "Mathematical expression to evaluate" + } + }, + "required": ["expression"] + } + } + }, + "description": "Calculate mathematical expressions, solve math problems, arithmetic operations, compute numbers, addition, subtraction, multiplication, division, equations, formula", + "category": "math", + "tags": ["math", "calculation", "arithmetic", "compute", "numbers"] + }, + { + "tool": { + "type": "function", + "function": { + "name": "send_email", + "description": "Send an email message", + "parameters": { + "type": "object", + "properties": { + "to": { + "type": "string", + "description": "Recipient email address" + }, + "subject": { + "type": "string", + "description": "Email subject" + }, + "body": { + "type": "string", + "description": 
"Email body content" + } + }, + "required": ["to", "subject", "body"] + } + } + }, + "description": "Send email messages, email communication, contact people via email, mail, message, correspondence, notify, inform", + "category": "communication", + "tags": ["email", "send", "communication", "message", "contact"] + }, + { + "tool": { + "type": "function", + "function": { + "name": "create_calendar_event", + "description": "Create a new calendar event or appointment", + "parameters": { + "type": "object", + "properties": { + "title": { + "type": "string", + "description": "Event title" + }, + "date": { + "type": "string", + "description": "Event date in YYYY-MM-DD format" + }, + "time": { + "type": "string", + "description": "Event time in HH:MM format" + }, + "duration": { + "type": "integer", + "description": "Duration in minutes" + } + }, + "required": ["title", "date", "time"] + } + } + }, + "description": "Schedule meetings, create calendar events, set appointments, manage calendar, book time, plan meeting, organize schedule, reminder, agenda", + "category": "productivity", + "tags": ["calendar", "event", "meeting", "appointment", "schedule"] + } +] \ No newline at end of file diff --git a/deploy/kubernetes/istio/vLlama3.yaml b/deploy/kubernetes/istio/vLlama3.yaml new file mode 100644 index 00000000..9c3f2a78 --- /dev/null +++ b/deploy/kubernetes/istio/vLlama3.yaml @@ -0,0 +1,100 @@ +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: llama-8b + namespace: default +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 40Gi +# storageClassName: default + volumeMode: Filesystem +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: llama-8b + namespace: default + labels: + app: vllm-llama3-8b-instruct +spec: + replicas: 1 + selector: + matchLabels: + app: vllm-llama3-8b-instruct + template: + metadata: + labels: + app: vllm-llama3-8b-instruct + spec: + volumes: + - name: cache-volume + persistentVolumeClaim: + claimName: llama-8b + # vLLM needs to access the host's shared memory for tensor parallel inference. 
+ # - name: shm + # emptyDir: + # medium: Memory + # sizeLimit: "2Gi" + containers: + - name: llama-8b + image: vllm/vllm-openai:latest + command: ["/bin/sh", "-c"] + args: [ + "vllm serve meta-llama/Llama-3.1-8B-Instruct --served-model-name llama3-8b meta-llama/Llama-3.1-8B-Instruct --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024" + ] + env: + - name: HUGGING_FACE_HUB_TOKEN + valueFrom: + secretKeyRef: + name: hf-token-secret + key: token + ports: + - containerPort: 8000 + resources: + limits: + # cpu: "10" + # memory: 40G + nvidia.com/gpu: "1" + requests: + # cpu: "10" + # memory: 40Gi + nvidia.com/gpu: "1" + volumeMounts: + - mountPath: /root/.cache/huggingface + name: cache-volume + # - name: shm + # mountPath: /dev/shm + livenessProbe: + httpGet: + path: /health + port: 8000 + initialDelaySeconds: 600 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /health + port: 8000 + initialDelaySeconds: 600 + periodSeconds: 5 +--- +apiVersion: v1 +kind: Service +metadata: + name: llama-8b + namespace: default +spec: + ports: + - name: http-llama-8b + port: 80 + protocol: TCP + targetPort: 8000 + # The label selector should match the deployment labels & it is useful for prefix caching feature + selector: + app: vllm-llama3-8b-instruct + sessionAffinity: None + type: ClusterIP + diff --git a/deploy/kubernetes/istio/vPhi4.yaml b/deploy/kubernetes/istio/vPhi4.yaml new file mode 100644 index 00000000..5befc779 --- /dev/null +++ b/deploy/kubernetes/istio/vPhi4.yaml @@ -0,0 +1,100 @@ +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: phi4-mini + namespace: default +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 20Gi +# storageClassName: default + volumeMode: Filesystem +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: phi4-mini + namespace: default + labels: + app: phi4-mini +spec: + replicas: 1 + selector: + matchLabels: + app: phi4-mini + template: + metadata: + labels: + app: phi4-mini + spec: + volumes: + - name: cache-volume + persistentVolumeClaim: + claimName: phi4-mini + # vLLM needs to access the host's shared memory for tensor parallel inference. 
+ # - name: shm + # emptyDir: + # medium: Memory + # sizeLimit: "2Gi" + containers: + - name: phi4-mini + image: vllm/vllm-openai:latest + command: ["/bin/sh", "-c"] + args: [ + "vllm serve microsoft/Phi-4-mini-instruct --served-model-name phi4-mini microsoft/Phi-4-mini-instruct --trust-remote-code --enable-chunked-prefill" + ] + env: + - name: HUGGING_FACE_HUB_TOKEN + valueFrom: + secretKeyRef: + name: hf-token-secret + key: token + ports: + - containerPort: 8000 + resources: + limits: + # cpu: "10" + # memory: 40G + nvidia.com/gpu: "1" + requests: + # cpu: "10" + # memory: 40Gi + nvidia.com/gpu: "1" + volumeMounts: + - mountPath: /root/.cache/huggingface + name: cache-volume + # - name: shm + # mountPath: /dev/shm + livenessProbe: + httpGet: + path: /health + port: 8000 + initialDelaySeconds: 600 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /health + port: 8000 + initialDelaySeconds: 600 + periodSeconds: 5 +--- +apiVersion: v1 +kind: Service +metadata: + name: phi4-mini + namespace: default +spec: + ports: + - name: http-phi4-mini + port: 80 + protocol: TCP + targetPort: 8000 + # The label selector should match the deployment labels & it is useful for prefix caching feature + selector: + app: phi4-mini + sessionAffinity: None + type: ClusterIP + diff --git a/src/semantic-router/pkg/config/config.go b/src/semantic-router/pkg/config/config.go index 6720c6a0..a16bf2ae 100644 --- a/src/semantic-router/pkg/config/config.go +++ b/src/semantic-router/pkg/config/config.go @@ -90,6 +90,9 @@ type RouterConfig struct { // Observability configuration for tracing, metrics, and logging Observability ObservabilityConfig `yaml:"observability"` + + // Gateway route cache clearing + ClearRouteCache bool `yaml:"clear_route_cache"` } // APIConfig represents configuration for API endpoints diff --git a/src/semantic-router/pkg/extproc/request_handler.go b/src/semantic-router/pkg/extproc/request_handler.go index 7a762aa7..8cd466ed 100644 --- a/src/semantic-router/pkg/extproc/request_handler.go +++ b/src/semantic-router/pkg/extproc/request_handler.go @@ -74,6 +74,12 @@ func serializeOpenAIRequestWithStream(req *openai.ChatCompletionNewParams, hasSt return sdkBytes, nil } +// shouldClearRouteCache checks if route cache should be cleared +func (r *OpenAIRouter) shouldClearRouteCache() bool { + // Check if feature is enabled + return r.Config.ClearRouteCache +} + // addSystemPromptToRequestBody adds a system prompt to the beginning of the messages array in the JSON request body // Returns the modified body, whether the system prompt was actually injected, and any error func addSystemPromptToRequestBody(requestBody []byte, systemPrompt string, mode string) ([]byte, bool, error) { @@ -894,16 +900,31 @@ func (r *OpenAIRouter) handleModelRouting(openAIRequest *openai.ChatCompletionNe }, }) } + // Set x-selected-model header for non-auto models + setHeaders = append(setHeaders, &core.HeaderValueOption{ + Header: &core.HeaderValue{ + Key: "x-selected-model", + RawValue: []byte(originalModel), + }, + }) + // Create CommonResponse with cache clearing if enabled + commonResponse := &ext_proc.CommonResponse{ + Status: ext_proc.CommonResponse_CONTINUE, + HeaderMutation: &ext_proc.HeaderMutation{ + SetHeaders: setHeaders, + }, + } + + // Check if route cache should be cleared + if r.shouldClearRouteCache() { + commonResponse.ClearRouteCache = true + } + // Set the response with body mutation and content-length removal response = &ext_proc.ProcessingResponse{ Response: &ext_proc.ProcessingResponse_RequestBody{ 
RequestBody: &ext_proc.BodyResponse{ - Response: &ext_proc.CommonResponse{ - Status: ext_proc.CommonResponse_CONTINUE, - HeaderMutation: &ext_proc.HeaderMutation{ - SetHeaders: setHeaders, - }, - }, + Response: commonResponse, }, }, } @@ -922,6 +943,15 @@ func (r *OpenAIRouter) handleModelRouting(openAIRequest *openai.ChatCompletionNe metrics.RecordRoutingReasonCode("model_specified", originalModel) } + // Check if route cache should be cleared (only for auto models, non-auto models handle this in their own path) + if originalModel == "auto" && r.shouldClearRouteCache() { + // Access the CommonResponse that's already created in this function + if response.GetRequestBody() != nil && response.GetRequestBody().GetResponse() != nil { + response.GetRequestBody().GetResponse().ClearRouteCache = true + observability.Debugf("Setting ClearRouteCache=true (feature enabled) for auto model") + } + } + // Save the actual model that will be used for token tracking ctx.RequestModel = actualModel @@ -1075,15 +1105,24 @@ func (r *OpenAIRouter) updateRequestWithTools(openAIRequest *openai.ChatCompleti SetHeaders: setHeaders, } + // Create CommonResponse + commonResponse := &ext_proc.CommonResponse{ + Status: ext_proc.CommonResponse_CONTINUE, + HeaderMutation: headerMutation, + BodyMutation: bodyMutation, + } + + // Check if route cache should be cleared + if r.shouldClearRouteCache() { + commonResponse.ClearRouteCache = true + observability.Debugf("Setting ClearRouteCache=true (feature enabled) in updateRequestWithTools") + } + // Update the response with body mutation and content-length removal *response = &ext_proc.ProcessingResponse{ Response: &ext_proc.ProcessingResponse_RequestBody{ RequestBody: &ext_proc.BodyResponse{ - Response: &ext_proc.CommonResponse{ - Status: ext_proc.CommonResponse_CONTINUE, - HeaderMutation: headerMutation, - BodyMutation: bodyMutation, - }, + Response: commonResponse, }, }, }