v0.4.0
Overview
We are thrilled to announce the v0.4.0 release—our biggest update yet! This version brings powerful new Endpoint Picker (EPP) scheduler capabilities, performance improvements, and initial Gateway conformance tests.
Major Highlights
- Modular Endpoint Picker (EPP) Scheduler: A kube-scheduler-style plugin API lets you build custom routing logic, filter and score backends, or swap in new picker strategies without touching core code.
- Prefix-Cache-Aware Routing: Lower tail latency and improve response times under load by routing requests to model servers that are likely to already hold the request's prompt prefix in cache.
- Richer Metrics: Gain deeper insights with new metrics, including:
  - NTPOT (Normalized Time Per Output Token)
  - Scheduler latency
  - Per-pod queue depth
  - Build and version info
- Optional vLLM Simulator Backend: Spin up a lightweight simulator for local development and testing, with no real model servers required.
- Initial Conformance Tests: Validate your controller’s behavior with end-to-end tests covering InferencePool, InferenceModel, HTTPRoute, and more.
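
To make the filter, score, and pick flow from the scheduler highlight concrete, here is a minimal, self-contained sketch in Go. It is not the project's plugin API: `Pod`, `Filter`, `Scorer`, `WeightedScorer`, `maxQueueFilter`, `queueScorer`, `prefixScorer`, and `pick` are all hypothetical names invented for illustration, and the prefix check is a simple string comparison standing in for whatever cache tracking the real scheduler does.

```go
package main

import (
	"fmt"
	"strings"
)

// Pod is a hypothetical view of a model-server endpoint (not the project's type).
type Pod struct {
	Name         string
	QueueDepth   int
	CachedPrefix string // prompt prefix assumed to be cached on this pod
}

// Filter drops pods that should not receive the request.
type Filter func(prompt string, pods []Pod) []Pod

// Scorer returns a score in [0, 1]; higher is better.
type Scorer func(prompt string, pod Pod) float64

// WeightedScorer pairs a scorer with its relative weight.
type WeightedScorer struct {
	Score  Scorer
	Weight float64
}

// maxQueueFilter keeps pods whose queue depth is at most limit.
func maxQueueFilter(limit int) Filter {
	return func(_ string, pods []Pod) []Pod {
		var kept []Pod
		for _, p := range pods {
			if p.QueueDepth <= limit {
				kept = append(kept, p)
			}
		}
		return kept
	}
}

// queueScorer prefers pods with shorter queues.
func queueScorer(_ string, p Pod) float64 {
	return 1.0 / float64(1+p.QueueDepth)
}

// prefixScorer returns 1 when the prompt starts with the pod's cached prefix.
func prefixScorer(prompt string, p Pod) float64 {
	if p.CachedPrefix != "" && strings.HasPrefix(prompt, p.CachedPrefix) {
		return 1
	}
	return 0
}

// pick runs all filters, sums the weighted scores, and returns the max-score pod.
// The boolean is false when no pod survives filtering.
func pick(prompt string, pods []Pod, filters []Filter, scorers []WeightedScorer) (Pod, bool) {
	for _, f := range filters {
		pods = f(prompt, pods)
	}
	if len(pods) == 0 {
		return Pod{}, false
	}
	best, bestScore := pods[0], -1.0
	for _, p := range pods {
		total := 0.0
		for _, s := range scorers {
			total += s.Weight * s.Score(prompt, p)
		}
		if total > bestScore {
			best, bestScore = p, total
		}
	}
	return best, true
}

func main() {
	pods := []Pod{
		{Name: "pod-a", QueueDepth: 0},
		{Name: "pod-b", QueueDepth: 3, CachedPrefix: "You are a helpful"},
		{Name: "pod-c", QueueDepth: 50, CachedPrefix: "You are a helpful"},
	}
	filters := []Filter{maxQueueFilter(10)}
	scorers := []WeightedScorer{
		{Score: queueScorer, Weight: 1},
		{Score: prefixScorer, Weight: 2},
	}
	if p, ok := pick("You are a helpful assistant.", pods, filters, scorers); ok {
		fmt.Println("picked:", p.Name)
	}
}
```

In this sketch the prefix scorer carries twice the weight of the queue scorer, so a pod with a matching cached prefix and a moderate queue wins over an idle pod with no cache hit, while the filter still excludes a badly overloaded pod regardless of its cache. Swapping the weights or the picker changes the routing policy without touching the pipeline itself, which is the design idea the plugin model is described as enabling.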
What's Changed
- Adding larger logo by @robscott in #630
- Minor fixes to the user guide by @nicolexin in #633
- Add istio to implementations.md by @LiorLieberman in #631
- Update e2e test config by @kfswain in #636
- Fix parsing issue in BBR helm by @rramkumar1 in #638
- fixed bug - sleep is expecting to get a string by @nirrozenbaum in #618
- #632 Add favicon for doc site by @Conor0Callaghan in #634
- Move integration test utils to central package by @rramkumar1 in #626
- BBR readme fixes by @rramkumar1 in #640
- Add integration tests to exercise streaming mode in BBR by @rramkumar1 in #627
- Adding 2 new reviewers to the reviewers alias by @kfswain in #644
- Add initial implementer's guide by @nicolexin in #635
- Update BBR istio.yaml to use FULL_DUPLEX_STREAMED mode by @rramkumar1 in #629
- Docs: Bumps Kgateway to v2.0.0 by @danehans in #646
- remove deprecated v1alpha2.AddToScheme and use v1alpha2.Install instead by @nirrozenbaum in #649
- removed time.sleep and using ticker instead by @nirrozenbaum in #648
- update release version in README by @nirrozenbaum in #653
- fix some issues in e2e tests by @nirrozenbaum in #621
- Refactor scheduler to make it more readable by @liu-cong in #645
- Getting started docs version bump by @SachinVarghese in #654
- expose "Normalized Time Per Output Token" (NTPOT) metric by @kaushikmitr in #643
- Bump github.com/onsi/ginkgo/v2 from 2.23.3 to 2.23.4 by @dependabot in #657
- Bump google.golang.org/grpc from 1.71.0 to 1.71.1 by @dependabot in #658
- Fix links and description in implementations.md by @xiaolin593 in #650
- fix manifests and description in the user guides by @cr7258 in #652
- Bump github.com/onsi/gomega from 1.36.3 to 1.37.0 by @dependabot in #659
- adjust the gpu deployment to increase max batch size by @ahg-g in #642
- Cleaning up config pkg by @ahg-g in #663
- Rename pkg/body-based-routing to pkg/bbr by @rramkumar1 in #664
- deploy: Enable logging for GKE gateway by default by @smarterclayton in #666
- moved IsPodReady func to podutils by @nirrozenbaum in #662
- removed double loop on docs in hermetic test by @nirrozenbaum in #668
- fix bbr dockerfile that was broken in PR #664 by @nirrozenbaum in #669
- Use dedicated namespace for e2e test code by @rramkumar1 in #661
- cleaning up inferencePool helm docs by @ahg-g in #665
- move inf model IsCritial func out of datastore by @nirrozenbaum in #670
- Consolidating down to FULL_DUPLEX_STREAMED supported ext-proc server by @kfswain in #672
- Document model server compatibility and config options by @liu-cong in #537
- Bump github.com/prometheus/client_model from 0.6.1 to 0.6.2 by @dependabot in #687
- Bump github.com/prometheus/client_golang from 1.21.1 to 1.22.0 by @dependabot in #688
- added badges to README by @nirrozenbaum in #682
- Bump sigs.k8s.io/structured-merge-diff/v4 from 4.6.0 to 4.7.0 by @dependabot in #686
- docs(gateways): fix Envoy AI Gateway link by @maxbrunet in #700
- minor changes in few places by @nirrozenbaum in #702
- Docs: Adds Kgateway Cleanup to Quickstart by @danehans in #701
- using namespaced name by @nirrozenbaum in #707
- EPP Architecture proposal by @kfswain in #683
- removed unused Fake struct by @nirrozenbaum in #723
- epp: return correct response for trailers by @howardjohn in #726
- Refactor scheduler to run plugins by @liu-cong in #677
- Complete the InferencePool documentation by @nicolexin in #673
- reduce log level in metrics logger not to trash the log by @nirrozenbaum in #708
- few updates in datastore by @nirrozenbaum in #713
- scheduler restructuring by @nirrozenbaum in #730
- filter irrelevant pods in pod controller by @nayihz in #696
- EPP: Update GetRandomPod() to return nil if no pods exist by @danehans in #731
- Move filter and scorer plugins registration to a separate file by @mayabar in #729
- Update issue templates by @kfswain in #738
- docs: add concepts and definitions to README.md by @shaneutt in #734
- Add unit tests for pod APIs under pkg/datastore by @rlakhtakia in #712
- added a target dedicated for running unit-test only by @nirrozenbaum in #739
- Updating proposal directories to match their PR number by @kfswain in #741
- Fixing errors in new template & disabling the default blank template by @kfswain in #742
- fixed broken link to implementations by @nirrozenbaum in #750
- Weighted scorers by @nirrozenbaum in #737
- add max score picker by @nirrozenbaum in #752
- Add GetEnvString helper function by @liu-cong in #758
- Bump the kubernetes group with 6 updates by @dependabot in #754
- extract pod representation from backend/metrics to backend by @nirrozenbaum in #751
- Request for adding Alibaba Cloud Container Service for Kubernetes (ACK) into implementations by @delavet in #748
- fixed error message in scheduler when no pods are available by @nirrozenbaum in #759
- feat: Initial setup for conformance test suite by @SinaChavoshi in #720
- Move scheduler initialization up to the main by @liu-cong in #757
- Add inference_extension_info metric for project metadata by @JeffLuoo in #744
- chore: make SchedulerConfig fields configurable by @shaneutt in #764
- fix: pass commit hash from the cloud build default variable by @JeffLuoo in #763
- Small refactor to capture request data for route. by @kfswain in #765
- Add queue and kv-cache scorers by @liu-cong in #762
- Add scheduler e2e latency metric by @liu-cong in #767
- Parse request x-request-id and expose it in contextual logger by @delavet in #746
- put SchedulerConfig fields private again. added NewSchedulerConfig func by @nirrozenbaum in #771
- Create unit test for request handler by @rlakhtakia in #745
- Add feature request link for adding Triton LoRA metric by @liu-cong in #773
- remove EndpointSlice from RBAC by @nirrozenbaum in #774
- passing headers to scheduler plugins by @nirrozenbaum in #775
- add labels to pod metadata for the use of scheduler plugins by @nirrozenbaum in #779
- Update istio version by @LiorLieberman in #780
- feat: Add metric that records length of queue for each model server pods by @JeffLuoo in #776
- chore: update golang.google.org/grpc dep from v1.71.1 to v1.72.0 by @shaneutt in #777
- docs: fixed inference pool docs by @capri-xiyue in #784
- Bump sigs.k8s.io/gateway-api from 1.2.1 to 1.3.0 by @dependabot in #785
- Healthcheck fix by @kfswain in #788
- Docs: Updates Benchmark Guide by @danehans in #789
- e2e: Fixes 404 Not Found Error by @danehans in #793
- EPP architectural refactor by @kfswain in #781
- remove empty request_test.go file. by @nirrozenbaum in #796
- Clean up filters by @liu-cong in #802
- Refactor: Improve env utility by @LukeAVanDrie in #803
- refactor scheduler filters package by @nirrozenbaum in #797
- fix labels not cloned bug by @nirrozenbaum in #804
- fixed datastore bug to clean all go routines when pool is unset by @nirrozenbaum in #810
- Optimize Dockerfile for Multiple Extensions by @GunaKKIBM in #811
- merge has capacity filter with sheddable filter. by @nirrozenbaum in #809
- feat(conformance): Add initial InferencePool tests and shared Gateway setup by @SinaChavoshi in #772
- Add prefix cache aware scheduling by @liu-cong in #768
- merge functions in env utils by @nirrozenbaum in #819
- generalize scheduling cycle state concept by @nirrozenbaum in #818
- remove Model field from LLMRequest by @nirrozenbaum in #782
- feat: Add support to invoke PostResponse plugins by @shmuelk in #800
- Add prefix aware request scheduling proposal by @liu-cong in #602
- Docs: Bumps Kgateway to v2.0.2 by @danehans in #823
- renamed Metrics to MetricsState and move to a separate file by @nirrozenbaum in #822
- feat: Add build reference to the info metrics by @JeffLuoo in #817
- Introduce SaturationDetector component by @LukeAVanDrie in #808
- support extracting prompt from chat completions API by @delavet in #798
- Fix Test Flakiness by adding short sleep in TestMetricsRefresh by @LukeAVanDrie in #824
- chore(conformance): Add timeout configuration by @SinaChavoshi in #795
- Scheduler subsystem high level design proposal by @smarterclayton in #603
- Updating top level readme by @kfswain in #831
- Meeting is at 10am, not 8 by @alexsnaps in #836
- docs: roll out guide by @capri-xiyue in #829
- reduce log level of "prefix cached servers" to TRACE by @nirrozenbaum in #842
- add regression testing docs by @kaushikmitr in #755
- fixed log before picker by @nirrozenbaum in #844
- Reorganize scheduling plugins by @liu-cong in #837
- updated godoc on scheduler filters, pickers and prefix plugin by @nirrozenbaum in #850
- Fix: Ignore header order in hermetic test by @LukeAVanDrie in #849
- Bump the kubernetes group with 6 updates by @dependabot in #851
- Bump github.com/prometheus/common from 0.63.0 to 0.64.0 by @dependabot in #853
- Updating readme to reflect llm-d collab! by @kfswain in #855
- fix: typo ('endpoing' -> 'endpoint') by @t3hmrman in #857
- Updating readme wording by @kfswain in #858
- adding logging & support for better Client response by @kfswain in #847
- Adding util func for splitting large bodies into chunks by @kfswain in #859
- Scheduler config refactor for simplifying plugins registration by @nirrozenbaum in #835
- Chunk implementation by @kfswain in #860
- feat: merge two metric servers by @nayihz in #728
- docs: added examples to address various generative AI application scenarios by using gateway api inference extension by @capri-xiyue in #812
- docs: Update link to Slack channel by @terrytangyuan in #867
- Multi cycle scheduler by @nirrozenbaum in #862
- feat(conformance): Add test for HTTPRouteInvalidInferencePoolRef by @SinaChavoshi in #807
- feat(conformance): tests for inferencepool_resolvedrefs_condition by @SinaChavoshi in #832
- Update 002-api-proposal/ to reflect api/v1alpha2 inferencePool and InferenceModel by @shotarok in #870
- use namespacedname instead of name/namespace as separate args in tests by @nirrozenbaum in #873
- remove the PreCycle plugin from scheduler by @nirrozenbaum in #876
- feat(conformance): Update InferencePoolResolvedRefsCondition test for E2E request validation by @SinaChavoshi in #866
- minor changes to saturation detector by @nirrozenbaum in #882
- updated controller-runtime to v0.21.0 and its dependencies by @nirrozenbaum in #890
- fix: broken ext-proc links by @Xunzhuo in #894
- Initial Scheduler Subsystem interface by @kfswain in #845
- fix(README): typo on dashboard by @EyalPazz in #904
- Fix typos and lint errors to pass golangci-lint by @shotarok in #902
- docs: update Istio gateway name for consistency by @shotarok in #903
- chor(conformance): fix header and remove extra comments by @SinaChavoshi in #883
- Tools: Fixes test-e2e.sh script by @danehans in #900
- Amend the endpoint picker protocol to support multiple fallback endpoints by @wbpcode in #761
- remove SchedulingContext, flatten scheduler interfaces by @nirrozenbaum in #889
- Boilerplate verification to ensure LICENSE information is present by @bharathbrat in #880
- Update the Cleanup section for Istio in Getting Started by @shotarok in #906
- Refactor: Externalize Scheduler's saturation logic and criticality-based service differentiation by @LukeAVanDrie in #805
- chore(deps): bump the kubernetes group with 6 updates by @dependabot in #908
- chore(deps): bump github.com/go-logr/logr from 1.4.2 to 1.4.3 by @dependabot in #909
- Adds vLLM Simulator Support by @danehans in #898
- fixed typo in makefile by @nirrozenbaum in #913
- test chat completions api in e2e case by @delavet in #868
- added GetEnvBool function and unit-tests by @nirrozenbaum in #916
- [Refactor] Simplify hermetic test setup for EPP by @LukeAVanDrie in #917
- move PostResponse plugins to requestcontrol instead of scheduler by @nirrozenbaum in #914
- renamed interface from PostResponsePlugin to PostResponse by @nirrozenbaum in #919
- Remove redundant SheddableCapacityFilter. by @LukeAVanDrie in #910
- Add the option to specify epp env vars in helm chart by @liu-cong in #924
- remove Critical boolean from scheduling request by @nirrozenbaum in #921
- Add prefix cache plugin configuration guide by @liu-cong in #923
- added context argument to scheduling profile picker by @nirrozenbaum in #926
- Bumps vLLM Simulator Tag by @danehans in #930
- feat(Conformance): Add a header based filter to make a controllable epp behavior determined by request header. by @zetxqx in #922
- Docs: Fixes Meeting Recording Link by @danehans in #931
- metrics: Add documentation for sample alert rules by @JeffLuoo in #912
- scheduler proposal continuation by @nirrozenbaum in #905
- Changes to multi-model guide in documentation by @elevran in #941
- docs: use inference gateway terminology by @capri-xiyue in #891
- scheduler redesign continuation by @nirrozenbaum in #937
- fix: Mark alert block as yaml to fix syntax error by @JeffLuoo in #954
- docs: Try to polish the go doc comments for InferenceModelSpec by @waltforme in #948
- docs: dashboards README metrics link fix by @EyalPazz in #952
- chore: update golang.google.org/grpc dep from v1.71.1 to v1.72.0 by @ahg-g in #965
- Pin PyPI package versions in requirements.txt by @shotarok in #963
- Fix the DestinationRule's referenced model service name by @keithmattix in #958
- moved main code to runner package under epp/cmd by @nirrozenbaum in #956
New Contributors
- @Conor0Callaghan made their first contribution in #634
- @SachinVarghese made their first contribution in #654
- @xiaolin593 made their first contribution in #650
- @cr7258 made their first contribution in #652
- @maxbrunet made their first contribution in #700
- @howardjohn made their first contribution in #726
- @nayihz made their first contribution in #696
- @mayabar made their first contribution in #729
- @shaneutt made their first contribution in #734
- @rlakhtakia made their first contribution in #712
- @delavet made their first contribution in #748
- @SinaChavoshi made their first contribution in #720
- @capri-xiyue made their first contribution in #784
- @LukeAVanDrie made their first contribution in #803
- @GunaKKIBM made their first contribution in #811
- @shmuelk made their first contribution in #800
- @alexsnaps made their first contribution in #836
- @t3hmrman made their first contribution in #857
- @shotarok made their first contribution in #870
- @EyalPazz made their first contribution in #904
- @wbpcode made their first contribution in #761
- @bharathbrat made their first contribution in #880
- @zetxqx made their first contribution in #922
- @elevran made their first contribution in #941
- @waltforme made their first contribution in #948
Full Changelog: v0.3.0...v0.4.0