Releases: huggingface/text-embeddings-inference
v1.9.3
What's Changed
- Use `rust-toolchain.toml` before `rustup` on `Dockerfile-{cuda,cuda-all}` by @alvarobartt in #842
- fix(backend): replace bare except with Exception in device check by @llukito in #821
- Set `version` to 1.9.3 by @alvarobartt in #849
New Contributors
- @llukito made their first contribution in #821
Full Changelog: v1.9.2...v1.9.3
v1.9.2
What's Changed
- Fix auto-truncate false setting by @vrdn-23 in #836
- Set `pad_token_id` as nullable & add support for `rope_parameters` by @alvarobartt in #832
- docs: add Homebrew installation to README by @Peredery in #834
- feat: support pplx-embed-v1 by @mkrimmel-pplx in #824
New Contributors
- @Peredery made their first contribution in #834
- @mkrimmel-pplx made their first contribution in #824
Full Changelog: v1.9.1...v1.9.2
v1.9.1
What's Changed
🚨 Fix
- Fix support for containers w/ CUDA 13.0+ by @alvarobartt in #831
When releasing `ghcr.io/huggingface/text-embeddings-inference:cuda-1.9` with CUDA 12.9 and `cuda-compat-12-9`, running that same container on instances with CUDA 13.0+ failed: the `cuda-compat-12-9` path set in `LD_LIBRARY_PATH` led to `CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803`. This is now solved with a custom entrypoint that dynamically includes `cuda-compat` on the `LD_LIBRARY_PATH` depending on the instance's CUDA version.
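For illustration, a rough Python sketch of the idea behind that entrypoint (the actual entrypoint is a shell script in the image; the compat path and threshold below are assumptions):

```python
import ctypes
import os

# Ask the NVIDIA driver which CUDA version it supports; cuDriverGetVersion
# reports it as 1000 * major + 10 * minor (e.g. 13000 for CUDA 13.0).
libcuda = ctypes.CDLL("libcuda.so.1")
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))

# Only prepend the cuda-compat libraries when the host driver is older than
# the CUDA 12.9 toolkit the image was built with; on CUDA 13.0+ drivers the
# compat libraries trigger CUDA_ERROR_SYSTEM_DRIVER_MISMATCH (803).
COMPAT_DIR = "/usr/local/cuda/compat"  # hypothetical location
if version.value < 12090:
    os.environ["LD_LIBRARY_PATH"] = (
        COMPAT_DIR + ":" + os.environ.get("LD_LIBRARY_PATH", "")
    )
```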
Full Changelog: v1.9.0...v1.9.1
v1.9.0
What's Changed
🚨 Breaking changes
The default GeLU implementation is now the GeLU + tanh approximation instead of exact GeLU (a.k.a. GeLU erf), so that CPU and CUDA embeddings match (cuBLASLt only supports GeLU + tanh). This is a slight misalignment with how Transformers handles it: when `hidden_act="gelu"` is set in `config.json`, GeLU erf should be used. The numerical differences between GeLU + tanh and GeLU erf should have a negligible impact on inference quality.
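For reference, a minimal sketch of the two formulations (the standard erf-based GeLU versus the tanh approximation that is now the default):

```python
import math

def gelu_erf(x: float) -> float:
    # Exact GeLU, as Transformers uses when hidden_act="gelu"
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation; per the note above, the variant cuBLASLt supports
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

# The difference is tiny across typical activation ranges
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, gelu_erf(x), gelu_tanh(x))
```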
- Set `--auto-truncate` to `true` by default by @alvarobartt in #829

  `--auto-truncate` now defaults to `true`, meaning that sequences are truncated to the lower value between `--max-batch-tokens` and the maximum model length, which handles the case where `--max-batch-tokens` is lower than the actual maximum supported length.
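In client terms, this means over-long inputs are now truncated server-side instead of being rejected. A hedged sketch (port and input are assumptions; the payload follows the standard TEI `/embed` route):

```python
import requests

# With --auto-truncate defaulting to true, an over-long input is truncated to
# min(--max-batch-tokens, model max length) instead of returning a validation
# error; previously you had to opt in per request via the "truncate" field.
payload = {"inputs": "very long document " * 5000}
resp = requests.post("http://localhost:8080/embed", json=payload)
resp.raise_for_status()
embedding = resp.json()[0]  # the embedding of the (truncated) input
```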
🎉 Additions
- Add `--served-model-name` for OpenAI requests via HTTP by @vrdn-23 in #685 (see the sketch after this list)
- Extend `download_onnx` to download sharded ONNX by @alvarobartt in #817
- Add support for llama 2 by @michaelfeil in #802
- Add support for blackwell architecture (sm100, sm120) by @danielealbano in #735
- Mf/add-support-for-llama-3-and-nemotron by @michaelfeil in #805
- Add support for DebertaV2 by @vrdn-23 in #746
- Add bidirectional attention and projection layer support for Qwen3-based models by @williambarberjr in #808
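As a sketch of the `--served-model-name` addition (the flag value, port, and response shape are assumptions based on the OpenAI-compatible embeddings format):

```python
import requests

# Assuming the router was started with: --served-model-name my-embedder
payload = {
    "model": "my-embedder",  # matched against --served-model-name
    "input": ["What is deep learning?"],
}
resp = requests.post("http://localhost:8080/v1/embeddings", json=payload)
resp.raise_for_status()
print(resp.json()["data"][0]["embedding"][:4])
```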
🐛 Fixes
- Fix reading non-standard config for `past_key_values` in ONNX by @alvarobartt in #751
- Fix `TruncationDirection` to deserialize from lowercase and capitalized by @alvarobartt in #755
- Fix `sagemaker-entrypoint*` & remove SageMaker and Vertex from `Dockerfile*` by @alvarobartt in #699
- Bug: Critical accuracy bugs for `model_type=qwen2`: no causal attention and wrong tokenizer by @michaelfeil in #762
- Fix `config.json` reading w/ aliases for ORT by @alvarobartt in #786
- Fix HTTP error code for validation by @vrdn-23 in #818
- Fix to acquire the permit in a blocking way by @kozistr in #726
- Read Hugging Face Hub token from cache if not provided by @alvarobartt in #814
- Align the `normalize` param between the gRPC and HTTP /embed interfaces by @kozistr in #810
⚡ Improvements
- Serialization in tokio thread instead of blocking thread, 50% reduction in latency for small models by @michaelfeil in #767
- Remove default `--model-id` argument by @alvarobartt in #679
- feat: better Tokenization # workers heuristic by @michaelfeil in #766
- add faster index select kernel by @michaelfeil in #773
- feat: speedup Parallel safetensors download by @michaelfeil in #765
- feat: startup time: add cloned tokenizer fix, saves ~1-20s cold start time by @michaelfeil in #772
- Adjust the warmup phase for CPU by @kozistr in #792
📄 Other
- Skip Gemma3 tests when `HF_TOKEN` not set by @alvarobartt in #812
- Bump Rust 1.92, CUDA 12.6, Ubuntu 24.04 and add `Dockerfile-cuda-blackwell-all` by @alvarobartt in #823
- Update `rustc` version to 1.92.0 by @alvarobartt in #826
- Add `use_flash_attn` for better FA + FA2 feature gating by @alvarobartt in #825
- Update CUDA to 12.9 w/ `cuda-compat-12-9` by @alvarobartt in #828
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #782
- Lint: cargo fmt and clippy fix warnings by @michaelfeil in #776
- Fix `rustfmt` on `backend/candle/tests/*.rs` files by @alvarobartt in #800
- Upgrade GitHub Actions to latest versions by @salmanmkc in #783
- Update `version` to 1.9.0 by @alvarobartt in #830
🆕 New Contributors
- @salmanmkc made their first contribution in #782
- @danielealbano made their first contribution in #735
- @williambarberjr made their first contribution in #808
Full Changelog: v1.8.3...v1.9.0
v1.8.3
What's Changed
Bug Fixes
- Fix error code for empty requests by @vrdn-23 in #727
- Fix the infinite loop when `max_input_length` is bigger than `max-batch-tokens` by @kozistr in #725
- Fix reading `modules.json` for `Dense` modules in local models by @alvarobartt in #738
Tests, Documentation & Release
- Add `test_gemma3.rs` for EmbeddingGemma by @alvarobartt in #718
- Fix OpenAI client usage example for embeddings by @ZahraDehghani99 in #720
- Handle `HF_TOKEN` in `ApiBuilder` for `candle/tests` by @alvarobartt in #724
- Fix `cargo install` commands for `candle` with CUDA by @alvarobartt in #719
- Update `version` to 1.8.3 by @alvarobartt in #745
New Contributors
- @ZahraDehghani99 made their first contribution in #720
- @vrdn-23 made their first contribution in #727
Full Changelog: v1.8.2...v1.8.3
v1.8.2
🔧 Fixed Intel MKL Support
Since Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the `candle` dependency. Neither `static-linking` nor `dynamic-linking` worked correctly, which caused models using Intel MKL on CPU to fail with errors such as: "Intel oneMKL ERROR: Parameter 13 was incorrect on entry to SGEMM".
Starting with v1.8.2, this issue has been resolved by fixing how the `intel-mkl-src` dependency is defined. Both features, `static-linking` and `dynamic-linking` (the default), now work correctly, ensuring that Intel MKL libraries are properly linked.
This issue occurred in the following scenarios:
- Users installing `text-embeddings-router` via `cargo` with the `--features mkl` flag. Although `dynamic-linking` should have been used, it was not working as intended.
- Users relying on the CPU `Dockerfile` when running models without ONNX weights. In these cases, Safetensors weights were used with `candle` as the backend (with MKL optimizations), instead of `ort`.
The following table shows the affected versions and containers:
| Version | Image |
|---|---|
| 1.7.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0 |
| 1.7.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.1 |
| 1.7.2 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2 |
| 1.7.3 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.3 |
| 1.7.4 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.4 |
| 1.8.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.0 |
| 1.8.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 |
More details: PR #715
Full Changelog: v1.8.1...v1.8.2
v1.8.1
Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.
- CPU:

  ```bash
  docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
      --model-id google/embeddinggemma-300m --dtype float32
  ```

- CPU with ONNX Runtime:

  ```bash
  docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
      --model-id onnx-community/embeddinggemma-300m-ONNX --dtype float32 --pooling mean
  ```

- NVIDIA CUDA:

  ```bash
  docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.1 \
      --model-id google/embeddinggemma-300m --dtype float32
  ```

Notable Changes
- Add support for Gemma3 (text-only) architecture
- Intel updates to Synapse 1.21.3 and IPEX 2.8
- Extend ONNX Runtime support in `OrtBackend`:
  - Support `position_ids` and `past_key_values` as inputs
  - Handle `padding_side` and `pad_token_id`
What's Changed
- Adjust HPU warmup: use dummy inputs with shape more close to real scenario by @kaixuanliu in #689
- Add `extra_args` to `trufflehog` to exclude unverified results by @alvarobartt in #696
- Update GitHub templates & fix mentions to Text Embeddings Inference by @alvarobartt in #697
- Disable Flash Attention with `USE_FLASH_ATTENTION` by @alvarobartt in #692
- Add support for `position_ids` and `past_key_values` in `OrtBackend` by @alvarobartt in #700
- HPU upgrade to Synapse 1.21.3 by @kaixuanliu in #703
- Upgrade to IPEX 2.8 by @kaixuanliu in #702
- Parse `modules.json` to identify default `Dense` modules by @alvarobartt in #701
- Add `padding_side` and `pad_token_id` in `OrtBackend` by @alvarobartt in #705
- Update `docs/openapi.json` for v1.8.0 by @alvarobartt in #708
- Add Gemma3 architecture (text-only) by @alvarobartt in #711
- Update `version` to 1.8.1 by @alvarobartt in #712
Full Changelog: v1.8.0...v1.8.1
v1.8.0
Notable Changes
- Qwen3 support for 0.6B, 4B and 8B on CPU, MPS, and FlashQwen3 on CUDA and Intel HPUs
- NomicBert MoE support
- JinaAI Re-Rankers V1 support
- Matryoshka Representation Learning (MRL) (see the sketch after the note below)
- Dense layer module support (after pooling)
Note
Some of the aforementioned changes were already released in patch versions on top of v1.7.0, while both Matryoshka Representation Learning (MRL) and Dense layer module support are included here for the first time.
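Matryoshka-style truncation can be sketched client-side (a hedged illustration, not TEI's internal implementation): an MRL-trained embedding keeps most of its quality when cut to a prefix of its dimensions and re-normalized.

```python
import numpy as np

# Stand-in for a full-size, unit-normalized embedding returned by the server.
rng = np.random.default_rng(0)
full = rng.standard_normal(768).astype(np.float32)
full /= np.linalg.norm(full)

def matryoshka_truncate(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    head = embedding[:dims]
    return head / np.linalg.norm(head)

small = matryoshka_truncate(full, 256)  # a 256-d view of the same embedding
```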
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to another repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
- Support MRL (Matryoshka Representation Learning) by @kozistr in #676
- Add `Dense` layer for `2_Dense/` modules by @alvarobartt in #660
- Update `version` to 1.8.0 by @alvarobartt in #686
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
- @randomm made their first contribution in #632
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.0...v1.8.0
v1.7.4
Notable Changes
Qwen3 was not working correctly on CPU / MPS when sending batched requests with FP16 precision. The FP32 minimum value used for the attention bias overflowed when downcast to FP16 (it is now manually set to the FP16 minimum value instead), producing null values, and a `to_dtype` call on the `attention_bias` was missing when working with batches.
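A small numpy illustration of the failure mode (not TEI's code): casting the FP32 minimum to FP16 overflows to `-inf`, and a fully `-inf` attention-bias row turns into NaNs after softmax.

```python
import numpy as np

f32_min = np.finfo(np.float32).min   # -3.4028235e+38
print(np.float16(f32_min))           # -inf: overflows the FP16 range

# A masked row filled with -inf softmaxes to 0/0 = NaN, i.e. "null values"
# (numpy emits overflow/invalid-value warnings along the way).
row = np.array([np.float16(f32_min)] * 4)
probs = np.exp(row) / np.exp(row).sum()
print(probs)                         # [nan nan nan nan]

# Using the FP16 minimum instead stays finite and still acts as a large
# negative bias that softmaxes to ~0.
f16_min = np.finfo(np.float16).min   # -65504.0
print(np.exp(np.float16(f16_min)))   # 0.0, without ever producing -inf
```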
What's Changed
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
Full Changelog: v1.7.3...v1.7.4
v1.7.3
Notable Changes
Qwen3 support was added for Intel HPU and fixed for CPU / Metal / CUDA.
What's Changed
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
New Contributors
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.2...v1.7.3