Releases: huggingface/text-embeddings-inference
v1.9.3
What's Changed
- Use `rust-toolchain.toml` before `rustup` on `Dockerfile-{cuda,cuda-all}` by @alvarobartt in #842
- fix(backend): replace bare except with Exception in device check by @llukito in #821
- Set `version` to 1.9.3 by @alvarobartt in #849
New Contributors
- @llukito made their first contribution in #821
Full Changelog: v1.9.2...v1.9.3
v1.9.2
What's Changed
- Fix auto-truncate false setting by @vrdn-23 in #836
- Set `pad_token_id` as nullable & add support for `rope_parameters` by @alvarobartt in #832
- docs: add Homebrew installation to README by @Peredery in #834
- feat: support pplx-embed-v1 by @mkrimmel-pplx in #824
New Contributors
- @Peredery made their first contribution in #834
- @mkrimmel-pplx made their first contribution in #824
Full Changelog: v1.9.1...v1.9.2
v1.9.1
What's Changed
🚨 Fix
- Fix support for containers w/ CUDA 13.0+ by @alvarobartt in #831
When releasing `ghcr.io/huggingface/text-embeddings-inference:cuda-1.9` with CUDA 12.9 and `cuda-compat-12-9`, running that same container on instances with CUDA 13.0+ failed: the `cuda-compat-12-9` path set in `LD_LIBRARY_PATH` led to `CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803`. This is now solved with a custom entrypoint that dynamically includes `cuda-compat` on the `LD_LIBRARY_PATH` depending on the instance's CUDA version.
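For illustration, a rough Python sketch of the idea behind that entrypoint (the actual entrypoint is a shell script in the image; the compat path and threshold below are assumptions):

```python
import ctypes
import os

# Ask the NVIDIA driver which CUDA version it supports; cuDriverGetVersion
# reports it as 1000 * major + 10 * minor (e.g. 13000 for CUDA 13.0).
libcuda = ctypes.CDLL("libcuda.so.1")
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))

# Only prepend the cuda-compat libraries when the host driver is older than
# the CUDA 12.9 toolkit the image was built with; on CUDA 13.0+ drivers the
# compat libraries trigger CUDA_ERROR_SYSTEM_DRIVER_MISMATCH (803).
COMPAT_DIR = "/usr/local/cuda/compat"  # hypothetical location
if version.value < 12090:
    os.environ["LD_LIBRARY_PATH"] = (
        COMPAT_DIR + ":" + os.environ.get("LD_LIBRARY_PATH", "")
    )
```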
Full Changelog: v1.9.0...v1.9.1
v1.9.0
What's Changed
🚨 Breaking changes
The default GeLU implementation is now the GeLU + tanh approximation instead of exact GeLU (a.k.a. GeLU erf), so that CPU and CUDA embeddings match (cuBLASLt only supports GeLU + tanh). This is a slight misalignment with how Transformers handles it: when `hidden_act="gelu"` is set in `config.json`, GeLU erf should be used. The numerical differences between GeLU + tanh and GeLU erf should have a negligible impact on inference quality.
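For reference, a minimal sketch of the two formulations (the standard erf-based GeLU versus the tanh approximation that is now the default):

```python
import math

def gelu_erf(x: float) -> float:
    # Exact GeLU, as Transformers uses when hidden_act="gelu"
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation; per the note above, the variant cuBLASLt supports
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

# The difference is tiny across typical activation ranges
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, gelu_erf(x), gelu_tanh(x))
```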
- Set `--auto-truncate` to `true` by default by @alvarobartt in #829

  `--auto-truncate` now defaults to `true`, meaning that sequences are truncated to the lower value between `--max-batch-tokens` and the maximum model length, which handles the case where `--max-batch-tokens` is lower than the actual maximum supported length.
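In client terms, this means over-long inputs are now truncated server-side instead of being rejected. A hedged sketch (port and input are assumptions; the payload follows the standard TEI `/embed` route):

```python
import requests

# With --auto-truncate defaulting to true, an over-long input is truncated to
# min(--max-batch-tokens, model max length) instead of returning a validation
# error; previously you had to opt in per request via the "truncate" field.
payload = {"inputs": "very long document " * 5000}
resp = requests.post("http://localhost:8080/embed", json=payload)
resp.raise_for_status()
embedding = resp.json()[0]  # the embedding of the (truncated) input
```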
🎉 Additions
- Add `--served-model-name` for OpenAI requests via HTTP by @vrdn-23 in #685 (see the sketch after this list)
- Extend `download_onnx` to download sharded ONNX by @alvarobartt in #817
- Add support for llama 2 by @michaelfeil in #802
- Add support for blackwell architecture (sm100, sm120) by @danielealbano in #735
- Mf/add-support-for-llama-3-and-nemotron by @michaelfeil in #805
- Add support for DebertaV2 by @vrdn-23 in #746
- Add bidirectional attention and projection layer support for Qwen3-based models by @williambarberjr in #808
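As a sketch of the `--served-model-name` addition (the flag value, port, and response shape are assumptions based on the OpenAI-compatible embeddings format):

```python
import requests

# Assuming the router was started with: --served-model-name my-embedder
payload = {
    "model": "my-embedder",  # matched against --served-model-name
    "input": ["What is deep learning?"],
}
resp = requests.post("http://localhost:8080/v1/embeddings", json=payload)
resp.raise_for_status()
print(resp.json()["data"][0]["embedding"][:4])
```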
🐛 Fixes
- Fix reading non-standard config for `past_key_values` in ONNX by @alvarobartt in #751
- Fix `TruncationDirection` to deserialize from lowercase and capitalized by @alvarobartt in #755
- Fix `sagemaker-entrypoint*` & remove SageMaker and Vertex from `Dockerfile*` by @alvarobartt in #699
- Bug: Critical accuracy bugs for `model_type=qwen2`: no causal attention and wrong tokenizer by @michaelfeil in #762
- Fix `config.json` reading w/ aliases for ORT by @alvarobartt in #786
- Fix HTTP error code for validation by @vrdn-23 in #818
- Fix to acquire the permit in a blocking way by @kozistr in #726
- Read Hugging Face Hub token from cache if not provided by @alvarobartt in #814
- Align the `normalize` param between the gRPC and HTTP /embed interfaces by @kozistr in #810
⚡ Improvements
- Serialization in tokio thread instead of blocking thread, 50% reduction in latency for small models by @michaelfeil in #767
- Remove default `--model-id` argument by @alvarobartt in #679
- feat: better Tokenization # workers heuristic by @michaelfeil in #766
- add faster index select kernel by @michaelfeil in #773
- feat: speedup Parallel safetensors download by @michaelfeil in #765
- feat: startup time: add cloned tokenizer fix, saves ~1-20s cold start time by @michaelfeil in #772
- Adjust the warmup phase for CPU by @kozistr in #792
📄 Other
- Skip Gemma3 tests when `HF_TOKEN` not set by @alvarobartt in #812
- Bump Rust 1.92, CUDA 12.6, Ubuntu 24.04 and add `Dockerfile-cuda-blackwell-all` by @alvarobartt in #823
- Update `rustc` version to 1.92.0 by @alvarobartt in #826
- Add `use_flash_attn` for better FA + FA2 feature gating by @alvarobartt in #825
- Update CUDA to 12.9 w/ `cuda-compat-12-9` by @alvarobartt in #828
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #782
- Lint: cargo fmt and clippy fix warnings by @michaelfeil in #776
- Fix `rustfmt` on `backend/candle/tests/*.rs` files by @alvarobartt in #800
- Upgrade GitHub Actions to latest versions by @salmanmkc in #783
- Update `version` to 1.9.0 by @alvarobartt in #830
🆕 New Contributors
- @salmanmkc made their first contribution in #782
- @danielealbano made their first contribution in #735
- @williambarberjr made their first contribution in #808
Full Changelog: v1.8.3...v1.9.0
v1.8.3
What's Changed
Bug Fixes
- Fix error code for empty requests by @vrdn-23 in #727
- Fix the infinite loop when `max_input_length` is bigger than `max-batch-tokens` by @kozistr in #725
- Fix reading `modules.json` for `Dense` modules in local models by @alvarobartt in #738
Tests, Documentation & Release
- Add `test_gemma3.rs` for EmbeddingGemma by @alvarobartt in #718
- Fix OpenAI client usage example for embeddings by @ZahraDehghani99 in #720
- Handle `HF_TOKEN` in `ApiBuilder` for `candle/tests` by @alvarobartt in #724
- Fix `cargo install` commands for `candle` with CUDA by @alvarobartt in #719
- Update `version` to 1.8.3 by @alvarobartt in #745
New Contributors
- @ZahraDehghani99 made their first contribution in #720
- @vrdn-23 made their first contribution in #727
Full Changelog: v1.8.2...v1.8.3
v1.8.2
🔧 Fixed Intel MKL Support
Since Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the `candle` dependency. Neither `static-linking` nor `dynamic-linking` worked correctly, which caused models using Intel MKL on CPU to fail with errors such as: "Intel oneMKL ERROR: Parameter 13 was incorrect on entry to SGEMM".
Starting with v1.8.2, this issue has been resolved by fixing how the `intel-mkl-src` dependency is defined. Both features, `static-linking` and `dynamic-linking` (the default), now work correctly, ensuring that Intel MKL libraries are properly linked.
This issue occurred in the following scenarios:
- Users installing `text-embeddings-router` via `cargo` with the `--features mkl` flag. Although `dynamic-linking` should have been used, it was not working as intended.
- Users relying on the CPU `Dockerfile` when running models without ONNX weights. In these cases, Safetensors weights were used with `candle` as the backend (with MKL optimizations), instead of `ort`.
The following table shows the affected versions and containers:
| Version | Image |
|---|---|
| 1.7.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0 |
| 1.7.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.1 |
| 1.7.2 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2 |
| 1.7.3 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.3 |
| 1.7.4 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.4 |
| 1.8.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.0 |
| 1.8.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 |
More details: PR #715
Full Changelog: v1.8.1...v1.8.2
v1.8.1
Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.
- CPU:

  ```bash
  docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
      --model-id google/embeddinggemma-300m --dtype float32
  ```

- CPU with ONNX Runtime:

  ```bash
  docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
      --model-id onnx-community/embeddinggemma-300m-ONNX --dtype float32 --pooling mean
  ```

- NVIDIA CUDA:

  ```bash
  docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.1 \
      --model-id google/embeddinggemma-300m --dtype float32
  ```

Notable Changes
- Add support for Gemma3 (text-only) architecture
- Intel updates to Synapse 1.21.3 and IPEX 2.8
- Extend ONNX Runtime support in `OrtBackend`:
  - Support `position_ids` and `past_key_values` as inputs
  - Handle `padding_side` and `pad_token_id`
What's Changed
- Adjust HPU warmup: use dummy inputs with shape more close to real scenario by @kaixuanliu in #689
- Add `extra_args` to `trufflehog` to exclude unverified results by @alvarobartt in #696
- Update GitHub templates & fix mentions to Text Embeddings Inference by @alvarobartt in #697
- Disable Flash Attention with `USE_FLASH_ATTENTION` by @alvarobartt in #692
- Add support for `position_ids` and `past_key_values` in `OrtBackend` by @alvarobartt in #700
- HPU upgrade to Synapse 1.21.3 by @kaixuanliu in #703
- Upgrade to IPEX 2.8 by @kaixuanliu in #702
- Parse `modules.json` to identify default `Dense` modules by @alvarobartt in #701
- Add `padding_side` and `pad_token_id` in `OrtBackend` by @alvarobartt in #705
- Update `docs/openapi.json` for v1.8.0 by @alvarobartt in #708
- Add Gemma3 architecture (text-only) by @alvarobartt in #711
- Update `version` to 1.8.1 by @alvarobartt in #712
Full Changelog: v1.8.0...v1.8.1
v1.8.0
Notable Changes
- Qwen3 support for 0.6B, 4B and 8B on CPU, MPS, and FlashQwen3 on CUDA and Intel HPUs
- NomicBert MoE support
- JinaAI Re-Rankers V1 support
- Matryoshka Representation Learning (MRL) (see the sketch after the note below)
- Dense layer module support (after pooling)
Note
Some of the aforementioned changes were already released in patch versions on top of v1.7.0, while both Matryoshka Representation Learning (MRL) and Dense layer module support are included here for the first time.
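Matryoshka-style truncation can be sketched client-side (a hedged illustration, not TEI's internal implementation): an MRL-trained embedding keeps most of its quality when cut to a prefix of its dimensions and re-normalized.

```python
import numpy as np

# Stand-in for a full-size, unit-normalized embedding returned by the server.
rng = np.random.default_rng(0)
full = rng.standard_normal(768).astype(np.float32)
full /= np.linalg.norm(full)

def matryoshka_truncate(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    head = embedding[:dims]
    return head / np.linalg.norm(head)

small = matryoshka_truncate(full, 256)  # a 256-d view of the same embedding
```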
What's Changed
- [Docs] Update quick tour by @NielsRogge in #574
- Update `README.md` and `supported_models.md` by @alvarobartt in #572
- Back with linting. by @Narsil in #577
- [Docs] Add cloud run example by @NielsRogge in #573
- Fixup by @Narsil in #578
- Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in #576
- Removing requirements file. by @Narsil in #585
- Removing candle-extensions to live on crates.io by @Narsil in #583
- Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in #586
- optimize the performance of FlashBert Path for HPU by @kaixuanliu in #575
- Revert "Removing requirements file. (#585)" by @Narsil in #588
- Get opentelemetry trace id from request headers by @kozistr in #425
- Add argument for configuring Prometheus port by @kozistr in #589
- Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in #591
- Fixing the CI (grpc path). by @Narsil in #593
- fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in #595
- enable flash mistral model for HPU device by @kaixuanliu in #594
- remove optimum-habana dependency by @kaixuanliu in #599
- Support NomicBert MoE by @kozistr in #596
- Remove duplicate short option '-p' to fix router executable by @cebtenzzre in #602
- Update `text-embeddings-router --help` output by @alvarobartt in #603
- Warmup padded models too. by @Narsil in #592
- Add support for JinaAI Re-Rankers V1 by @alvarobartt in #582
- Gte diffs by @Narsil in #604
- Fix the weight name in GTEClassificationHead by @kozistr in #606
- upgrade pytorch and ipex to 2.7 version by @kaixuanliu in #607
- upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in #608
- Patch DistilBERT variants with different weight keys by @alvarobartt in #614
- add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to another repository by @kaixuanliu in #612
- Add mean pooling strategy for Modernbert classifier by @kwnath in #616
- Using serde for pool validation. by @Narsil in #620
- Preparing the update to 1.7.1 by @Narsil in #623
- Adding suggestions to fixing missing ONNX files. by @Narsil in #624
- Add `Qwen3Model` by @alvarobartt in #627
- Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in #631
- Add CPU support for Qwen3-Embedding models by @randomm in #632
- refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in #625
- Support Qwen3 w/ fp32 on GPU by @kozistr in #634
- Preparing the release. by @Narsil in #639
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
- Support MRL (Matryoshka Representation Learning) by @kozistr in #676
- Add `Dense` layer for `2_Dense/` modules by @alvarobartt in #660
- Update `version` to 1.8.0 by @alvarobartt in #686
New Contributors
- @NielsRogge made their first contribution in #574
- @cebtenzzre made their first contribution in #602
- @kwnath made their first contribution in #616
- @randomm made their first contribution in #632
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.0...v1.8.0
v1.7.4
Notable Changes
Qwen3 was not working correctly on CPU / MPS when sending batched requests with FP16 precision. The FP32 minimum value used for the attention bias overflowed when downcast to FP16 (it is now manually set to the FP16 minimum value instead), producing null values, and a `to_dtype` call on the `attention_bias` was missing when working with batches.
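A small numpy illustration of the failure mode (not TEI's code): casting the FP32 minimum to FP16 overflows to `-inf`, and a fully `-inf` attention-bias row turns into NaNs after softmax.

```python
import numpy as np

f32_min = np.finfo(np.float32).min   # -3.4028235e+38
print(np.float16(f32_min))           # -inf: overflows the FP16 range

# A masked row filled with -inf softmaxes to 0/0 = NaN, i.e. "null values"
# (numpy emits overflow/invalid-value warnings along the way).
row = np.array([np.float16(f32_min)] * 4)
probs = np.exp(row) / np.exp(row).sum()
print(probs)                         # [nan nan nan nan]

# Using the FP16 minimum instead stays finite and still acts as a large
# negative bias that softmaxes to ~0.
f16_min = np.finfo(np.float16).min   # -65504.0
print(np.exp(np.float16(f16_min)))   # 0.0, without ever producing -inf
```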
What's Changed
- Fix Qwen3 Embedding Float16 DType by @tpendragon in #663
- Fix `fmt` by re-running `pre-commit` by @alvarobartt in #671
- Update `version` to 1.7.4 by @alvarobartt in #677
Full Changelog: v1.7.3...v1.7.4
v1.7.3
Notable Changes
Qwen3 support was added for Intel HPU and fixed for CPU / Metal / CUDA.
What's Changed
- Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in #641
- Fix Qwen3 by @kozistr in #646
- Add integration tests for Gaudi by @baptistecolle in #598
- Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in #648
- Fix FlashQwen3 by @kozistr in #650
- Make flake work on metal by @Narsil in #654
- Fixing metal backend. by @Narsil in #655
- Qwen3 hpu support by @kaixuanliu in #656
- change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in #659
- Update `version` to 1.7.3 by @alvarobartt in #666
- Add last token pooling support for ORT. by @tpendragon in #664
New Contributors
- @lance-miles made their first contribution in #648
- @tpendragon made their first contribution in #664
Full Changelog: v1.7.2...v1.7.3