This framework provides an easy method for accessing, using, and training state-of-the-art embedding and reranker models. It can be used to compute embeddings using Sentence Transformer models (quickstart) or to calculate similarity scores using Cross-Encoder (a.k.a. reranker) models (quickstart). This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.
A wide selection of over 10,000 pre-trained Sentence Transformers models are available for immediate use on 🤗 Hugging Face, including many of the state-of-the-art models from the Massive Text Embeddings Benchmark (MTEB) leaderboard. Additionally, it is easy to train or finetune your own embedding models or reranker models using Sentence Transformers, enabling you to create custom models for your specific use cases.
For the full documentation, see www.SBERT.net.
We recommend Python 3.9+, PyTorch 1.11.0+, and transformers v4.34.0+.
Install with pip
pip install -U sentence-transformers
Install with conda
conda install -c conda-forge sentence-transformers
Install from sources
Alternatively, you can also clone the latest version from the repository and install it directly from the source code:
pip install -e .
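For example, assuming you are working from a clone of the official UKPLab/sentence-transformers repository, the full sequence might look like this:
git clone https://github.com/UKPLab/sentence-transformers
cd sentence-transformers
pip install -e .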
PyTorch with CUDA
If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.
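As a rough sketch, a CUDA-enabled PyTorch build can typically be installed from the wheel index published on the PyTorch website; the CUDA tag used here (cu121) is only an example and depends on your driver and setup:
pip install torch --index-url https://download.pytorch.org/whl/cu121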
See Quickstart in our documentation.
First, download a pretrained embedding (a.k.a. Sentence Transformer) model.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
Then provide some texts to the model.
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# => (3, 384)
And that's already it. We now have a numpy array with the embeddings, one row per text. We can use these to compute similarities.
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
# [0.6660, 1.0000, 0.1411],
# [0.1046, 0.1411, 1.0000]])
First, download a pretrained reranker (a.k.a. Cross Encoder) model.
from sentence_transformers import CrossEncoder
# 1. Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
Then provide some texts to the model.
# The texts for which to predict similarity scores
query = "How many people live in Berlin?"
passages = [
"Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
"Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.",
"In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.",
]
# 2a. Predict scores for pairs of texts
scores = model.predict([(query, passage) for passage in passages])
print(scores)
# => [8.607139 5.506266 6.352977]
And we're good to go. You can also use model.rank to avoid having to perform the reranking manually:
# 2b. Rank a list of passages for a query
ranks = model.rank(query, passages, return_documents=True)
print("Query:", query)
for rank in ranks:
print(f"- #{rank['corpus_id']} ({rank['score']:.2f}): {rank['text']}")
"""
Query: How many people live in Berlin?
- #0 (8.61): Berlin had a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.
- #2 (6.35): In 2013 around 600,000 Berliners were registered in one of the more than 2,300 sport and fitness clubs.
- #1 (5.51): Berlin has a yearly total of about 135 million day visitors, making it one of the most-visited cities in the European Union.
"""
We provide a large list of pretrained models for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases.
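Loading a specific pretrained model only requires passing its name. As a minimal sketch, the following loads a multilingual general-purpose model; any other model name from the Hugging Face Hub works the same way:

from sentence_transformers import SentenceTransformer

# Multilingual general-purpose model; replace with any model name from the Hub
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(["Guten Morgen!", "Good morning!"])
print(embeddings.shape)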
This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get perfect sentence embeddings for your specific task.
- Embedding Models
- Reranker Models
Some highlights across both types of training are:
- Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
- Multi-Lingual and multi-task learning
- Evaluation during training to find optimal model
- 20+ loss functions for embedding models and 10+ loss functions for reranker models, allowing you to tune models specifically for semantic search, paraphrase mining, semantic similarity comparison, clustering, triplet loss, contrastive loss, etc.
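To make the embedding-model side of this concrete, here is a minimal fine-tuning sketch using the SentenceTransformerTrainer with MultipleNegativesRankingLoss; the tiny in-memory dataset and the output path are made up for illustration, and a real run would use a proper dataset and training arguments:

from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Start from an existing embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Tiny illustrative dataset of (anchor, positive) pairs
train_dataset = Dataset.from_dict({
    "anchor": ["How many people live in Berlin?", "What is the capital of France?"],
    "positive": [
        "Berlin had a population of 3,520,031 registered inhabitants.",
        "Paris is the capital and largest city of France.",
    ],
})

# In-batch negatives loss, well suited for (anchor, positive) pairs
loss = losses.MultipleNegativesRankingLoss(model)

trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()

model.save_pretrained("models/my-finetuned-model")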
You can use this framework for:
- Computing Sentence Embeddings
- Semantic Textual Similarity
- Semantic Search
- Retrieve & Re-Rank
- Clustering
- Paraphrase Mining
- Translated Sentence Mining
- Multilingual Image Search, Clustering & Duplicate Detection
and many more use-cases.
For all examples, see examples/sentence_transformer/applications.
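As one concrete example among these use cases, a minimal semantic search sketch could look like the following; the corpus and query are made up, and util.semantic_search returns the top-scoring corpus entries for each query:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Made-up corpus and query
corpus = [
    "Berlin is the capital of Germany.",
    "The weather is lovely today.",
    "He drove to the stadium.",
]
query = "What is the capital of Germany?"

corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the two most similar corpus entries for the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{hit['score']:.4f}  {corpus[hit['corpus_id']]}")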
After cloning the repo (or a fork) to your machine, in a virtual environment, run:
python -m pip install -e ".[dev]"
pre-commit install
To test your changes, run:
pytest
If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:
@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}
Please have a look at Publications for our different publications that are integrated into SentenceTransformers.
Maintainer: Tom Aarsen, 🤗 Hugging Face
https://www.ukp.tu-darmstadt.de/
Don't hesitate to open an issue if something is broken (and it shouldn't be) or if you have further questions.
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.