MTEB Paper resources

This repository contains scripts and resources for the MTEB paper. Some scripts rely on a results folder, which can be obtained via git clone https://huggingface.co/datasets/mteb/results. The scripts target the MTEB 1.0.0 release that was current when the paper was published and are unlikely to work with the latest version of MTEB; they exist solely to ease reproduction of the original paper. For scripts and resources that work with the latest version, refer to the MTEB repository and open any issues with MTEB there. Issues concerning the original MTEB paper can be opened here.

Talks
Benchmark
Env Setup
Model setup
- Download
- Load

Talks

Link to 12min presentation on MTEB by Niklas Muennighoff
Link to 5min presentation on MTEB by Nils Reimers

Benchmark

Basic with Internet

from mteb import MTEB
from sentence_transformers import SentenceTransformer
model_path = "/gpfswork/rech/six/commun/models/Muennighoff_SGPT-125M-weightedmean-nli-bitfit"
model_name = model_path.split("/")[-1].split("_")[-1]
model = SentenceTransformer(model_path)
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder=f"results/{model_name}")
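
The `model_name` line above derives the results subfolder from the local directory name. A minimal sketch of that step in isolation, assuming the `<org>_<model>` directory naming used in this README; the helper name `model_name_from_path` is hypothetical and not part of MTEB:

```python
def model_name_from_path(model_path: str) -> str:
    """Return the model name from a local '<org>_<model>' directory path."""
    # Last path component, e.g. "Muennighoff_SGPT-125M-weightedmean-nli-bitfit"
    directory = model_path.rstrip("/").split("/")[-1]
    # Keep the text after the last underscore, as in the snippet above
    return directory.split("_")[-1]


path = "/gpfswork/rech/six/commun/models/Muennighoff_SGPT-125M-weightedmean-nli-bitfit"
print(model_name_from_path(path))  # SGPT-125M-weightedmean-nli-bitfit
```

Because the split keeps only the text after the last underscore, a model whose own name contains an underscore would be truncated; `split("_", 1)[-1]` is one way to avoid that when the organisation name has no underscore.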

No Internet Access (Download data first)

import os
os.environ["HF_DATASETS_OFFLINE"]="1" # 1 for offline
os.environ["TRANSFORMERS_OFFLINE"]="1" # 1 for offline
os.environ["TRANSFORMERS_CACHE"]="/gpfswork/rech/six/commun/models"
os.environ["HF_DATASETS_CACHE"]="/gpfswork/rech/six/commun/datasets"
os.environ["HF_MODULES_CACHE"]="/gpfswork/rech/six/commun/modules"
os.environ["HF_METRICS_CACHE"]="/gpfswork/rech/six/commun/metrics"
from mteb import MTEB
from sentence_transformers import SentenceTransformer
model_path = "/gpfswork/rech/six/commun/models/Muennighoff_SGPT-125M-weightedmean-nli-bitfit"
model_name = model_path.split("/")[-1].split("_")[-1]
model = SentenceTransformer(model_path)
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder=f"results/{model_name}")
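
The environment variables are set before `mteb` and `sentence_transformers` are imported because the Hugging Face libraries typically read them at import time. A minimal sketch that groups the settings into one call, assuming the `models`, `datasets`, `modules` and `metrics` subfolders shown above share a common root; `configure_offline` is a hypothetical helper, not part of any library:

```python
import os


def configure_offline(cache_root: str) -> dict:
    """Set the offline flags and cache paths used above; return what was set."""
    settings = {
        "HF_DATASETS_OFFLINE": "1",   # 1 for offline
        "TRANSFORMERS_OFFLINE": "1",  # 1 for offline
        "TRANSFORMERS_CACHE": os.path.join(cache_root, "models"),
        "HF_DATASETS_CACHE": os.path.join(cache_root, "datasets"),
        "HF_MODULES_CACHE": os.path.join(cache_root, "modules"),
        "HF_METRICS_CACHE": os.path.join(cache_root, "metrics"),
    }
    os.environ.update(settings)
    return settings


# Call this first, then import mteb / sentence_transformers.
configure_offline("/gpfswork/rech/six/commun")
```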

Env Setup

export CONDA_ENVS_PATH=$six_ALL_CCFRWORK/conda

conda create -y -n hf-prod python=3.8
conda activate hf-prod

# pt-1.10.1 / cuda 11.3
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

# Custom fork that uses offline datasets
pip install --upgrade git+https://github.com/Muennighoff/mteb.git@offlineaccess
pip install --upgrade git+https://github.com/Muennighoff/sentence-transformers.git@sgpt_poolings
# If you want to run BEIR tasks
pip install --upgrade git+https://github.com/beir-cellar/beir.git

Model setup

Download

import os
import sentence_transformers
os.environ["SENTENCE_TRANSFORMERS_HOME"] = "/gpfswork/rech/six/commun/models"
sentence_transformers_cache_dir = os.getenv("SENTENCE_TRANSFORMERS_HOME")
model_repo="sentence-transformers/allenai-specter"
revision="29f9f45ff2a85fe9dfe8ce2cef3d8ec4e65c5f37"
model_path = os.path.join(sentence_transformers_cache_dir, model_repo.replace("/", "_"))
model_path_tmp = sentence_transformers.util.snapshot_download(
    repo_id=model_repo,
    revision=revision,
    cache_dir=sentence_transformers_cache_dir,
    library_name="sentence-transformers",
    library_version=sentence_transformers.__version__,
    ignore_files=["flax_model.msgpack", "rust_model.ot", "tf_model.h5",],
)
os.rename(model_path_tmp, model_path)
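
`os.rename` can fail if `model_path` already exists, for example when the download is run a second time. A minimal sketch of a guard around that step, assuming the same `<org>_<model>` naming; `ensure_model` and its `download` parameter are hypothetical, with `download` standing in for the `snapshot_download` call above:

```python
import os
import tempfile


def ensure_model(model_repo: str, cache_dir: str, download) -> str:
    """Download model_repo into cache_dir unless it is already there."""
    model_path = os.path.join(cache_dir, model_repo.replace("/", "_"))
    if os.path.isdir(model_path):
        return model_path  # already downloaded; skip the network call
    tmp_path = download(model_repo, cache_dir)  # returns the snapshot directory
    os.rename(tmp_path, model_path)
    return model_path


# Demonstration with a stand-in downloader that only creates a directory.
def fake_download(repo_id: str, cache_dir: str) -> str:
    return tempfile.mkdtemp(dir=cache_dir)


cache = tempfile.mkdtemp()
first = ensure_model("sentence-transformers/allenai-specter", cache, fake_download)
second = ensure_model("sentence-transformers/allenai-specter", cache, fake_download)
print(first == second)  # True: the second call reuses the existing directory
```

Passing the downloader in as an argument keeps the guard independent of the installed sentence-transformers version.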

Load

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("/gpfswork/rech/six/commun/models/Muennighoff_SGPT-125M-weightedmean-nli-bitfit")

