A modular training framework for retrieval models that trains on real-valued labels derived from downstream LLM performance rather than binary human annotations.
Most retrieval models are created by fine-tuning on standard retrieval datasets like MS MARCO, which contain question-answer pairs with context passages labeled by humans as relevant or irrelevant. This traditional approach relies on binary relevance labels that are often noisy and reflect human biases about what constitutes useful context. But what we actually care about is: does the retrieved context help a downstream LLM generate better answers?
The framework addresses this in three ways:
- Soft Label Generation: For each (question, context, answer) triplet in our retrieval dataset (MS MARCO), we measure how well the context actually helps a frozen decoder LLM generate the correct answer by computing the loss over the answer tokens. This loss becomes the "soft" utility label for that context (see the sketch after this list).
- Real-Valued Training: Instead of contrastive loss with binary labels, we fine-tune base encoder models so that their similarity scores match these real-valued "soft" utility labels (one possible objective is sketched below)
- End-to-End Evaluation: Rather than evaluating our retrieval models on how well they recover the (noisy, possibly biased) labels, we evaluate on whether retrieved documents actually improve LLM answer generation
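To make the soft-label step concrete, here is a minimal sketch of the loss computation, assuming a HuggingFace causal LM; the prompt template and function name are illustrative, not the repository's exact code:

```python
# Minimal sketch: per-passage utility = frozen LLM's loss over the answer tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B-Base"  # frozen labeling LLM
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
llm.eval()

@torch.no_grad()
def answer_loss(question: str, context: str, answer: str) -> float:
    """Mean cross-entropy over the answer tokens, conditioned on context + question."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer: "  # assumed prompt format
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1).to(llm.device)
    # Mask out prompt tokens so the loss is computed only over the answer span.
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100
    return llm(input_ids, labels=labels).loss.item()  # lower loss = more useful context
```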
The goal is to directly optimize for downstream performance rather than proxy metrics, testing whether this approach produces better retrieval models for LLM-assisted question answering.
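As one example of what "matching similarity scores against soft labels" can look like (an assumed objective for illustration; the framework's actual loss components are listed further below), a KL-style distillation loss over a question's candidate passages might be:

```python
# Sketch of a soft-label training objective: match the encoder's similarity
# distribution to the "teacher" distribution implied by the LLM answer losses.
# Temperatures and the exact formulation are assumptions for illustration.
import torch
import torch.nn.functional as F

def soft_label_kl_loss(sim_scores: torch.Tensor, llm_losses: torch.Tensor,
                       tau_student: float = 0.05, tau_teacher: float = 1.0) -> torch.Tensor:
    """sim_scores, llm_losses: (batch, n_passages) for each question's candidates."""
    student_logp = F.log_softmax(sim_scores / tau_student, dim=-1)
    teacher_p = F.softmax(-llm_losses / tau_teacher, dim=-1)  # lower loss -> higher utility
    return F.kl_div(student_logp, teacher_p, reduction="batchmean")
```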
Figure 1: Overview of the bitter retrieval training methodology
- Training Data: 80k MS MARCO examples
- Hardware: 1x H100, 2 epochs
- Base Models: BERT-base-uncased, nomic-embed-text-v1-unsupervised
- Evaluation: A sample of MS MARCO test data. For each test example, the question and the top retrieved context are passed to a freshly loaded decoder LLM, which produces an answer. An LLM-as-judge (Gemini 2.0) scores the answer for correctness against the ground truth provided in the original dataset. Retrieval accuracy measures whether the retrieved context matches the context marked as "best" by human annotators in the original MS MARCO dataset (a sketch of this loop follows the list).
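A rough sketch of this per-example evaluation loop is shown below; `encoder`, `generate_answer`, and `judge_is_correct` are placeholders rather than the repository's evaluate.py API:

```python
# Sketch of the evaluation loop: retrieve -> generate -> judge.
import torch

def evaluate_example(question, gold_answer, encoder, corpus_texts, corpus_embeddings,
                     generate_answer, judge_is_correct):
    # 1. Retrieve the top-scoring passage for the question.
    q_emb = encoder.encode(question)                      # (d,) embedding tensor
    scores = corpus_embeddings @ q_emb                    # similarity against the corpus
    top_context = corpus_texts[int(torch.argmax(scores))]
    # 2. Have the frozen decoder LLM answer from the retrieved context.
    prediction = generate_answer(question, top_context)
    # 3. LLM-as-judge (Gemini 2.0 in the setup above) scores the answer vs. ground truth.
    answer_correct = judge_is_correct(question, prediction, gold_answer)
    # Retrieval accuracy would additionally check top_context against the human "best" label.
    return answer_correct
```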
| Model | Training Method | Answer Accuracy | Retrieval Accuracy |
|---|---|---|---|
| BERT Base | Standard InfoNCE | 45.8% | 30.4% |
| BERT Base | Soft Labels (Ours) | **47.0%** | 29.6% |
| Nomic Embed | Standard InfoNCE | 47.6% | 40.0% |
| Nomic Embed | Soft Labels (Ours) | **47.8%** | 33.6% |
(Nomic Embed trained with Qwen-generated labels → Llama 3.1-8B evaluation)

| Training Method | Answer Accuracy | Retrieval Accuracy |
|---|---|---|
| Standard InfoNCE | 33.8% | 38.2% |
| Soft Labels (Ours) | **37.6%** | 33.6% |
(BERT trained with Llama-generated labels → Qwen 3 8B evaluation)

| Training Method | Answer Accuracy | Retrieval Accuracy |
|---|---|---|
| Standard InfoNCE | 46.4% | 31.8% |
| Soft Labels (Ours) | **47.8%** | 28.2% |
- Validation of Core Hypothesis: Models trained on soft labels achieve better downstream LLM performance despite sometimes lower retrieval accuracy on human labels, demonstrating that the human labels are suboptimal.
- Cross-LLM Generalization: Models trained with labels from one LLM (Qwen) generalize well to different LLMs (Llama) during evaluation, often performing even better
- Human Label Limitations: A 21.9% disagreement rate between human annotations and actual LLM utility demonstrates the noise in traditional training data
Figure 2: Comparison between human-labeled "best" passages vs. passages that actually minimize LLM loss
We've created and published soft-labeled versions of MS MARCO v1.1 with ~100k examples labeled using the approach described above (a sketch of the disagreement-rate computation follows this list):

- Labeling Model: Qwen3-8B-Base (bf16)
- Key Finding: 20.3% disagreement rate between human-labeled "best" passages and the passages that actually produce the lowest LLM loss
- Labeling Model: Llama-3.1-8B (bf16)
- Key Finding: 22.2% disagreement rate between human-labeled "best" passages and the passages that actually produce the lowest LLM loss
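The disagreement rate can be computed roughly as follows; the field names (`human_best_idx`, `llm_losses`) are illustrative assumptions about the labeled records, not the published dataset schema:

```python
# Sketch: fraction of examples where the lowest-loss passage differs from the human "best" one.
def disagreement_rate(examples) -> float:
    disagree = 0
    for ex in examples:  # ex: {"human_best_idx": int, "llm_losses": list[float]}
        llm_best_idx = min(range(len(ex["llm_losses"])), key=lambda i: ex["llm_losses"][i])
        disagree += int(llm_best_idx != ex["human_best_idx"])
    return disagree / len(examples)
```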
Note: This project is actively under development. Results are preliminary and based on initial experiments.
git clone https://github.com/nickcdryan/bitter-retrieval.git
cd bitter-retrieval
./setup.sh
# Restart shell with source ~/.bashrc or open new terminal
poetry run python setup_env.py
poetry run python tests/test_setup.py

Create a .env file with your API tokens:
cp .env.example .env
# Edit .env with your actual API keys

# Use pre-defined experiments
poetry run python scripts/train.py --experiment kl_margin
poetry run python scripts/train.py --config configs/experiments/full_modular.yaml
# Custom training with overrides
poetry run python scripts/train.py --config configs/experiments/kl_only.yaml --batch-size 32 --run-name "my-experiment"

bitter-retrieval/
├── bitter_retrieval/        # 🧠 Core library
│   ├── config.py            # Configuration management
│   ├── training/            # Training components (losses, trainer, schedulers)
│   ├── data/                # Data loading and processing
│   ├── models/              # Model utilities (LLM, encoder)
│   ├── evaluation/          # Evaluation metrics and orchestration
│   └── utils/               # Utilities (device, encoding, logging, I/O)
├── scripts/                 # 🚀 Main entry points
│   ├── train.py             # Main training script
│   ├── label_data.py        # Data labeling with LLMs
│   ├── evaluate.py          # Model evaluation
│   ├── setup_env.py         # Environment setup
│   └── push_to_hf.py        # HuggingFace Hub operations
├── configs/                 # ⚙️ Configuration files
│   ├── default.yaml         # Base configuration
│   ├── experiments/         # Pre-defined experiments
│   └── models/              # Model-specific settings
├── tests/                   # 🧪 Test suite
└── [setup files, poetry config, etc.]
- Multiple Loss Functions: KL divergence, MSE, margin hinge, InfoNCE variants
- Flexible Combinations: Mix and match loss components with custom weights (see the sketch after these feature lists)
- Easy Experimentation: YAML-based configuration with inheritance support
- Modular Training: Combine multiple loss components (`kl + margin`, `kl + mse + infonce`, etc.)
- Standard InfoNCE: Classic contrastive learning
- Soft Label Learning: Knowledge distillation from teacher LLMs
- Margin Learning: Contrastive learning with margin constraints
- Encoders: Nomic Embed, BERT, any HuggingFace encoder
- Teacher LLMs: Qwen, Llama, any causal language model
- Optimizations: Flash Attention 2, gradient clipping, LR scheduling
- Multiple Metrics: Retrieval accuracy, F1, EM, LLM judge scores
- Datasets: SQuAD, MS MARCO with configurable corpus sizes
- Async Processing: Batch generation and parallel LLM evaluation
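To illustrate the modular combination, here is a rough sketch (not the library's trainer code) of how weighted `kl`, `mse`, and `margin` components could be combined for one batch of encoder similarity scores and LLM-derived utilities:

```python
# Sketch of a weighted multi-component loss; names and weights mirror the
# loss_components style used in the experiment YAML, but are illustrative only.
import torch
import torch.nn.functional as F

def combined_loss(sim_scores, utilities, weights=None, margin=0.1):
    """sim_scores, utilities: (batch, n_passages); utilities are e.g. negated LLM answer losses."""
    if weights is None:
        weights = {"kl": 0.4, "mse": 0.3, "margin": 0.3}
    # KL: match the encoder's similarity distribution to the utility distribution.
    kl = F.kl_div(F.log_softmax(sim_scores, dim=-1),
                  F.softmax(utilities, dim=-1), reduction="batchmean")
    # MSE: regress similarities directly onto the utilities (possibly after normalization).
    mse = F.mse_loss(sim_scores, utilities)
    # Margin hinge: the highest-utility passage should beat every other passage by `margin`.
    best = utilities.argmax(dim=-1, keepdim=True)          # (batch, 1)
    pos = sim_scores.gather(-1, best)                      # (batch, 1)
    mask = torch.ones_like(sim_scores, dtype=torch.bool).scatter_(-1, best, False)
    hinge = F.relu(margin - (pos - sim_scores))[mask].mean()
    return weights["kl"] * kl + weights["mse"] * mse + weights["margin"] * hinge
```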
poetry run python scripts/train.py --config configs/experiments/kl_margin.yaml
poetry run python scripts/train.py --config configs/experiments/kl_only.yaml
poetry run python scripts/train.py --config configs/experiments/full_modular.yaml

# custom_experiment.yaml
base_config: "configs/default.yaml"
method:
  training_method: "modular"
  loss_components:
    kl: 0.4
    mse: 0.3
    margin: 0.3

Generate soft labels for your own datasets:
# Label MS MARCO data
poetry run python scripts/label_data.py --model Qwen/Qwen3-8B-Base --split train --num 1000
# Process all splits and upload to HF Hub
poetry run python scripts/label_data.py --model Qwen/Qwen3-8B-Base --all-splits --upload-hf "username/dataset-name"

# Evaluate trained model
poetry run python scripts/evaluate.py --model-path models/my-trained-model
# Evaluate base model
poetry run python scripts/evaluate.py --encoder nomic-ai/nomic-embed-text-v1 --dataset squad
# Save results
poetry run python scripts/evaluate.py --model-path models/my-model --output-file results.json

# Push trained model to HF Hub
poetry run python scripts/push_to_hf.py --model models/my-model --repo username/model-name
# Push dataset to HF Hub
poetry run python scripts/push_to_hf.py --dataset data/my-dataset.json --repo username/dataset-name --type dataset
# List local models
poetry run python scripts/push_to_hf.py --list-local

- `kl_margin.yaml` - KL divergence + margin hinge loss
- `kl_only.yaml` - Pure knowledge distillation
- `margin_only.yaml` - Pure contrastive learning
- `infonce_only.yaml` - Standard InfoNCE baseline
- `full_modular.yaml` - Multi-component training
- `nomic_embed.yaml` - Optimized for Nomic Embed Text v1
- `qwen_8b.yaml` - Optimized for Qwen 3-8B Base
- `bert_base.yaml` - BERT Base alternative
See configs/README.md for detailed configuration documentation.
poetry run python setup_env.py --install-flash-attention
poetry run python tests/test_setup.py  # Test environment setup

- Python 3.11+
- PyTorch 2.7+ with CUDA support
- Transformers, Datasets, W&B
- Optional: Flash Attention 2 for performance
- Linux: Ubuntu/Debian, Fedora/RHEL, Arch/Manjaro
- GPU: CUDA 12.8+ recommended
- Memory: 16GB+ RAM, 8GB+ VRAM recommended
If you use this framework in your research, please cite:
@software{bitter_retrieval_2024,
  title = {Bitter Retrieval: Modular Training Framework for Retrieval Models},
  author = {Nick Ryan},
  year = {2024},
  url = {https://github.com/nickcdryan/bitter-retrieval}
}

This project is licensed under the MIT License - see the LICENSE file for details.