(An open-source reimplementation and extension of REFRAG: Rethinking RAG-based Decoding (Lin et al., 2025))
refrag-lite is a compact, reproducible reimplementation of the REFRAG framework that introduces reinforcement-learning-based selective expansion for Retrieval-Augmented Generation (RAG).
While REFRAG established the core idea of compressing most retrieved chunks into dense vectors and selectively expanding a few, refrag-lite focuses on practical usability: one-GPU compatibility, modular RL policy design (Bandit + PPO), and open-weight reproducibility.
Traditional RAG pipelines expand all retrieved chunks into full text tokens, which is expensive and adds latency.
Following the insight from REFRAG (Lin et al., 2025), this project applies token-efficient context construction: most chunks are injected as single learned vectors, while a lightweight RL policy decides which to expand fully under a token budget.
Query → Retrieve K chunks → RL Policy → Expand few, compress rest → LLM → Answer
                                ↓
                  Most chunks: 1 vector each
                  Selected chunks: Full tokens
- Compression (from REFRAG): Encode most chunks as dense vectors fed directly to the LLM
- RL-Selective Expansion (extended): Learn which chunks to expand for the best quality/cost trade-off
- Token Budget Control (new): Explicit configurable limits with adaptive policy sampling
- Multi-Stage Training: Reconstruction → Continual Pretraining → SFT → RL
- Bandit + PPO: Start with LinUCB/Thompson Sampling, optionally upgrade to PPO
- One-GPU Friendly: Runs on 24GB VRAM with PEFT/LoRA
- Fully Open-Source: No paid APIs, all models open-weight
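To make the core idea concrete, here is a rough sketch of budgeted context construction; every name below (retriever, encoder, policy, chunk fields) is an illustrative placeholder rather than the actual refrag-lite API:

```python
# Illustrative sketch only: the objects and methods below are hypothetical stand-ins,
# not the refrag-lite API.
def build_context(query, retriever, encoder, policy, tokenizer, token_budget=2000):
    chunks = retriever.retrieve(query, k=10)                      # K candidate chunks
    expand_ids = set(policy.select(chunks, budget=token_budget))  # RL policy picks a few to expand

    context = []
    for i, chunk in enumerate(chunks):
        if i in expand_ids:
            context.append(("tokens", tokenizer(chunk.text)))       # full text tokens
        else:
            context.append(("vector", encoder.encode(chunk.text)))  # one dense vector per chunk
    return context  # mixed sequence of token spans and compressed vectors fed to the LLM
```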
# Setup
make setup
# Download data and build indexes (tiny subset)
make data-tiny
# Run end-to-end tiny pipeline
make tiny
# View results
cat reports/tiny_run_summary.md
Expected tiny run results (TinyLlama-1.1B on HotpotQA dev subset):
- Baseline RAG: EM ~15%, TTFT ~2.5s, Tokens ~4000
- Compression-Only: EM ~12%, TTFT ~0.8s, Tokens ~800
- RL-Selective: EM ~25%, TTFT ~1.0s, Tokens ~1500 (50% savings, 2× faster)
Current Performance (Groq API + Real Data):
- EM Score: 25% (exact matches)
- F1 Score: 30% (partial matches)
- TTFT: 1.03s (time to first token)
- Throughput: 514 tok/s
- Memory: 7.0 GB
- Python 3.10+
- PyTorch 2.3+
- Accelerator: CUDA 11.8+ (NVIDIA), MPS (Apple Silicon), or CPU fallback
- 24GB VRAM recommended for full runs (8B models)
- 50GB disk space for data/models
conda env create -f environment.yml
conda activate refrag-lite
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
docker build -t refrag-lite .
docker run -it -v $(pwd)/data:/app/data refrag-lite
# Note: --gpus all only needed for NVIDIA GPUs
# Download HotpotQA
python data/scripts/download_hotpotqa.py --output data/hotpotqa
# Build BM25 and FAISS indexes
python data/scripts/build_corpus.py --bm25 --dense --output data/indexes
# Stage 1: Token reconstruction from compressed vectors
python -m refrag.train.pretrain_recon --config configs/default.yaml
# Stage 2: Continual pretraining with curriculum
python -m refrag.train.pretrain_cpt --config configs/default.yaml
# Fine-tune LLM with mixed inputs (compressed + expanded)
python -m refrag.train.sft_qa --config configs/default.yaml
# Train bandit policy (LinUCB or Thompson Sampling)
python -m refrag.rl.train_policy --config configs/rl_bandit.yaml
# Optional: Train PPO policy
python -m refrag.rl.train_policy --config configs/rl_ppo.yaml --algo ppo
# Run baselines
bash scripts/run_baselines.sh
# Evaluate QA performance
python -m refrag.eval.qa_eval --config configs/eval.yaml
# Evaluate speed/efficiency
python -m refrag.eval.speed_eval --config configs/eval.yaml
# Generate report with plots
python -m refrag.eval.report --config configs/eval.yaml
python scripts/export_policy.py --checkpoint checkpoints/bandit_best.pt --out policy.bin
make setup # Create environment and install dependencies
make data # Download and index full HotpotQA dataset
make data-tiny # Download and index tiny subset (for quick testing)
make tiny # Run complete tiny pipeline
make train # Run pretrain + SFT stages
make rl # Train RL policy
make eval # Run evaluation and generate report
make test # Run unit tests
make clean # Clean generated files
make docker-build # Build Docker image
make docker-run # Run Docker container
All configuration files are in configs/. Key files:
- tiny.yaml: Fast CPU/GPU demo configuration
- default.yaml: Full 1-GPU configuration
- model_llm.yaml: LLM settings (Llama/Mistral/Zephyr)
- model_encoder.yaml: Encoder + projector settings
- retriever.yaml: BM25 + dense retrieval config
- rl_bandit.yaml: LinUCB/Thompson Sampling parameters
- rl_ppo.yaml: PPO parameters
- eval.yaml: Evaluation thresholds and plotting config
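The excerpt below shows some key parameters. As a minimal sketch (assuming standard PyYAML and the keys shown in the excerpt), a config can be loaded and inspected like this:

```python
import yaml

# Minimal sketch, assuming PyYAML and the keys shown in the excerpt below.
with open("configs/default.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["llm"]["model_name"])   # e.g. "meta-llama/Llama-3.1-8B-Instruct"
print(cfg["rl"]["token_budget"])  # e.g. 2000
```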
# Model selection
llm:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"  # or TinyLlama, Mistral, Zephyr
  use_peft: true
  lora_r: 16
encoder:
  model_name: "sentence-transformers/all-MiniLM-L6-v2"  # or roberta-large

# Token budget
rl:
  token_budget: 2000
  compression_rate: 0.75  # Compress 75% of chunks
  top_k_candidates: 10

# Reward function
reward:
  perplexity_weight: -1.0
  token_penalty: 0.001
  correctness_bonus: 1.0

Architecture adapted and simplified from the REFRAG framework (Lin et al., 2025), with new lightweight Bandit + PPO policy modules and LoRA-based efficiency tuning.
- Encoder: Frozen sentence encoder (MiniLM or RoBERTa)
- Projector: 2-layer MLP mapping to LLM embedding space
- Adapter: Injects compressed vectors into LLM input embeddings
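A minimal sketch of this compression path, assuming a sentence-transformers MiniLM encoder and an illustrative 2-layer MLP projector (class names and dimensions are assumptions, not the actual refrag/compress modules):

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class Projector(nn.Module):
    """Sketch: 2-layer MLP mapping encoder embeddings into the LLM embedding space."""
    def __init__(self, enc_dim=384, llm_dim=4096, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(enc_dim, hidden), nn.GELU(), nn.Linear(hidden, llm_dim))

    def forward(self, x):
        return self.net(x)

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # kept frozen
projector = Projector()

chunk_vecs = torch.from_numpy(encoder.encode(["first chunk text", "second chunk text"]))
soft_tokens = projector(chunk_vecs)  # shape (num_chunks, llm_dim): one "soft token" per compressed chunk
# The adapter splices these vectors into the LLM's input embedding sequence.
```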
Features (per chunk):
- Similarity to query (cosine)
- Novelty vs already-selected chunks
- Reranker score / information gain proxy
- Chunk length (tokens)
- Redundancy with other candidates
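As an illustration of how such a feature vector might be assembled (the helper and its arguments are hypothetical; the actual feature extraction lives in refrag/rl/):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def chunk_features(q_emb, c_emb, selected_embs, other_cand_embs, rerank_score, n_tokens):
    """Sketch: assemble the per-chunk feature vector described above."""
    sim = cos(q_emb, c_emb)                                                   # similarity to query
    novelty = 1.0 - max((cos(c_emb, s) for s in selected_embs), default=0.0)  # vs. already-selected chunks
    redundancy = max((cos(c_emb, o) for o in other_cand_embs), default=0.0)   # vs. other candidates
    return np.array([sim, novelty, rerank_score, n_tokens / 512.0, redundancy])
```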
Actions: Binary per chunk (expand or keep compressed)
Reward: -perplexity(answer | context) - λ × expanded_tokens + bonus(correctness)
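This maps directly onto the configurable weights shown earlier (perplexity_weight, token_penalty, correctness_bonus); a one-function sketch with hypothetical argument names:

```python
def reward(answer_ppl, expanded_tokens, is_correct, lam=0.001, bonus=1.0):
    """Sketch of the reward: lower perplexity and fewer expanded tokens are better."""
    return -answer_ppl - lam * expanded_tokens + (bonus if is_correct else 0.0)
```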
Algorithms:
- LinUCB: Upper confidence bound with linear reward model
- Thompson Sampling: Bayesian approach with posterior sampling
- PPO (optional): Policy gradient with budget constraints
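A compact sketch of the LinUCB variant under a token budget (illustrative only; class and method names are not the actual refrag/rl implementation):

```python
import numpy as np

class LinUCBExpander:
    """Sketch: LinUCB with a shared linear reward model and a greedy token-budget constraint."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)    # feature covariance
        self.b = np.zeros(dim)  # reward-weighted feature sum
        self.alpha = alpha      # exploration strength

    def select(self, features, lengths, budget):
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        ucb = features @ theta + self.alpha * np.sqrt(
            np.einsum("ij,jk,ik->i", features, A_inv, features))
        chosen, used = [], 0
        for i in np.argsort(-ucb):              # greedily expand the highest-UCB chunks
            if used + lengths[i] <= budget:
                chosen.append(int(i))
                used += lengths[i]
        return chosen

    def update(self, x, r):                     # after observing reward r for expanding chunk x
        self.A += np.outer(x, x)
        self.b += r * x
```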
Note: The following are target performance metrics. Actual results will depend on your hardware configuration and training setup.
Expected results on HotpotQA dev set (Llama-3.1-8B-Instruct, 10 retrieved chunks):
| Method | EM | F1 | Avg Tokens | TTFT (s) | Throughput (tok/s) |
|---|---|---|---|---|---|
| Standard RAG | 42.3 | 54.1 | 6500 | 8.2 | 125 |
| Compression-Only | 38.1 | 49.7 | 1200 | 1.5 | 680 |
| RL-Selective (LinUCB) | 41.8 | 53.4 | 2100 | 2.8 | 510 |
| RL-Selective (PPO) | 42.1 | 53.9 | 2300 | 3.1 | 480 |
Target Performance:
- 67% token reduction vs standard RAG
- 3× faster TTFT
- <1% EM/F1 degradation
- PPO slightly better than LinUCB but requires more training
RL-Selective is expected to sit on a better quality/cost Pareto frontier than either baseline.
| Method | Evidence Precision | Evidence Recall |
|---|---|---|
| Standard RAG | 0.68 | 0.85 |
| RL-Selective | 0.82 | 0.81 |
Target: RL policy learns to expand chunks containing answer spans.
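For reference, these metrics reduce to a set overlap between the chunks the policy expands and the gold evidence chunks (a sketch; it assumes gold evidence chunk IDs are available, as in HotpotQA):

```python
def evidence_precision_recall(expanded_ids, gold_ids):
    """Sketch: precision/recall of expanded chunks against gold evidence chunks."""
    expanded, gold = set(expanded_ids), set(gold_ids)
    hits = len(expanded & gold)
    precision = hits / len(expanded) if expanded else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```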
refrag-lite/
├── configs/ # YAML configurations
├── data/ # Dataset scripts and storage
│ ├── scripts/ # Download and indexing scripts
│ └── hotpotqa/ # Downloaded data (gitignored)
├── refrag/ # Main package
│ ├── utils/ # Utilities (io, logging, metrics, plots)
│ ├── retrieval/ # BM25, dense, hybrid retrieval
│ ├── compress/ # Encoder, projector, mixed context
│ ├── llm/ # LLM loading, inference, adapters
│ ├── train/ # Training scripts (pretrain, SFT)
│ ├── rl/ # RL components (bandit, PPO, rewards)
│ └── eval/ # Evaluation and reporting
├── scripts/ # Shell scripts for pipelines
├── tests/ # Unit tests
├── reports/ # Generated reports and plots
├── checkpoints/ # Model checkpoints (gitignored)
├── Dockerfile # Docker configuration
├── Makefile # Build automation
├── requirements.txt # Pip dependencies
├── environment.yml # Conda environment
└── pyproject.toml # Project metadata
Edit configs/model_llm.yaml:
model_name: "mistralai/Mistral-7B-Instruct-v0.2"
# or "HuggingFaceH4/zephyr-7b-beta"
# or "TinyLlama/TinyLlama-1.1B-Chat-v1.0"Edit configs/model_encoder.yaml:
model_name: "sentence-transformers/all-roberta-large-v1"
# or "intfloat/e5-base-v2"Edit configs/rl_bandit.yaml:
token_budget: 1500 # Lower = more aggressive compression
compression_rate: 0.8 # Compress 80% of chunkspython -m refrag.rl.train_policy --config configs/rl_ppo.yaml --algo ppopytest tests/ -v --cov=refragruff check refrag/
black refrag/
mypy refrag/
Implement the refrag.rl.policies.Policy interface:
from typing import List

import numpy as np

from refrag.rl.policies import Policy

class MyPolicy(Policy):
    def select(self, features: np.ndarray, budget: int) -> List[int]:
        """Select chunk indices to expand under the token budget."""
        # Your logic here: score chunks and return the indices to expand
        selected_indices: List[int] = []
        return selected_indices
- Single-turn QA only: No multi-turn conversation support
- English only: Not tested on multilingual datasets
- Fixed encoder: Encoder frozen during training (not end-to-end)
- Offline RL: No online policy updates during inference
- Simple reward: Perplexity-based, doesn't directly optimize answer metrics
- End-to-end training: Jointly train encoder, projector, and LLM
- Better reward shaping: Direct EM/F1 optimization with REINFORCE
- Multi-turn RAG: Extend to conversational settings with memory
- Domain adaptation: Fine-tune on domain-specific corpora
- Online learning: Update policy based on user feedback
- Hierarchical compression: Multi-level compression (section → paragraph → sentence)
- Learned budget allocation: Policy also decides total budget per query
If you use this code in your research, please cite:
@inproceedings{lin2025refrag,
  title  = {REFRAG: Rethinking RAG-based Decoding},
  author = {Lin, Xiaoqiang and Ghosh, Aritra and Low, Bryan Kian Hsiang and Shrivastava, Anshumali and Mohan, Vijai},
  year   = {2025},
  url    = {https://arxiv.org/abs/2509.01092}
}

MIT License - see LICENSE file.
- REFRAG (Lin et al., 2025) for introducing the selective compression and expansion paradigm
- HotpotQA dataset: [Yang et al., 2018]
- Sentence Transformers: [Reimers & Gurevych, 2019]
- Hugging Face ecosystem
- OpenAI for ChatGPT-assisted debugging (no API used in code)
For issues, questions, or contributions:
- GitHub Issues: https://github.com/n33levo/refrag-lite/issues
- Discussions: https://github.com/n33levo/refrag-lite/discussions
Q: Can I run this on CPU only?
A: Yes, use tiny.yaml config and set device: cpu. Expect slower training.
Q: Would this work on a Mac?
A: Yes! The project supports MPS (Metal Performance Shaders) for Apple Silicon. Use device: auto and it will automatically detect and use MPS.
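For reference, device: auto can be resolved with standard PyTorch checks along these lines (a sketch; the actual helper in refrag-lite may differ):

```python
import torch

def resolve_device(requested="auto"):
    """Pick CUDA, Apple-Silicon MPS, or CPU when the config says device: auto."""
    if requested != "auto":
        return torch.device(requested)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```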
Q: How much VRAM do I need?
A: Tiny run: 4-6GB. Full run with an 8B model: 18-24GB (with PEFT). A MacBook M4 with 24GB unified memory works well.
Q: Can I use OpenAI/Anthropic APIs?
A: No, this project uses only open-weight models. You could adapt it, but it would break the efficiency benefits.
Q: How long does full training take?
A: On a single A100: pretrain (~4h) + SFT (~2h) + RL (~3h) + eval (~1h) ≈ 10 hours total. On a MacBook M4, expect 2-3× longer.
Q: Does it work with other datasets?
A: Yes, but you'll need to adapt the data scripts. Tested on HotpotQA; it should work with NQ, TriviaQA, MSMARCO, etc.
