(An open-source reimplementation and extension of REFRAG: Rethinking RAG-based Decoding (Lin et al., 2025))
refrag-lite is a compact, reproducible reimplementation of the REFRAG framework that introduces reinforcement-learning-based selective expansion for Retrieval-Augmented Generation (RAG).
While REFRAG established the core idea of compressing most retrieved chunks into dense vectors and selectively expanding a few, refrag-lite focuses on practical usability: one-GPU compatibility, modular RL policy design (Bandit + PPO), and open-weight reproducibility.
Traditional RAG pipelines expand all retrieved chunks into full text tokens, which is expensive and adds latency.
Following the insight from REFRAG (Lin et al., 2025), this project applies token-efficient context construction: most chunks are injected as single learned vectors, while a lightweight RL policy decides which to expand fully under a token budget.
Query → Retrieve K chunks → RL Policy → Expand few, compress rest → LLM → Answer
                                ↓
                  Most chunks: 1 vector each
                  Selected chunks: Full tokens
- Compression (from REFRAG): Encode most chunks as dense vectors fed directly to the LLM
- RL-Selective Expansion (extended): Learn which chunks to expand for the best quality/cost trade-off
- Token Budget Control (new): Explicit configurable limits with adaptive policy sampling
- Multi-Stage Training: Reconstruction → Continual Pretraining → SFT → RL
- Bandit + PPO: Start with LinUCB/Thompson Sampling, optionally upgrade to PPO
- One-GPU Friendly: Runs on 24GB VRAM with PEFT/LoRA
- Fully Open-Source: No paid APIs, all models open-weight
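To make the core idea concrete, here is a rough sketch of budgeted context construction; every name below (retriever, encoder, policy, chunk fields) is an illustrative placeholder rather than the actual refrag-lite API:

```python
# Illustrative sketch only: the objects and methods below are hypothetical stand-ins,
# not the refrag-lite API.
def build_context(query, retriever, encoder, policy, tokenizer, token_budget=2000):
    chunks = retriever.retrieve(query, k=10)                      # K candidate chunks
    expand_ids = set(policy.select(chunks, budget=token_budget))  # RL policy picks a few to expand

    context = []
    for i, chunk in enumerate(chunks):
        if i in expand_ids:
            context.append(("tokens", tokenizer(chunk.text)))       # full text tokens
        else:
            context.append(("vector", encoder.encode(chunk.text)))  # one dense vector per chunk
    return context  # mixed sequence of token spans and compressed vectors fed to the LLM
```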
# Setup
make setup
# Download data and build indexes (tiny subset)
make data-tiny
# Run end-to-end tiny pipeline
make tiny
# View results
cat reports/tiny_run_summary.md
Expected tiny run results (TinyLlama-1.1B on HotpotQA dev subset):
- Baseline RAG: EM ~15%, TTFT ~2.5s, Tokens ~4000
- Compression-Only: EM ~12%, TTFT ~0.8s, Tokens ~800
- RL-Selective: EM ~25%, TTFT ~1.0s, Tokens ~1500 (50% savings, 2× faster)
Current Performance (Groq API + Real Data):
- EM Score: 25% (exact matches)
- F1 Score: 30% (partial matches)
- TTFT: 1.03s (time to first token)
- Throughput: 514 tok/s
- Memory: 7.0 GB
- Python 3.10+
- PyTorch 2.3+
- Accelerator: CUDA 11.8+ (NVIDIA), MPS (Apple Silicon), or CPU fallback
- 24GB VRAM recommended for full runs (8B models)
- 50GB disk space for data/models
conda env create -f environment.yml
conda activate refrag-lite
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install -r requirements.txt
docker build -t refrag-lite .
docker run -it -v $(pwd)/data:/app/data refrag-lite
# Note: --gpus all only needed for NVIDIA GPUs
# Download HotpotQA
python data/scripts/download_hotpotqa.py --output data/hotpotqa
# Build BM25 and FAISS indexes
python data/scripts/build_corpus.py --bm25 --dense --output data/indexes
# Stage 1: Token reconstruction from compressed vectors
python -m refrag.train.pretrain_recon --config configs/default.yaml
# Stage 2: Continual pretraining with curriculum
python -m refrag.train.pretrain_cpt --config configs/default.yaml
# Fine-tune LLM with mixed inputs (compressed + expanded)
python -m refrag.train.sft_qa --config configs/default.yaml
# Train bandit policy (LinUCB or Thompson Sampling)
python -m refrag.rl.train_policy --config configs/rl_bandit.yaml
# Optional: Train PPO policy
python -m refrag.rl.train_policy --config configs/rl_ppo.yaml --algo ppo
# Run baselines
bash scripts/run_baselines.sh
# Evaluate QA performance
python -m refrag.eval.qa_eval --config configs/eval.yaml
# Evaluate speed/efficiency
python -m refrag.eval.speed_eval --config configs/eval.yaml
# Generate report with plots
python -m refrag.eval.report --config configs/eval.yaml
python scripts/export_policy.py --checkpoint checkpoints/bandit_best.pt --out policy.bin
make setup # Create environment and install dependencies
make data # Download and index full HotpotQA dataset
make data-tiny # Download and index tiny subset (for quick testing)
make tiny # Run complete tiny pipeline
make train # Run pretrain + SFT stages
make rl # Train RL policy
make eval # Run evaluation and generate report
make test # Run unit tests
make clean # Clean generated files
make docker-build # Build Docker image
make docker-run # Run Docker container
All configuration files are in configs/. Key files:
- tiny.yaml: Fast CPU/GPU demo configuration
- default.yaml: Full 1-GPU configuration
- model_llm.yaml: LLM settings (Llama/Mistral/Zephyr)
- model_encoder.yaml: Encoder + projector settings
- retriever.yaml: BM25 + dense retrieval config
- rl_bandit.yaml: LinUCB/Thompson Sampling parameters
- rl_ppo.yaml: PPO parameters
- eval.yaml: Evaluation thresholds and plotting config
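The excerpt below shows some key parameters. As a minimal sketch (assuming standard PyYAML and the keys shown in the excerpt), a config can be loaded and inspected like this:

```python
import yaml

# Minimal sketch, assuming PyYAML and the keys shown in the excerpt below.
with open("configs/default.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["llm"]["model_name"])   # e.g. "meta-llama/Llama-3.1-8B-Instruct"
print(cfg["rl"]["token_budget"])  # e.g. 2000
```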
# Model selection
llm:
  model_name: "meta-llama/Llama-3.1-8B-Instruct"  # or TinyLlama, Mistral, Zephyr
  use_peft: true
  lora_r: 16
encoder:
  model_name: "sentence-transformers/all-MiniLM-L6-v2"  # or roberta-large

# Token budget
rl:
  token_budget: 2000
  compression_rate: 0.75  # Compress 75% of chunks
  top_k_candidates: 10

# Reward function
reward:
  perplexity_weight: -1.0
  token_penalty: 0.001
  correctness_bonus: 1.0

Architecture adapted and simplified from the REFRAG framework (Lin et al., 2025), with new lightweight Bandit + PPO policy modules and LoRA-based efficiency tuning.
- Encoder: Frozen sentence encoder (MiniLM or RoBERTa)
- Projector: 2-layer MLP mapping to LLM embedding space
- Adapter: Injects compressed vectors into LLM input embeddings
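A minimal sketch of this compression path, assuming a sentence-transformers MiniLM encoder and an illustrative 2-layer MLP projector (class names and dimensions are assumptions, not the actual refrag/compress modules):

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class Projector(nn.Module):
    """Sketch: 2-layer MLP mapping encoder embeddings into the LLM embedding space."""
    def __init__(self, enc_dim=384, llm_dim=4096, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(enc_dim, hidden), nn.GELU(), nn.Linear(hidden, llm_dim))

    def forward(self, x):
        return self.net(x)

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # kept frozen
projector = Projector()

chunk_vecs = torch.from_numpy(encoder.encode(["first chunk text", "second chunk text"]))
soft_tokens = projector(chunk_vecs)  # shape (num_chunks, llm_dim): one "soft token" per compressed chunk
# The adapter splices these vectors into the LLM's input embedding sequence.
```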
Features (per chunk):
- Similarity to query (cosine)
- Novelty vs already-selected chunks
- Reranker score / information gain proxy
- Chunk length (tokens)
- Redundancy with other candidates
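As an illustration of how such a feature vector might be assembled (the helper and its arguments are hypothetical; the actual feature extraction lives in refrag/rl/):

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def chunk_features(q_emb, c_emb, selected_embs, other_cand_embs, rerank_score, n_tokens):
    """Sketch: assemble the per-chunk feature vector described above."""
    sim = cos(q_emb, c_emb)                                                   # similarity to query
    novelty = 1.0 - max((cos(c_emb, s) for s in selected_embs), default=0.0)  # vs. already-selected chunks
    redundancy = max((cos(c_emb, o) for o in other_cand_embs), default=0.0)   # vs. other candidates
    return np.array([sim, novelty, rerank_score, n_tokens / 512.0, redundancy])
```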
Actions: Binary per chunk (expand or keep compressed)
Reward: -perplexity(answer | context) - λ × expanded_tokens + bonus(correctness)
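This maps directly onto the configurable weights shown earlier (perplexity_weight, token_penalty, correctness_bonus); a one-function sketch with hypothetical argument names:

```python
def reward(answer_ppl, expanded_tokens, is_correct, lam=0.001, bonus=1.0):
    """Sketch of the reward: lower perplexity and fewer expanded tokens are better."""
    return -answer_ppl - lam * expanded_tokens + (bonus if is_correct else 0.0)
```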
Algorithms:
- LinUCB: Upper confidence bound with linear reward model
- Thompson Sampling: Bayesian approach with posterior sampling
- PPO (optional): Policy gradient with budget constraints
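A compact sketch of the LinUCB variant under a token budget (illustrative only; class and method names are not the actual refrag/rl implementation):

```python
import numpy as np

class LinUCBExpander:
    """Sketch: LinUCB with a shared linear reward model and a greedy token-budget constraint."""
    def __init__(self, dim, alpha=1.0):
        self.A = np.eye(dim)    # feature covariance
        self.b = np.zeros(dim)  # reward-weighted feature sum
        self.alpha = alpha      # exploration strength

    def select(self, features, lengths, budget):
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        ucb = features @ theta + self.alpha * np.sqrt(
            np.einsum("ij,jk,ik->i", features, A_inv, features))
        chosen, used = [], 0
        for i in np.argsort(-ucb):              # greedily expand the highest-UCB chunks
            if used + lengths[i] <= budget:
                chosen.append(int(i))
                used += lengths[i]
        return chosen

    def update(self, x, r):                     # after observing reward r for expanding chunk x
        self.A += np.outer(x, x)
        self.b += r * x
```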
Note: The following are target performance metrics. Actual results will depend on your hardware configuration and training setup.
Expected results on HotpotQA dev set (Llama-3.1-8B-Instruct, 10 retrieved chunks):
| Method | EM | F1 | Avg Tokens | TTFT (s) | Throughput (tok/s) |
|---|---|---|---|---|---|
| Standard RAG | 42.3 | 54.1 | 6500 | 8.2 | 125 |
| Compression-Only | 38.1 | 49.7 | 1200 | 1.5 | 680 |
| RL-Selective (LinUCB) | 41.8 | 53.4 | 2100 | 2.8 | 510 |
| RL-Selective (PPO) | 42.1 | 53.9 | 2300 | 3.1 | 480 |
Target Performance:
- 67% token reduction vs standard RAG
- 3× faster TTFT
- <1% EM/F1 degradation
- PPO slightly better than LinUCB but requires more training
RL-Selective is expected to sit on a better quality/cost Pareto frontier than either baseline.
| Method | Evidence Precision | Evidence Recall |
|---|---|---|
| Standard RAG | 0.68 | 0.85 |
| RL-Selective | 0.82 | 0.81 |
Target: RL policy learns to expand chunks containing answer spans.
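For reference, these metrics reduce to a set overlap between the chunks the policy expands and the gold evidence chunks (a sketch; it assumes gold evidence chunk IDs are available, as in HotpotQA):

```python
def evidence_precision_recall(expanded_ids, gold_ids):
    """Sketch: precision/recall of expanded chunks against gold evidence chunks."""
    expanded, gold = set(expanded_ids), set(gold_ids)
    hits = len(expanded & gold)
    precision = hits / len(expanded) if expanded else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```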
refrag-lite/
├── configs/ # YAML configurations
├── data/ # Dataset scripts and storage
│ ├── scripts/ # Download and indexing scripts
│ └── hotpotqa/ # Downloaded data (gitignored)
├── refrag/ # Main package
│ ├── utils/ # Utilities (io, logging, metrics, plots)
│ ├── retrieval/ # BM25, dense, hybrid retrieval
│ ├── compress/ # Encoder, projector, mixed context
│ ├── llm/ # LLM loading, inference, adapters
│ ├── train/ # Training scripts (pretrain, SFT)
│ ├── rl/ # RL components (bandit, PPO, rewards)
│ └── eval/ # Evaluation and reporting
├── scripts/ # Shell scripts for pipelines
├── tests/ # Unit tests
├── reports/ # Generated reports and plots
├── checkpoints/ # Model checkpoints (gitignored)
├── Dockerfile # Docker configuration
├── Makefile # Build automation
├── requirements.txt # Pip dependencies
├── environment.yml # Conda environment
└── pyproject.toml # Project metadata
Edit configs/model_llm.yaml:
model_name: "mistralai/Mistral-7B-Instruct-v0.2"
# or "HuggingFaceH4/zephyr-7b-beta"
# or "TinyLlama/TinyLlama-1.1B-Chat-v1.0"Edit configs/model_encoder.yaml:
model_name: "sentence-transformers/all-roberta-large-v1"
# or "intfloat/e5-base-v2"Edit configs/rl_bandit.yaml:
token_budget: 1500 # Lower = more aggressive compression
compression_rate: 0.8 # Compress 80% of chunkspython -m refrag.rl.train_policy --config configs/rl_ppo.yaml --algo ppopytest tests/ -v --cov=refragruff check refrag/
black refrag/
mypy refrag/
Implement the refrag.rl.policies.Policy interface:
from typing import List

import numpy as np

from refrag.rl.policies import Policy

class MyPolicy(Policy):
    def select(self, features: np.ndarray, budget: int) -> List[int]:
        """Select chunk indices to expand under the token budget."""
        # Your logic here: score chunks and return the indices to expand
        selected_indices: List[int] = []
        return selected_indices
- Single-turn QA only: No multi-turn conversation support
- English only: Not tested on multilingual datasets
- Fixed encoder: Encoder frozen during training (not end-to-end)
- Offline RL: No online policy updates during inference
- Simple reward: Perplexity-based, doesn't directly optimize answer metrics
- End-to-end training: Jointly train encoder, projector, and LLM
- Better reward shaping: Direct EM/F1 optimization with REINFORCE
- Multi-turn RAG: Extend to conversational settings with memory
- Domain adaptation: Fine-tune on domain-specific corpora
- Online learning: Update policy based on user feedback
- Hierarchical compression: Multi-level compression (section → paragraph → sentence)
- Learned budget allocation: Policy also decides total budget per query
If you use this code in your research, please cite:
@inproceedings{lin2025refrag,
  title  = {REFRAG: Rethinking RAG-based Decoding},
  author = {Lin, Xiaoqiang and Ghosh, Aritra and Low, Bryan Kian Hsiang and Shrivastava, Anshumali and Mohan, Vijai},
  year   = {2025},
  url    = {https://arxiv.org/abs/2509.01092}
}

MIT License - see LICENSE file.
- REFRAG (Lin et al., 2025) for introducing the selective compression and expansion paradigm
- HotpotQA dataset: [Yang et al., 2018]
- Sentence Transformers: [Reimers & Gurevych, 2019]
- Hugging Face ecosystem
- OpenAI for ChatGPT-assisted debugging (no API used in code)
For issues, questions, or contributions:
- GitHub Issues: https://github.com/n33levo/refrag-lite/issues
- Discussions: https://github.com/n33levo/refrag-lite/discussions
Q: Can I run this on CPU only?
A: Yes, use tiny.yaml config and set device: cpu. Expect slower training.
Q: Would this work on a Mac?
A: Yes! The project supports MPS (Metal Performance Shaders) for Apple Silicon. Use device: auto and it will automatically detect and use MPS.
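For reference, device: auto can be resolved with standard PyTorch checks along these lines (a sketch; the actual helper in refrag-lite may differ):

```python
import torch

def resolve_device(requested="auto"):
    """Pick CUDA, Apple-Silicon MPS, or CPU when the config says device: auto."""
    if requested != "auto":
        return torch.device(requested)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```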
Q: How much VRAM do I need?
A: Tiny run: 4-6GB. Full run with an 8B model: 18-24GB (with PEFT). A MacBook M4 with 24GB unified memory works well.
Q: Can I use OpenAI/Anthropic APIs?
A: No, this project uses only open-weight models. You could adapt it, but it would break the efficiency benefits.
Q: How long does full training take?
A: On a single A100: pretrain (~4h) + SFT (~2h) + RL (~3h) + eval (~1h) ≈ 10 hours total. On a MacBook M4, expect 2-3× longer.
Q: Does it work with other datasets?
A: Yes, but you'll need to adapt the data scripts. Tested on HotpotQA; it should work with NQ, TriviaQA, MSMARCO, etc.
