A retrieval-augmented generation (RAG) system for academic papers that combines visual and text search with aggressive vector compression. The core contribution is an implementation of TurboQuant, a near-optimal scalar quantization scheme that compresses late-interaction embeddings by 10.6x while keeping Recall@1 at 0.98, the same as the FP32 baseline.
| Metric | FP32 | MSE 3-bit |
|---|---|---|
| Storage | 145.8 MiB | 13.7 MiB (10.6x compression) |
| Search latency (warm) | 7.2 ms/query | 6.5 ms/query |
| Recall@1 | 0.98 | 0.98 |
| Recall@5 | 1.00 | 1.00 |
| MRR | 0.99 | 0.99 |
| Cosine drop | — | 0.015 |
About 10.6x less storage at the same retrieval quality, and MSE 3-bit (warm) search is slightly faster than FP32 search (6.5 vs 7.2 ms/query).
| Label | Meaning |
|---|---|
| FP32 | Uncompressed baseline — full 32-bit floating-point vectors |
| MSE 3-bit | TurboQuant MSE quantization at 3 bits per coordinate |
| PROD 3-bit | TurboQuant PROD quantization at 3 bits per coordinate (unbiased inner-product variant) |
| (warm) | Dequant-cache mode: vectors are reconstructed into dense memory once at startup, then searched as FP32. Faster than raw FP32 thanks to smaller cache footprint |
| (direct) | Search on the compressed index directly — no reconstruction step, but higher per-query latency |
| (theory) | Theoretical storage lower bound (bits × values / 8), without codebook or rotation-matrix overhead |
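
As a quick check of the (theory) formula, the sketch below computes bits × values / 8 for FP32 and for 3-bit codes. It assumes the storage figures refer to the same index as the timing table further down (9332 rows, dim=4096); the README does not state this explicitly, so treat the row count and dimension as assumptions.

```python
# Assumption: the storage numbers describe an index of 9332 rows x 4096 dims,
# the size reported for the quantization stage later in this README.
ROWS, DIM = 9332, 4096
MIB = 1024 * 1024


def theory_bytes(bits: int, values: int) -> float:
    """Theoretical payload: bits x values / 8, no codebook or rotation overhead."""
    return bits * values / 8


values = ROWS * DIM
fp32_mib = theory_bytes(32, values) / MIB
mse3_mib = theory_bytes(3, values) / MIB
ratio = fp32_mib / mse3_mib

print(f"FP32 : {fp32_mib:.1f} MiB")
print(f"3-bit: {mse3_mib:.1f} MiB")
print(f"ratio: {ratio:.2f}x")
```

Under these assumptions the payload sizes round to 145.8 MiB and 13.7 MiB. The ideal ratio is 32/3 ≈ 10.67x, slightly above the measured 10.6x, which is consistent with a small amount of per-index overhead.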

                    PDF Documents
                          |
                          v
    +--------------------+    +------------------+
    | Text Extraction    |    | Visual Encoding  |
    | (native + GLM-OCR) |    | (ColEmbed-8B)    |
    +---------+----------+    +--------+---------+
              |                        |
              v                        v
    +--------------------+    +------------------+
    | E5 Text Embeddings |    | Late-Interaction |
    | (768-dim, L2-norm) |    | Embeddings       |
    +---------+----------+    | (4096-dim, var-  |
              |               | length per doc)  |
              |               +--------+---------+
              |                        |
              v                        v
    +--------------------------------------------+
    |            TurboQuant MSE 3-bit            |
    |    (Beta + Lloyd-Max + Random Rotation)    |
    |    10.6x compression, <1% quality loss     |
    +---------------------+----------------------+
                          |
                          v
    +--------------------------------------------+
    |              Hybrid Retrieval              |
    | Visual: MaxSim + PLAID candidate pruning   |
    | Text:   Dense (E5) + Lexical (FTS5)        |
    +---------------------+----------------------+
                          |
                          v
    +--------------------------------------------+
    | RAG Pipeline: Retrieve -> Compress ->      |
    | Generate (LMStudio) -> Attribute Sources   |
    +--------------------------------------------+

TurboQuant (based on arXiv:2504.19874): Normalize vectors to the unit sphere, apply a random orthogonal rotation (RHT/FWHT for power-of-2 dimensions), then scalar-quantize each coordinate with a Lloyd-Max codebook optimized for the Beta distribution that the coordinates of a random unit vector follow. At 3 bits per coordinate this gives 10.6x compression with theoretical distortion guarantees.
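
The normalize, rotate, quantize steps can be sketched in NumPy as follows. This is a minimal illustration, not the code in `turboquant_mse.py`: it swaps in a dense QR-based rotation for RHT/FWHT and fits the codebook empirically with 1-D Lloyd iterations on sampled sphere coordinates rather than from the analytic Beta density. All function names are hypothetical.

```python
import numpy as np


def random_rotation(dim, rng):
    """Dense random orthogonal matrix via QR (stand-in for RHT/FWHT)."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # fix column signs; result stays orthogonal


def lloyd_max_codebook(samples, bits, iters=30):
    """Empirical 1-D Lloyd-Max: alternate nearest-level assignment and mean update."""
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                levels[k] = samples[idx == k].mean()
    return np.sort(levels)


def quantize(x, rot, levels):
    """Return per-coordinate codes in the rotated space plus the row norms."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    y = (x / norms) @ rot.T  # unit vectors, rotated
    codes = np.abs(y[..., None] - levels).argmin(axis=-1).astype(np.uint8)
    return codes, norms


def dequantize(codes, norms, rot, levels):
    """Look up levels, rotate back, and restore the original norms."""
    return (levels[codes] @ rot) * norms


rng = np.random.default_rng(0)
dim, bits = 64, 3
rot = random_rotation(dim, rng)

# Training samples: coordinates of random unit vectors in `dim` dimensions.
sphere = rng.standard_normal((2000, dim))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
levels = lloyd_max_codebook(sphere.ravel()[:20000], bits)

x = rng.standard_normal((200, dim)) * 3.0
codes, norms = quantize(x, rot, levels)
x_hat = dequantize(codes, norms, rot, levels)
cos = np.sum(x * x_hat, axis=1) / (
    np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1)
)
print("mean cosine similarity:", float(cos.mean()))
```

Storing the norm separately keeps the codebook independent of vector scale, so a single set of 2^bits levels serves every row.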
MaxSim Retrieval: ColBERT-style late-interaction scoring where each query token finds its best-matching document token. Quantized search operates in the rotated space via lookup tables, avoiding full reconstruction.
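
A minimal sketch of MaxSim scoring and of a lookup-table dot product on quantized codes is shown below. It assumes unit-norm token vectors and a query that has already been rotated into the quantized space; `maxsim`, `rank`, and `lut_scores` are hypothetical names, not functions from this repository.

```python
import numpy as np


def maxsim(query, doc):
    """query: (n_q, dim), doc: (n_d, dim). Sum over query tokens of the best dot product."""
    sims = query @ doc.T
    return float(sims.max(axis=1).sum())


def rank(query, docs):
    """Document indices ordered by descending MaxSim score."""
    scores = [maxsim(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def lut_scores(q_rot, codes, levels):
    """Dot products of one rotated query token with coded doc tokens, without reconstruction.

    table[j, k] holds q_rot[j] * levels[k]; each doc token sums one entry per coordinate.
    """
    table = q_rot[:, None] * levels[None, :]                 # (dim, n_levels)
    return table[np.arange(q_rot.size), codes].sum(axis=1)   # (n_d,)


query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
doc_b = np.array([[1.0, 0.0]])
order = rank(query, [doc_a, doc_b])

rng = np.random.default_rng(0)
levels = np.linspace(-1.0, 1.0, 8)            # placeholder 3-bit codebook
codes = rng.integers(0, 8, size=(5, 4))       # 5 doc tokens, dim 4
q_rot = rng.standard_normal(4)
lut = lut_scores(q_rot, codes, levels)
dense = levels[codes] @ q_rot                 # same scores via full reconstruction
```

The table costs dim × 2^bits multiplications per query token and is then reused for every document token, which is why the quantized path can skip reconstruction.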
PLAID Pruning: K-Means clustering of document rows with a centroid-based pre-filter reduces the number of candidate documents before exact MaxSim scoring.
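
The pre-filter idea can be illustrated with a heavily simplified sketch: cluster all document rows, probe the nearest centroids for each query token, and run exact MaxSim only on documents that own a row in a probed cluster. The real PLAID algorithm is more elaborate; the function names, the Euclidean metric, and parameters such as `n_probe` are assumptions made for illustration.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)


def kmeans(rows, k, rng, iters=10):
    """Plain Lloyd k-means; returns centroids and the final row-to-centroid assignment."""
    centroids = rows[rng.choice(len(rows), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = sq_dists(rows, centroids).argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = rows[assign == c].mean(axis=0)
    return centroids, sq_dists(rows, centroids).argmin(axis=1)


def candidate_docs(query, centroids, assign, row_doc, n_probe=1):
    """Documents with at least one row in a cluster probed by some query token."""
    probed = np.argsort(sq_dists(query, centroids), axis=1)[:, :n_probe]
    return np.unique(row_doc[np.isin(assign, probed)])


def maxsim(query, doc):
    return float((query @ doc.T).max(axis=1).sum())


rng = np.random.default_rng(0)
docs = []
for _ in range(40):
    d = rng.standard_normal((4, 16))
    docs.append(d / np.linalg.norm(d, axis=1, keepdims=True))
rows = np.vstack(docs)
row_doc = np.repeat(np.arange(len(docs)), 4)   # owning document of each row
centroids, assign = kmeans(rows, k=16, rng=rng)

query = docs[3][:2]                            # two tokens copied from document 3
cands = candidate_docs(query, centroids, assign, row_doc)
best = max(cands, key=lambda i: maxsim(query, docs[i]))
print(len(cands), "of", len(docs), "documents scored exactly; best =", best)
```

Raising `n_probe` trades speed for recall: more clusters are probed, so fewer relevant documents are missed but more candidates reach the exact scoring stage.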
Hybrid Text Search: Combines dense retrieval (E5 multilingual embeddings) with lexical fallback (SQLite FTS5) for robust passage-level search.
See MATH_BASIS.md for the full mathematical treatment.

    turboquant_mse.py                    Core quantization: MSE, PROD, pack/unpack, search backends
    paper_library.py                     Document indexing, retrieval, RAG pipeline
    chat_ui.py                           Web UI for conversational search
    main.py                              Demo: TF-IDF vs BM25 vs ColEmbed on Flickr30k
    benchmark_turboquant.py              Quantization quality/speed benchmark
    benchmark_turboquant_compare.py      MSE vs PROD comparison benchmark
    turboquant_lut.c                     Native C backend (LUT, tiled BLAS, fastscan)
    turboquant_metal.m                   Apple Metal GPU compute shaders
    scripts/
      plot_turboquant_results.py         Generate benchmark visualizations
      benchmark_library_quantization.py  Full pipeline benchmark
      eval_retrieval_smoke.py            Retrieval quality smoke test
    tests/
      test_turboquant_mse.py             150+ unit tests for quantization pipeline
      test_paper_library.py              Integration tests for indexing and search
      test_chat_ui.py                    Chat UI handler tests
      test_turboquant_prod.py            PROD inner-product variant tests

Install the dependencies:

    pip install -r requirements.txt

Compare TF-IDF, BM25, and ColEmbed multimodal retrieval on Flickr30k:

    export COLEMBED_PATH=/path/to/colembed-8b
    python main.py

Index a directory of papers:

    python paper_library.py --papers-dir ./papers --index-dir ./output/my_library

Start the chat UI:

    python chat_ui.py --index-dir ./output/my_library

Then open http://localhost:8765 in your browser.

Run the quantization benchmark and plot the results:

    python benchmark_turboquant.py
    python scripts/plot_turboquant_results.py

Run the tests:

    PYTHONPATH=. pytest tests/ -q

Empirical MSE falls within the theoretical bounds for all tested bit-widths:
| Bits | Empirical MSE | Lower bound | Upper bound |
|---|---|---|---|
| 1 | 0.363 | 0.250 | 0.681 |
| 2 | 0.117 | 0.063 | 0.170 |
| 3 | 0.035 | 0.016 | 0.043 |
| 4 | 0.011 | 0.004 | 0.011 |
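
The bound columns appear to follow a simple closed form. The sketch below assumes a lower bound of 4^-b and an upper bound of (√3·π/2)·4^-b for b bits, which is my reading of the TurboQuant paper and not something this README states; see MATH_BASIS.md for the authoritative derivation. It recomputes the bounds next to the empirical values from the table above.

```python
import math

# Assumption: lower bound 4**-b, upper bound (sqrt(3) * pi / 2) * 4**-b.
# These constants are not given in this README; they are used here only because
# they approximately reproduce the table's bound columns.
UPPER_CONST = math.sqrt(3) * math.pi / 2  # about 2.72

# Empirical MSE values copied from the table above.
empirical = {1: 0.363, 2: 0.117, 3: 0.035, 4: 0.011}


def bounds(bits):
    """Return (lower, upper) MSE bounds for unit-norm vectors at the given bit-width."""
    base = 4.0 ** -bits
    return base, UPPER_CONST * base


for b, mse in empirical.items():
    lower, upper = bounds(b)
    print(f"{b} bit: lower {lower:.4f}  empirical {mse:.3f}  upper {upper:.4f}")
```

Under this assumption each extra bit divides both bounds by 4, and the empirical column follows roughly the same rate. At 4 bits the empirical value sits right at the upper bound once both are rounded to three decimals.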
Three search modes are available, each suited to a different use case:
| Mode | Search time (50 queries) | Per-query | Storage | Use case |
|---|---|---|---|---|
| FP32 | 0.36 s | 7.2 ms | 145.8 MiB | Baseline, no compression |
| MSE 3-bit (warm) | 0.32 s | 6.5 ms | 13.7 MiB | Long-running service (reconstruct once at startup) |
| MSE 3-bit (direct) | 2.04 s | 40.8 ms | 13.7 MiB | Cold one-shot query, no reconstruction step |
MSE 3-bit (warm) reconstructs quantized vectors into memory once (1.09 s startup cost), then searches on dense vectors. After warm-up it is faster than FP32 because the smaller index footprint improves CPU cache utilization.
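
The choice between warm and direct mode comes down to how many queries a process will serve. A small calculation using only the figures quoted above (1.09 s warm-up, 6.5 ms vs 40.8 ms per query) gives the break-even point; the helper name `total_ms` is hypothetical.

```python
import math

WARMUP_S = 1.09     # one-time dequantization cost for warm mode
WARM_MS = 6.5       # per-query latency, MSE 3-bit (warm)
DIRECT_MS = 40.8    # per-query latency, MSE 3-bit (direct)


def total_ms(n_queries, per_query_ms, startup_s=0.0):
    """Total wall time in milliseconds for n_queries, including any startup cost."""
    return startup_s * 1000 + n_queries * per_query_ms


# Smallest query count at which warm mode is cheaper overall than direct mode.
break_even = math.ceil(WARMUP_S * 1000 / (DIRECT_MS - WARM_MS))
print(break_even)  # 32
```

With these numbers, direct mode is cheaper for up to 31 queries per process and warm mode wins from the 32nd query onward.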
| Stage | Time |
|---|---|
| Query encoding (ColEmbed-8B, 50 queries) | 85.2 s |
| MSE quantization (9332 rows, dim=4096) | 2.2 s |
| One-time dequant warm-up | 1.1 s |
| Warm search (50 queries) | 0.32 s |
Query encoding by the ColEmbed-8B model is the dominant bottleneck (85.2 s for 50 queries, about 1.7 s per query). With the RHT/FWHT optimization, quantization (2.2 s) and warm search (0.32 s) are no longer on the critical path.
| Variable | Default | Description |
|---|---|---|
| `COLEMBED_PATH` | `colembed-8b` | Path to ColEmbed-8B model weights |
| `TEXT_EMBED_MODEL` | `intfloat/multilingual-e5-base` | Text embedding model |
| `HF_CACHE` | `.hf-cache` | Hugging Face cache directory |
| `GLMOCR_BIN` | `.venv-glmocr-sdk/bin/glmocr` | Path to GLM-OCR binary |
| `LMSTUDIO_BASE_URL` | `http://127.0.0.1:1234` | LMStudio API endpoint |
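
For scripts that need the same settings, the variables can be read with their documented defaults as sketched below. The `Settings` dataclass and `load_settings` function are hypothetical helpers for illustration, not part of this repository; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    colembed_path: str
    text_embed_model: str
    hf_cache: str
    glmocr_bin: str
    lmstudio_base_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from the environment, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        colembed_path=env.get("COLEMBED_PATH", "colembed-8b"),
        text_embed_model=env.get("TEXT_EMBED_MODEL", "intfloat/multilingual-e5-base"),
        hf_cache=env.get("HF_CACHE", ".hf-cache"),
        glmocr_bin=env.get("GLMOCR_BIN", ".venv-glmocr-sdk/bin/glmocr"),
        lmstudio_base_url=env.get("LMSTUDIO_BASE_URL", "http://127.0.0.1:1234"),
    )


defaults = load_settings({})
custom = load_settings({"COLEMBED_PATH": "/path/to/colembed-8b"})
print(defaults.lmstudio_base_url)
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real process environment.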



