NikYak228/AniQ

AniQ: Quantized Multimodal Document Retrieval

A retrieval-augmented generation (RAG) system for academic papers that combines visual and text search with aggressive vector compression. The core contribution is an implementation of TurboQuant, a near-optimal scalar quantization scheme that compresses late-interaction embeddings by 10.6x with no measured loss in Recall@1 (0.98 for both FP32 and 3-bit).

Key Results

Metric                 FP32          MSE 3-bit
Storage                145.8 MiB     13.7 MiB (10.6x compression)
Search latency (warm)  7.2 ms/query  6.5 ms/query
Recall@1               0.98          0.98
Recall@5               1.00          1.00
MRR                    0.99          0.99
Cosine drop            -             0.015

About 10.6x less storage with the same Recall@1, Recall@5, and MRR, and MSE 3-bit (warm) search is slightly faster than FP32 search (6.5 vs 7.2 ms/query).

Terminology

Label       Meaning
FP32        Uncompressed baseline: full 32-bit floating-point vectors
MSE 3-bit   TurboQuant MSE quantization at 3 bits per coordinate
PROD 3-bit  TurboQuant PROD quantization at 3 bits per coordinate (unbiased inner-product variant)
(warm)      Dequant-cache mode: vectors are reconstructed into dense memory once at startup, then searched as FP32. Measured slightly faster than raw FP32 (6.5 vs 7.2 ms/query), attributed to a smaller cache footprint
(direct)    Search on the compressed index directly: no reconstruction step, but higher per-query latency
(theory)    Theoretical storage lower bound in bytes (bits × values / 8), without codebook or rotation-matrix overhead
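
As a quick check of the (theory) label, the bound can be evaluated for the index size reported under End-to-End Timing (9332 rows, dim=4096). This is only an arithmetic sketch; the function name `theory_mib` is made up for illustration and is not part of the repository.

```python
def theory_mib(rows: int, dim: int, bits: int) -> float:
    """Theoretical storage in MiB: bits x values / 8 bytes, no overhead."""
    return rows * dim * bits / 8 / 2**20

fp32_mib = theory_mib(9332, 4096, 32)  # uncompressed FP32 baseline
mse3_mib = theory_mib(9332, 4096, 3)   # 3 bits per coordinate

print(f"FP32: {fp32_mib:.1f} MiB, 3-bit: {mse3_mib:.1f} MiB, "
      f"ratio: {fp32_mib / mse3_mib:.2f}x")
```

The ideal ratio is 32/3 ≈ 10.67x; the 10.6x figure in Key Results is consistent with that bound plus a small amount of overhead.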

Architecture

PDF Documents
     |
     v
+--------------------+     +------------------+
| Text Extraction    |     | Visual Encoding  |
| (native + GLM-OCR) |     | (ColEmbed-8B)    |
+---------+----------+     +--------+---------+
          |                         |
          v                         v
+--------------------+     +------------------+
| E5 Text Embeddings |     | Late-Interaction |
| (768-dim, L2-norm) |     | Embeddings       |
+---------+----------+     | (4096-dim, var-  |
          |                | length per doc)  |
          |                +--------+---------+
          |                         |
          v                         v
+--------------------------------------------+
|          TurboQuant MSE 3-bit              |
|  (Beta + Lloyd-Max + Random Rotation)      |
|  10.6x compression, <1% quality loss       |
+---------------------+----------------------+
                      |
                      v
+--------------------------------------------+
|        Hybrid Retrieval                    |
|  Visual: MaxSim + PLAID candidate pruning  |
|  Text: Dense (E5) + Lexical (FTS5)         |
+---------------------+----------------------+
                      |
                      v
+--------------------------------------------+
|   RAG Pipeline: Retrieve -> Compress ->    |
|   Generate (LMStudio) -> Attribute Sources |
+--------------------------------------------+

Methods

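TurboQuant (based on arXiv:2504.19874): Normalize each vector to the unit sphere, apply a random orthogonal rotation (RHT/FWHT, a randomized Hadamard transform computed with the fast Walsh-Hadamard transform, for power-of-2 dimensions), then scalar-quantize each coordinate with a Lloyd-Max codebook optimized for the Beta distribution that the coordinates of a random point on the sphere follow. At 3 bits per coordinate the ideal ratio against FP32 is 32/3 ≈ 10.67x; the measured index reaches 10.6x, and the scheme comes with theoretical distortion guarantees.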
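
The three steps above can be sketched in NumPy. This is an illustration under simplifying assumptions, not the code in turboquant_mse.py: it uses a dense QR-based rotation instead of RHT/FWHT, fits the codebook on sampled sphere coordinates rather than on the analytic Beta density, and stores one byte per code instead of packing 3-bit codes. All function names are hypothetical.

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix keeps the distribution uniform

def lloyd_max_levels(samples, bits, iters=30):
    """1-D Lloyd iteration: alternate nearest-level assignment and cell means."""
    n_levels = 2 ** bits
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2  # decision boundaries
        cell = np.searchsorted(edges, samples)
        for k in range(n_levels):
            members = samples[cell == k]
            if members.size:
                levels[k] = members.mean()
    return levels

def quantize(x, rotation, levels):
    """Normalize, rotate, and map every coordinate to its nearest level index."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    rotated = (x / norms) @ rotation.T
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, rotated).astype(np.uint8), norms

def dequantize(codes, norms, rotation, levels):
    """Look up the levels, undo the rotation, and restore the norms."""
    return (levels[codes] @ rotation) * norms

rng = np.random.default_rng(0)
dim, bits = 64, 3

# Empirical stand-in for the Beta-distributed coordinates of unit vectors.
sphere = rng.standard_normal((4000, dim))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
levels = lloyd_max_levels(sphere.ravel(), bits)
rotation = random_rotation(dim, rng)

x = rng.standard_normal((500, dim)) * np.linspace(0.1, 3.0, dim)  # anisotropic data
codes, norms = quantize(x, rotation, levels)
x_hat = dequantize(codes, norms, rotation, levels)
unit_mse = np.mean(np.sum((x / norms - x_hat / norms) ** 2, axis=1))
print(f"MSE on unit vectors at {bits} bits: {unit_mse:.3f}")
```

The rotation is what lets a single codebook serve every coordinate: after it, each coordinate of a unit vector follows the same distribution regardless of how the input data is shaped.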

MaxSim Retrieval: ColBERT-style late-interaction scoring in which each query token is matched to its most similar document token and the per-token maxima are summed into the document score. Quantized search operates in the rotated space via lookup tables, avoiding full reconstruction.
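
A minimal sketch of both ideas, assuming unit-norm document tokens and a query that has already been rotated into the same space as the codes. The function names and the evenly spaced stand-in codebook are hypothetical; the native backends in turboquant_lut.c are organized differently.

```python
import numpy as np

def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product with any document token."""
    sims = query_tokens @ doc_tokens.T  # shape (n_query, n_doc)
    return float(sims.max(axis=1).sum())

def maxsim_lut(query_rot, doc_codes, levels):
    """Same score computed from quantization codes through a lookup table."""
    cols = np.arange(query_rot.shape[1])
    total = 0.0
    for q in query_rot:
        lut = q[:, None] * levels[None, :]       # lut[j, k] = q[j] * levels[k]
        sims = lut[cols, doc_codes].sum(axis=1)  # one similarity per document token
        total += sims.max()
    return float(total)

rng = np.random.default_rng(0)
levels = np.linspace(-0.3, 0.3, 8)            # stand-in 3-bit codebook
doc_codes = rng.integers(0, 8, size=(5, 16))  # 5 document tokens, dim 16
query_rot = rng.standard_normal((3, 16))      # 3 query tokens, already rotated

exact = maxsim(query_rot, levels[doc_codes])  # reconstruct, then score
via_lut = maxsim_lut(query_rot, doc_codes, levels)
print(exact, via_lut)
```

Because the rotation is orthogonal, dot products in the rotated space equal dot products in the original space, so the document vectors never need to be rotated back.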

PLAID Pruning: K-Means clustering of document rows with a centroid-based pre-filter reduces the number of candidate documents before exact MaxSim scoring.
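
The pre-filter can be illustrated with a much-simplified sketch: cluster all document rows, then keep only documents that own at least one row in the clusters closest to the query tokens. The spherical k-means, the `nprobe` parameter, and the function names are assumptions made for this example, and the real PLAID algorithm has further stages that are omitted here.

```python
import numpy as np

def kmeans(rows, k, iters=10, seed=0):
    """Spherical k-means on unit-norm rows; returns centroids and row assignments."""
    rng = np.random.default_rng(seed)
    centroids = rows[rng.choice(len(rows), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = (rows @ centroids.T).argmax(axis=1)
        for c in range(k):
            members = rows[assign == c]
            if len(members):
                mean = members.mean(axis=0)
                centroids[c] = mean / np.linalg.norm(mean)
    assign = (rows @ centroids.T).argmax(axis=1)
    return centroids, assign

def candidate_docs(query_tokens, centroids, assign, row_doc_ids, nprobe=2):
    """Documents owning at least one row in the top centroids of any query token."""
    top = np.argsort(-(query_tokens @ centroids.T), axis=1)[:, :nprobe]
    probed = np.unique(top)
    return np.unique(row_doc_ids[np.isin(assign, probed)])

rng = np.random.default_rng(1)
rows = rng.standard_normal((200, 32))
rows /= np.linalg.norm(rows, axis=1, keepdims=True)
row_doc_ids = np.repeat(np.arange(20), 10)  # 20 documents, 10 rows each

centroids, assign = kmeans(rows, k=16)
query = rows[row_doc_ids == 3][:4]          # query tokens taken from document 3
cands = candidate_docs(query, centroids, assign, row_doc_ids, nprobe=1)
print(f"{len(cands)} of 20 documents kept for exact MaxSim")
```

Only the surviving candidates are then scored with exact MaxSim, which is where the savings come from on large libraries.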

Hybrid Text Search: Combines dense retrieval (E5 multilingual embeddings) with lexical fallback (SQLite FTS5) for robust passage-level search.
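
One way such a combination can work is sketched below with Python's built-in sqlite3 module. The fallback rule (use lexical results when the best dense cosine score is below a threshold), the threshold value, the table layout, and the function names are all assumptions for illustration; paper_library.py may combine the two signals differently. The identity matrix stands in for real E5 embeddings, and the example requires an SQLite build with FTS5 enabled.

```python
import sqlite3
import numpy as np

def build_fts(passages):
    """In-memory FTS5 index; rowid is the passage position plus one."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE VIRTUAL TABLE passages USING fts5(body)")
    db.executemany("INSERT INTO passages(rowid, body) VALUES (?, ?)",
                   list(enumerate(passages, start=1)))
    return db

def lexical_search(db, query, k=3):
    """BM25-ranked FTS5 match on any query term."""
    terms = " OR ".join(f'"{t}"' for t in query.split())
    rows = db.execute(
        "SELECT rowid FROM passages WHERE passages MATCH ? "
        "ORDER BY bm25(passages) LIMIT ?", (terms, k)).fetchall()
    return [r[0] - 1 for r in rows]

def hybrid_search(db, query, query_vec, passage_vecs, k=3, min_cosine=0.8):
    """Dense search first; fall back to lexical search when the match is weak."""
    scores = passage_vecs @ query_vec  # cosine, since vectors are L2-normalized
    order = np.argsort(-scores)[:k]
    if scores[order[0]] >= min_cosine:
        return [int(i) for i in order]
    return lexical_search(db, query, k)

passages = ["random rotation of unit vectors",
            "scalar quantization with a codebook",
            "late interaction retrieval"]
db = build_fts(passages)
passage_vecs = np.eye(3)  # stand-in for L2-normalized E5 embeddings

confident = hybrid_search(db, "rotation", np.array([1.0, 0.0, 0.0]), passage_vecs)
weak_vec = np.ones(3) / np.sqrt(3)
fallback = hybrid_search(db, "codebook quantization", weak_vec, passage_vecs)
print(confident, fallback)
```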

See MATH_BASIS.md for the full mathematical treatment.

Project Structure

turboquant_mse.py              Core quantization: MSE, PROD, pack/unpack, search backends
paper_library.py               Document indexing, retrieval, RAG pipeline
chat_ui.py                     Web UI for conversational search
main.py                        Demo: TF-IDF vs BM25 vs ColEmbed on Flickr30k

benchmark_turboquant.py        Quantization quality/speed benchmark
benchmark_turboquant_compare.py MSE vs PROD comparison benchmark

turboquant_lut.c               Native C backend (LUT, tiled BLAS, fastscan)
turboquant_metal.m             Apple Metal GPU compute shaders

scripts/
  plot_turboquant_results.py   Generate benchmark visualizations
  benchmark_library_quantization.py  Full pipeline benchmark
  eval_retrieval_smoke.py      Retrieval quality smoke test

tests/
  test_turboquant_mse.py       150+ unit tests for quantization pipeline
  test_paper_library.py        Integration tests for indexing and search
  test_chat_ui.py              Chat UI handler tests
  test_turboquant_prod.py      PROD inner-product variant tests

Quick Start

Installation

pip install -r requirements.txt

Run the demo

Compare TF-IDF, BM25, and ColEmbed multimodal retrieval on Flickr30k:

export COLEMBED_PATH=/path/to/colembed-8b
python main.py

Index your own papers

python paper_library.py --papers-dir ./papers --index-dir ./output/my_library

Start the web UI

python chat_ui.py --index-dir ./output/my_library

Then open http://localhost:8765 in your browser.

Run benchmarks

python benchmark_turboquant.py
python scripts/plot_turboquant_results.py

Run tests

PYTHONPATH=. pytest tests/ -q

Benchmarks

Results Dashboard

[Figure: TurboQuant Results Dashboard]

Distortion-Rate

Empirical MSE falls within the theoretical bounds for all tested bit-widths:

[Figure: Distortion-Rate: Theory vs Implementation]

Bits  Empirical MSE  Lower bound  Upper bound
1     0.363          0.250        0.681
2     0.117          0.063        0.170
3     0.035          0.016        0.043
4     0.011          0.004        0.011
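
The bound columns can be approximately reproduced with a few lines of Python, assuming they are the TurboQuant paper's bounds for unit-norm vectors: a lower bound of 4^-b and an upper bound of (√3·π/2)·4^-b. That assumption is not confirmed by this README, and MATH_BASIS.md remains the authoritative source for what the repository computes. The function name is hypothetical.

```python
import math

def mse_bounds(bits):
    """Assumed distortion bounds for unit-norm vectors at a given bit-width."""
    lower = 4.0 ** -bits
    upper = math.sqrt(3) * math.pi / 2 * lower
    return lower, upper

# Values copied from the table above: (empirical, lower, upper).
table = {
    1: (0.363, 0.250, 0.681),
    2: (0.117, 0.063, 0.170),
    3: (0.035, 0.016, 0.043),
    4: (0.011, 0.004, 0.011),
}

for bits, (empirical, _, _) in table.items():
    lower, upper = mse_bounds(bits)
    print(f"{bits} bits: lower={lower:.4f} empirical={empirical:.3f} upper={upper:.4f}")
```

Under this assumption the computed bounds agree with the table to within 0.001, and the two bounds differ by a constant factor of about 2.72 at every bit-width.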

Compression vs Quality

[Figure: Compression vs Retrieval Quality]

Backend Performance

[Figure: Paper Library Backend Performance]

Search Latency

Three search modes are available, each suited to a different use case:

Mode                Search time (50 queries)  Per-query  Storage    Use case
FP32                0.36 s                    7.2 ms     145.8 MiB  Baseline, no compression
MSE 3-bit (warm)    0.32 s                    6.5 ms     13.7 MiB   Long-running service (reconstruct once at startup)
MSE 3-bit (direct)  2.04 s                    40.8 ms    13.7 MiB   Cold one-shot query, no reconstruction step

MSE 3-bit (warm) reconstructs the quantized vectors into dense memory once (1.09 s startup cost), then searches the dense vectors. After warm-up it is slightly faster than FP32 in this benchmark (6.5 vs 7.2 ms/query), which is attributed to better CPU cache utilization from the smaller index footprint.
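
The trade-off between the warm and direct modes comes down to how many queries a process will serve. A small arithmetic sketch using only the figures above (the helper name is hypothetical):

```python
import math

warmup_s = 1.09   # one-time dequantization cost of warm mode
warm_ms = 6.5     # per-query latency after warm-up
direct_ms = 40.8  # per-query latency on the compressed index

def total_seconds(n_queries, per_query_ms, startup_s=0.0):
    """Total wall time for a batch of queries, including any startup cost."""
    return startup_s + n_queries * per_query_ms / 1000.0

# Smallest query count at which warm mode is cheaper than direct mode.
break_even = math.ceil(warmup_s / ((direct_ms - warm_ms) / 1000.0))
print(f"warm mode pays off from about {break_even} queries")
```

With these numbers the warm-up cost is recovered after roughly 32 queries, which matches the suggested split between one-shot use and long-running services.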

End-to-End Timing

Stage                                     Time
Query encoding (ColEmbed-8B, 50 queries)  85.2 s
MSE quantization (9332 rows, dim=4096)    2.2 s
One-time dequant warm-up                  1.1 s
Warm search (50 queries)                  0.32 s

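Query encoding by the ColEmbed-8B model is the dominant bottleneck at about 1.7 s per query (85.2 s for 50 queries), compared with 6.5 ms per query for warm search. After the RHT/FWHT optimization, quantization and search are no longer on the critical path.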

Environment Variables

Variable           Default                        Description
COLEMBED_PATH      colembed-8b                    Path to ColEmbed-8B model weights
TEXT_EMBED_MODEL   intfloat/multilingual-e5-base  Text embedding model
HF_CACHE           .hf-cache                      Hugging Face cache directory
GLMOCR_BIN         .venv-glmocr-sdk/bin/glmocr    Path to GLM-OCR binary
LMSTUDIO_BASE_URL  http://127.0.0.1:1234          LMStudio API endpoint
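
For scripts that wrap the project, the same variables and defaults can be resolved with the usual environment lookup pattern. This is a sketch of that pattern only; how the repository itself reads these variables is not shown here, and `load_settings` is a hypothetical helper.

```python
import os

# Defaults copied from the table above.
DEFAULTS = {
    "COLEMBED_PATH": "colembed-8b",
    "TEXT_EMBED_MODEL": "intfloat/multilingual-e5-base",
    "HF_CACHE": ".hf-cache",
    "GLMOCR_BIN": ".venv-glmocr-sdk/bin/glmocr",
    "LMSTUDIO_BASE_URL": "http://127.0.0.1:1234",
}

def load_settings(env=os.environ):
    """Return each setting from the environment, falling back to its default."""
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}

settings = load_settings()
print(settings["LMSTUDIO_BASE_URL"])
```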

License

MIT
