AniQ: Quantized Multimodal Document Retrieval

A retrieval-augmented generation (RAG) system for academic papers that combines visual and text search with aggressive vector compression. The core contribution is an implementation of TurboQuant, a near-optimal scalar quantization scheme that compresses late-interaction embeddings by 10.6x while keeping Recall@1 at the FP32 baseline's 0.98.

Key Results

Metric	FP32	MSE 3-bit
Storage	145.8 MiB	13.7 MiB (10.6x compression)
Search latency (warm)	7.2 ms/query	6.5 ms/query
Recall@1	0.98	0.98
Recall@5	1.00	1.00
MRR	0.99	0.99
Cosine drop	—	0.015

About 10.6x less storage with identical Recall@1, Recall@5, and MRR on this benchmark. In warm mode, MSE 3-bit search is also slightly faster than FP32 search (6.5 vs 7.2 ms/query).

Terminology

Label	Meaning
FP32	Uncompressed baseline — full 32-bit floating-point vectors
MSE 3-bit	TurboQuant MSE quantization at 3 bits per coordinate
PROD 3-bit	TurboQuant PROD quantization at 3 bits per coordinate (unbiased inner-product variant)
(warm)	Dequant-cache mode: vectors are reconstructed into dense memory once at startup, then searched as FP32. Faster than raw FP32 thanks to smaller cache footprint
(direct)	Search on the compressed index directly — no reconstruction step, but higher per-query latency
(theory)	Theoretical storage lower bound (`bits × values / 8`), without codebook or rotation-matrix overhead
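
As a quick check of the `(theory)` formula, the sketch below applies `bits × values / 8` to the index shape reported under End-to-End Timing (9332 rows, dim=4096). The helper name `theoretical_mib` is illustrative and not part of the project.

```python
def theoretical_mib(rows: int, dim: int, bits: int) -> float:
    """Payload size in MiB for rows x dim values stored at `bits` bits each."""
    values = rows * dim
    return bits * values / 8 / 2**20

rows, dim = 9332, 4096  # index shape from the End-to-End Timing table
fp32 = theoretical_mib(rows, dim, 32)
q3 = theoretical_mib(rows, dim, 3)
print(f"FP32: {fp32:.1f} MiB, 3-bit: {q3:.1f} MiB, ratio: {fp32 / q3:.2f}x")
# prints: FP32: 145.8 MiB, 3-bit: 13.7 MiB, ratio: 10.67x
```

The theoretical ratio is 32/3 ≈ 10.67x. The reported 10.6x is slightly lower, which is what one would expect once codebook and rotation overhead are included.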

Architecture

PDF Documents
     |
     v
+--------------------+     +------------------+
| Text Extraction    |     | Visual Encoding  |
| (native + GLM-OCR) |     | (ColEmbed-8B)    |
+---------+----------+     +--------+---------+
          |                          |
          v                          v
+--------------------+     +------------------+
| E5 Text Embeddings |     | Late-Interaction |
| (768-dim, L2-norm) |     | Embeddings       |
+---------+----------+     | (4096-dim, var-  |
          |                | length per doc)  |
          |                +--------+---------+
          |                         |
          v                         v
+--------------------------------------------+
|          TurboQuant MSE 3-bit              |
|  (Beta + Lloyd-Max + Random Rotation)      |
|  10.6x compression, <1% quality loss       |
+---------------------+----------------------+
                      |
                      v
+--------------------------------------------+
|        Hybrid Retrieval                    |
|  Visual: MaxSim + PLAID candidate pruning  |
|  Text: Dense (E5) + Lexical (FTS5)        |
+---------------------+----------------------+
                      |
                      v
+--------------------------------------------+
|   RAG Pipeline: Retrieve -> Compress ->    |
|   Generate (LMStudio) -> Attribute Sources |
+--------------------------------------------+

Methods

TurboQuant (based on arXiv:2504.19874): Normalize vectors to the unit sphere, apply a random orthogonal rotation (RHT/FWHT for power-of-2 dimensions), then scalar-quantize each coordinate with a Lloyd-Max codebook optimized for the Beta distribution that sphere coordinates follow. At 3 bits per coordinate this gives 10.6x compression with theoretical distortion guarantees.
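
The three steps (normalize, rotate, scalar-quantize) can be sketched in NumPy as follows. This is a minimal illustration, not the code in `turboquant_mse.py`. It makes two simplifying assumptions: it uses a dense QR-based rotation instead of RHT/FWHT, and it fits the Lloyd-Max codebook to sampled sphere coordinates rather than to the analytic Beta density. Bit packing is omitted and all function names are hypothetical.

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix makes the rotation uniformly distributed

def sphere_coordinate_samples(dim, n, rng):
    """Coordinates of n random unit vectors (empirical stand-in for the Beta law)."""
    g = rng.standard_normal((n, dim))
    return (g / np.linalg.norm(g, axis=1, keepdims=True)).ravel()

def lloyd_max(samples, bits, iters=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    levels = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2  # decision boundaries
        idx = np.searchsorted(edges, samples)
        for k in range(levels.size):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()
    return levels

def quantize(x, rotation, levels):
    """Return per-coordinate codes plus the row norms needed to undo normalization."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)  # assumes no all-zero rows
    rotated = (x / norms) @ rotation.T
    edges = (levels[:-1] + levels[1:]) / 2
    return np.searchsorted(edges, rotated).astype(np.uint8), norms

def dequantize(codes, norms, rotation, levels):
    """Look up levels, rotate back, and restore the original norms."""
    return (levels[codes] @ rotation) * norms

rng = np.random.default_rng(0)
dim, bits = 64, 3
rotation = random_rotation(dim, rng)
levels = lloyd_max(sphere_coordinate_samples(dim, 2000, rng), bits)

x = rng.standard_normal((200, dim))
codes, norms = quantize(x, rotation, levels)
x_hat = dequantize(codes, norms, rotation, levels)
cosine = np.sum(x * x_hat, axis=1) / (
    np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1)
)
print(f"mean cosine similarity after 3-bit round trip: {cosine.mean():.3f}")
```

The dense rotation costs O(dim²) per vector, which is why a structured transform such as FWHT is attractive at dim=4096.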

MaxSim Retrieval: ColBERT-style late-interaction scoring where each query token finds its best-matching document token. Quantized search operates in the rotated space via lookup tables, avoiding full reconstruction.
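
A minimal NumPy sketch of MaxSim scoring, together with a lookup-table inner product over quantized codes. The function names are illustrative, and the table layout is an assumption; the native backend in `turboquant_lut.c` may organize its tables differently.

```python
import numpy as np

def maxsim(query, doc):
    """query: (n_q, dim), doc: (n_d, dim). Sum over query tokens of the best match."""
    sim = query @ doc.T               # (n_q, n_d) token-to-token similarities
    return float(sim.max(axis=1).sum())

def lut_inner_products(q_rot, codes, levels):
    """Inner products of one rotated query row with quantized rows, no dequantization.

    q_rot: (dim,), codes: (n_rows, dim) integer codes, levels: (n_levels,) codebook.
    """
    lut = q_rot[:, None] * levels[None, :]                    # (dim, n_levels)
    return lut[np.arange(codes.shape[1]), codes].sum(axis=1)  # (n_rows,)

query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
doc_b = np.array([[1.0, 0.0]])
print(maxsim(query, doc_a), maxsim(query, doc_b))  # prints: 2.0 1.0
```

Because an orthogonal rotation preserves inner products, rotating the query once and scoring against codes in the rotated space gives the same similarities as scoring against the reconstructed vectors.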

PLAID Pruning: K-Means clustering of document rows with a centroid-based pre-filter reduces the number of candidate documents before exact MaxSim scoring.
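
The centroid pre-filter can be sketched as below. This is a simplified illustration under stated assumptions: rows are unit-normalized, clustering is a few iterations of spherical K-Means, and a document becomes a candidate if any of its rows falls in one of the `nprobe` centroids closest to any query token. The names `kmeans`, `candidate_docs`, and `nprobe` are hypothetical, and the full PLAID method includes further stages not shown here.

```python
import numpy as np

def kmeans(rows, k, iters=10, seed=0):
    """Spherical K-Means on unit-norm rows; returns centroids and row assignments."""
    rng = np.random.default_rng(seed)
    centroids = rows[rng.choice(len(rows), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(rows @ centroids.T, axis=1)
        for c in range(k):
            members = rows[assign == c]
            if len(members):
                mean = members.mean(axis=0)
                centroids[c] = mean / np.linalg.norm(mean)
    return centroids, np.argmax(rows @ centroids.T, axis=1)

def candidate_docs(query, centroids, assign, row_doc, nprobe=2):
    """Documents owning a row in any of the nprobe best centroids per query token."""
    top = np.argsort(-(query @ centroids.T), axis=1)[:, :nprobe]
    probed = np.unique(top)
    return np.unique(row_doc[np.isin(assign, probed)])

rng = np.random.default_rng(1)
n_docs, rows_per_doc, dim, k = 20, 8, 16, 6
rows = rng.standard_normal((n_docs * rows_per_doc, dim))
rows /= np.linalg.norm(rows, axis=1, keepdims=True)
row_doc = np.repeat(np.arange(n_docs), rows_per_doc)  # owning document of each row

centroids, assign = kmeans(rows, k)
query = rows[:3]  # reuse a few rows as a stand-in query
cands = candidate_docs(query, centroids, assign, row_doc, nprobe=1)
print(f"{len(cands)} of {n_docs} documents kept for exact MaxSim scoring")
```

Only the surviving candidates need exact MaxSim scoring; raising `nprobe` trades speed for recall.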

Hybrid Text Search: Combines dense retrieval (E5 multilingual embeddings) with lexical fallback (SQLite FTS5) for robust passage-level search.

See MATH_BASIS.md for the full mathematical treatment.

Project Structure

turboquant_mse.py              Core quantization: MSE, PROD, pack/unpack, search backends
paper_library.py               Document indexing, retrieval, RAG pipeline
chat_ui.py                     Web UI for conversational search
main.py                        Demo: TF-IDF vs BM25 vs ColEmbed on Flickr30k

benchmark_turboquant.py        Quantization quality/speed benchmark
benchmark_turboquant_compare.py MSE vs PROD comparison benchmark

turboquant_lut.c               Native C backend (LUT, tiled BLAS, fastscan)
turboquant_metal.m             Apple Metal GPU compute shaders

scripts/
  plot_turboquant_results.py   Generate benchmark visualizations
  benchmark_library_quantization.py  Full pipeline benchmark
  eval_retrieval_smoke.py      Retrieval quality smoke test

tests/
  test_turboquant_mse.py       150+ unit tests for quantization pipeline
  test_paper_library.py        Integration tests for indexing and search
  test_chat_ui.py              Chat UI handler tests
  test_turboquant_prod.py      PROD inner-product variant tests

Quick Start

Installation

pip install -r requirements.txt

Run the demo

Compare TF-IDF, BM25, and ColEmbed multimodal retrieval on Flickr30k:

export COLEMBED_PATH=/path/to/colembed-8b
python main.py

Index your own papers

python paper_library.py --papers-dir ./papers --index-dir ./output/my_library

Start the web UI

python chat_ui.py --index-dir ./output/my_library

Then open http://localhost:8765 in your browser.

Run benchmarks

python benchmark_turboquant.py
python scripts/plot_turboquant_results.py

Run tests

PYTHONPATH=. pytest tests/ -q

Benchmarks

Results Dashboard

Distortion-Rate

Empirical MSE falls within the theoretical bounds for all tested bit-widths:

Bits	Empirical MSE	Lower bound	Upper bound
1	0.363	0.250	0.681
2	0.117	0.063	0.170
3	0.035	0.016	0.043
4	0.011	0.004	0.011
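
The bound columns appear consistent with a lower bound of 4^(-b) and an upper bound of (√3·π/2)·4^(-b) for unit-norm vectors at b bits per coordinate. These closed forms are an assumption inferred from the table values and the cited paper; MATH_BASIS.md is the authoritative reference. A short sketch that evaluates them:

```python
import math

def mse_bounds(bits: int) -> tuple[float, float]:
    """Assumed distortion bounds for unit-norm vectors at `bits` bits per coordinate."""
    lower = 4.0 ** -bits
    upper = math.sqrt(3) * math.pi / 2 * 4.0 ** -bits
    return lower, upper

for b in (1, 2, 3, 4):
    lower, upper = mse_bounds(b)
    print(f"{b} bits: lower={lower:.4f} upper={upper:.4f}")
```

Each extra bit divides both bounds by 4, which matches the roughly 3x to 4x drop per bit in the empirical column.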

Compression vs Quality

Backend Performance

Search Latency

Three search modes are available, each suited to a different use case:

Mode	Search time (50 queries)	Per-query	Storage	Use case
FP32	0.36 s	7.2 ms	145.8 MiB	Baseline, no compression
MSE 3-bit (warm)	0.32 s	6.5 ms	13.7 MiB	Long-running service (reconstruct once at startup)
MSE 3-bit (direct)	2.04 s	40.8 ms	13.7 MiB	Cold one-shot query, no reconstruction step

MSE 3-bit (warm) reconstructs quantized vectors into memory once (1.09 s startup cost), then searches on dense vectors. After warm-up it is faster than FP32 because the smaller index footprint improves CPU cache utilization.

End-to-End Timing

Stage	Time
Query encoding (ColEmbed-8B, 50 queries)	85.2 s
MSE quantization (9332 rows, dim=4096)	2.2 s
One-time dequant warm-up	1.1 s
Warm search (50 queries)	0.32 s

Query encoding by the ColEmbed-8B model is the dominant bottleneck. Quantization and search are no longer on the critical path after RHT/FWHT optimization.

Environment Variables

Variable	Default	Description
`COLEMBED_PATH`	`colembed-8b`	Path to ColEmbed-8B model weights
`TEXT_EMBED_MODEL`	`intfloat/multilingual-e5-base`	Text embedding model
`HF_CACHE`	`.hf-cache`	Hugging Face cache directory
`GLMOCR_BIN`	`.venv-glmocr-sdk/bin/glmocr`	Path to GLM-OCR binary
`LMSTUDIO_BASE_URL`	`http://127.0.0.1:1234`	LMStudio API endpoint
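
One way to resolve these variables with the defaults from the table is shown below. How the project itself reads them is not shown in this README, so `DEFAULTS` and `load_settings` are hypothetical names used only for illustration.

```python
import os

# Defaults copied from the Environment Variables table.
DEFAULTS = {
    "COLEMBED_PATH": "colembed-8b",
    "TEXT_EMBED_MODEL": "intfloat/multilingual-e5-base",
    "HF_CACHE": ".hf-cache",
    "GLMOCR_BIN": ".venv-glmocr-sdk/bin/glmocr",
    "LMSTUDIO_BASE_URL": "http://127.0.0.1:1234",
}

def load_settings(env=None):
    """Return each setting from `env` (default: os.environ), else its default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}

settings = load_settings()
for name, value in settings.items():
    print(f"{name}={value}")
```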

License

MIT
