Production-grade Retrieval-Augmented Generation API for long-document QA.
Combines FAISS dense search and BM25 keyword search fused with Reciprocal Rank Fusion, powered by GPT-4o.
Architecture · Tech Stack · Quickstart · API · Benchmarks
┌─────────────────────────────────────────────────────────────────────────┐
│                           INGESTION PIPELINE                            │
│                                                                         │
│  PDF/DOCX/TXT ──► loader.py ──► chunker.py ──► embedder.py              │
│                   (PageRecord)  (SemanticChunker  (OpenAI               │
│                                  95th-pct split)   text-emb-3-small)    │
│                                    │                                    │
│                       ┌────────────┴────────────┐                       │
│                       ▼                         ▼                       │
│                vector_store.py            bm25_store.py                 │
│              (FAISS IndexFlatIP)           (BM25Okapi)                  │
│              data/vector_store/         data/bm25_store/                │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                             QUERY PIPELINE                              │
│                                                                         │
│  Question ──► get_query_embedding()                                     │
│                         │  (one API call)                               │
│              ┌──────────┴──────────┐                                    │
│              ▼                     ▼                                    │
│         VectorStore            BM25Store                                │
│        .search(k=10)         .search(k=10)                              │
│         cosine ANN           token TF-IDF                               │
│              └──────────┬──────────┘                                    │
│                         ▼                                               │
│             reciprocal_rank_fusion()                                    │
│          score(d) = Σ 1/(rank_i(d) + 60)                                │
│                         │                                               │
│                         ▼                                               │
│                top-5 fused chunks                                       │
│                         │                                               │
│                         ▼                                               │
│              RAGGenerator.generate()                                    │
│     GPT-4o · temp=0 · strict citation prompt                            │
│                         │                                               │
│                         ▼                                               │
│  {"answer": "...", "sources": [...], "confidence_score": 0.94}          │
└─────────────────────────────────────────────────────────────────────────┘
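The fusion step in the query pipeline can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion with the constant 60 shown in the diagram, not the repository's actual `reciprocal_rank_fusion()`; the signature, the 1-based ranks, the tie-breaking rule, and the chunk ids are all assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse ranked lists of chunk ids: score(d) = sum of 1 / (rank_i(d) + k)."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are assumed to be 1-based: the best hit in a list has rank 1.
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (rank + k)
    # Highest fused score first; ties fall back to chunk id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


dense = ["c3", "c1", "c7"]   # hypothetical FAISS result order
sparse = ["c1", "c9", "c3"]  # hypothetical BM25 result order
fused = reciprocal_rank_fusion([dense, sparse])
print([chunk_id for chunk_id, _ in fused])
```

Chunks that appear in both lists (`c1`, `c3`) accumulate two terms and rise above chunks that only one retriever returned, which is the behaviour the hybrid engine relies on.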
| Layer | Library | Version |
|---|---|---|
| Embedding | OpenAI text-embedding-3-small | openai 1.75.0 |
| Generation | GPT-4o (temp=0) | langchain-openai 1.1.11 |
| Dense index | FAISS IndexFlatIP / IVFFlat | faiss-cpu 1.9.0 |
| Sparse index | BM25Okapi (k1=1.5, b=0.75) | rank-bm25 0.2.2 |
| Chunking | SemanticChunker (95th-pct) | langchain-experimental 0.3.4 |
| Chain | LCEL RunnableSequence | langchain-core 1.2.18 |
| API | FastAPI + uvicorn | 0.115.14 / 0.34.3 |
| Validation | Pydantic v2 | 2.11.4 |
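To make the `k1=1.5, b=0.75` settings in the sparse-index row concrete, here is a dependency-free sketch of Okapi BM25 scoring. It uses a common textbook IDF form and whitespace tokenization; `rank-bm25`'s `BM25Okapi` may differ in its IDF details, and the function name and toy corpus below are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in a non-empty corpus against the query (textbook BM25)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # b controls length normalization, k1 controls term-frequency saturation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "ebitda margin improved in q3".split(),
    "operating profit rose on cost cuts".split(),
    "crispr-cas9 editing results".split(),
]
scores = bm25_scores("ebitda margin".split(), corpus)
print(scores)
```

Only the first document shares tokens with the query, so it is the only one with a non-zero score. The second document is semantically close but scores zero, which is why the sparse index is paired with a dense one.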
git clone https://github.com/im-anishraj/Hybrid-Search-RAG-Engine.git
cd Hybrid-Search-RAG-Engine
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env and set OPENAI_API_KEY=sk-...
uvicorn app.main:app --reload --port 8000
OpenAPI docs are available at http://localhost:8000/docs.
cp .env.example .env # fill in OPENAI_API_KEY
docker compose up --build -d
docker compose logs -f
Upload a PDF, DOCX, or TXT file. Ingestion is additive: each call adds to the same corpus without replacing previously ingested documents.
curl -X POST http://localhost:8000/ingest \
-F "file=@annual_report.pdf"
Response:
{
"doc_id": "annual_report.pdf",
"filename": "annual_report.pdf",
"pages_loaded": 47,
"chunks_indexed": 112,
"message": "'annual_report.pdf' ingested successfully. 47 pages → 112 chunks indexed."
}
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What was the EBITDA margin in Q3 2023?"}'
Response:
{
"question": "What was the EBITDA margin in Q3 2023?",
"answer": "The EBITDA margin in Q3 2023 was 18.5%, driven by operational efficiency improvements [Source: annual_report.pdf, page 15].",
"sources": [
{"filename": "annual_report.pdf", "page_num": 15, "chunk_id": "annual_report.pdf::p15::c2"}
],
"confidence_score": 0.94,
"can_answer": true,
"model": "gpt-4o",
"retrieved_chunks": []
}
Hit Rate@5 on a 20-query synthetic corpus (30 chunks, 5 topics):
| Engine | Hits @ 5 | Hit Rate @ 5 |
|---|---|---|
| Hybrid (RRF) | 19 / 20 | 95% |
| BM25 (Sparse only) | 18 / 20 | 90% |
| FAISS (Dense only) | 17 / 20 | 85% |
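The extra hit that hybrid scores over BM25 alone shows what RRF adds: FAISS retrieves a relevant chunk through semantic similarity that BM25 misses because the query and the chunk share no terms, and fusion keeps that chunk in the top 5. RRF also promotes chunks that both retrievers rank highly.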
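For readers who want to reproduce the metric on their own data, Hit Rate@k can be computed as below. This is a sketch under assumptions: it supposes one relevant chunk per query, and the query ids, chunk ids, and function name are invented for illustration rather than taken from `tests/test_retrieval.py`.

```python
def hit_rate_at_k(results, relevant, k=5):
    """Return (hits, hit_rate): a query is a hit if its relevant chunk is in the top k."""
    hits = sum(1 for query_id, ranked in results.items() if relevant[query_id] in ranked[:k])
    return hits, hits / len(results)


# Toy data, not the benchmark corpus.
results = {
    "q1": ["a", "b"],
    "q2": ["c"],
    "q3": ["x", "y", "z", "w", "v", "t"],  # relevant chunk sits at rank 6
    "q4": [],
}
relevant = {"q1": "b", "q2": "d", "q3": "t", "q4": "k"}
hits, rate = hit_rate_at_k(results, relevant)
print(f"{hits} / {len(results)} = {rate:.0%}")
```

With the same definition, 19 hits out of 20 queries gives the 95% reported for the hybrid engine.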
Run the benchmark locally:
pytest tests/test_retrieval.py -v -s
- Why --workers 1? FAISS's C++ index is not fork-safe. Scale horizontally with multiple containers behind a load balancer instead of multiple workers per container.
- Why semantic chunking? Fixed 512-token windows slice sentences mid-thought. SemanticChunker detects topic-shift boundaries via cosine distance spikes, producing chunks that each hold one complete idea. The LLM receives coherent passages, not sentence fragments.
- Why BM25 alongside FAISS? Dense embeddings smear rare tokens. "EBITDA" and "CRISPR-Cas9" map to broad semantic regions shared by adjacent-but-wrong terms. BM25's exact token matching catches them precisely.
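The semantic-chunking rationale above (split where the cosine distance between neighbouring sentences spikes past the 95th percentile) can be sketched without any dependencies. This is a simplified illustration, not LangChain's `SemanticChunker`: it skips sentence-window buffering, takes precomputed embeddings as input, and the function names and 2-D toy vectors are assumptions.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linear-interpolated percentile of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def split_on_distance_spikes(sentences, embeddings, pct=95):
    """Start a new chunk wherever the distance to the previous sentence exceeds the threshold."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(sentences) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # topic shift: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["Revenue grew.", "Margins widened.", "The lab uses CRISPR.", "Editing succeeded."]
embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]  # toy vectors
chunks = split_on_distance_spikes(sentences, embeddings)
print(chunks)
```

The only large distance falls between the second and third sentences, so the text splits into one finance chunk and one biology chunk instead of being cut at a fixed token count.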