feat: scoped retrieval for monorepos (fix low recall on large repos) #61

@rajkumarsakthivel

Problem

Recall@10 drops to 0.07 on monorepos (benchmarked on fiber, 396 files, 4,382 chunks). With the whole repo as the search space, the retriever's top-10 is diluted across unrelated packages instead of surfacing chunks from the specific file the query targets.

Proposed fix: scoped retrieval

When a query mentions or implies a subdirectory, scope the search to that subdirectory first. Fall back to full-repo search if scoped results are insufficient.

Approach

  1. Query-time scope detection: extract subdirectory hints from the query (e.g., "how does middleware/logger work?" → scope to middleware/)
  2. File-path prefix filter on vector search: filter chunks by file_path LIKE 'middleware/%' before ranking
  3. Fallback: if the scoped search returns fewer than top_k results above the confidence threshold, expand to the full repo
  4. Optional explicit scope: allow context_search("logger", scope="middleware/") as an MCP tool parameter
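
A minimal sketch of steps 1 to 4, to make the control flow concrete. Everything here is hypothetical rather than the project's actual API: `Chunk`, `detect_scope`, `scoped_search`, `known_dirs`, and `min_score` are illustrative names, an in-memory list with precomputed scores stands in for the vector store, and `str.startswith` stands in for the `file_path LIKE 'middleware/%'` filter.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    file_path: str
    score: float  # similarity to the query; assumed already computed by the vector search


def detect_scope(query: str, known_dirs: set) -> Optional[str]:
    """Step 1: return a prefix such as 'middleware/' if the query names a known directory."""
    tokens = re.findall(r"[\w.\-]+(?:/[\w.\-]+)*", query)
    # Try longer (more path-like) tokens first.
    for token in sorted(tokens, key=len, reverse=True):
        parts = token.rstrip(".").split("/")
        # Walk from the full token down to its first path segment.
        for end in range(len(parts), 0, -1):
            candidate = "/".join(parts[:end])
            if candidate in known_dirs:
                return candidate + "/"
    return None


def scoped_search(query, chunks, known_dirs, top_k=10, min_score=0.5, scope=None):
    """Steps 2 to 4: prefix-filtered search with a full-repo fallback."""
    # Step 4: an explicit scope wins over query-time detection.
    scope = scope or detect_scope(query, known_dirs)
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    if scope:
        # Step 2: keep only chunks under the scope that clear the threshold.
        scoped = [
            c for c in ranked
            if c.file_path.startswith(scope) and c.score >= min_score
        ]
        # Step 3: accept the scoped result only if it fills top_k.
        if len(scoped) >= top_k:
            return scoped[:top_k]
    # No scope detected, or too few confident scoped hits: search the full repo.
    return ranked[:top_k]
```

With `known_dirs = {"middleware", "client"}`, the query "how does middleware/logger work?" resolves to the scope `middleware/`, while a bare "logger" resolves to no scope and goes straight to the full-repo ranking. One open design choice the sketch glosses over: on fallback it discards the scoped hits entirely, whereas a real implementation might prefer to merge them into the full-repo results.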

Expected impact

  • Monorepo recall should jump from 0.07 to 0.5+ (most queries target a specific package/directory)
  • No expected impact on single-repo performance (when scope detection returns nothing, the query falls through to full search)

Benchmark plan

Re-run fiber benchmark with scoped retrieval. Target: Recall@10 > 0.50.
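
For reference when comparing runs, a sketch of how Recall@10 could be computed. It assumes the usual definition (fraction of relevant items that appear in the top k, averaged over queries); the benchmark harness may define it differently, and `recall_at_k` and `mean_recall_at_k` are hypothetical helper names.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant items found in the top-k retrieved items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(cases, k=10):
    """Average recall over (retrieved, relevant) pairs, one pair per query."""
    if not cases:
        return 0.0
    return sum(recall_at_k(ret, rel, k) for ret, rel in cases) / len(cases)
```

Running the same query set with and without scoping through `mean_recall_at_k` would give the before/after numbers to check against the 0.50 target.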

Related

  • fiber benchmark: benchmarks/results/fiber.md (Recall@10 = 0.07)
  • chi benchmark: benchmarks/results/chi.md (Recall@10 = 0.67, less affected)
  • PR #36 (benchmark: add Go benchmarks for chi and fiber) analysis: "Scoping retrieval to a subdirectory based on query context would address this"
