feat: optimize vector search with HyDE, semantic reranking, and structured embeddings #231
…tured embeddings
- Replace fake 'AI reranking' (keyword substring matching) with true LLM-based semantic reranking following the gist reranking pattern
- Add HyDE (Hypothetical Document Embedding) query preprocessing for better recall on short/Chinese/ambiguous queries, with 5s timeout and AbortSignal cleanup
- Restructure buildEmbeddingText with field labels (Repository:, Description:, Topics:, etc.) and dedup logic
- Add lightweight keyword boost for vector search results (name/description/tag exact match bonus)
- Make search threshold, topK, HyDE toggle, and reranking toggle user-configurable in VectorSearchSettings UI
- Add EMBEDDING_FORMAT_VERSION tracking to force re-index when embedding text format changes
- Fix greedy regex in both gist and semantic reranking JSON parsing (.* → .*?)
- Fix variable shadowing of translation function t in keyword boost lambdas
- Bump default threshold from 0.3 to 0.35
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
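The structured embedding text described in the commit message could look roughly like the sketch below. It is not the PR's actual `buildEmbeddingText`: the `EmbeddableRepo` shape, the `Summary:` label, and the containment-based dedup rule are assumptions made for illustration; only the `Repository:`, `Description:`, and `Topics:` labels come from the commit message.

```typescript
// Hypothetical input shape; the real repository type in the PR may differ.
interface EmbeddableRepo {
  full_name: string;
  description?: string | null;
  summary?: string | null; // assumed AI-generated summary field
  topics?: string[];
}

// Build one labeled line per field so the embedding model can tell fields apart.
function buildEmbeddingTextSketch(repo: EmbeddableRepo): string {
  const lines: string[] = [`Repository: ${repo.full_name}`];
  const description = (repo.description ?? '').trim();
  const summary = (repo.summary ?? '').trim();

  if (description) lines.push(`Description: ${description}`);

  // Assumed dedup rule: drop the summary when it and the description
  // contain one another (case-insensitive).
  const d = description.toLowerCase();
  const s = summary.toLowerCase();
  const overlaps = d !== '' && s !== '' && (d.includes(s) || s.includes(d));
  if (summary && !overlaps) lines.push(`Summary: ${summary}`);

  // Remove empty and duplicate topics before joining.
  const topics = Array.from(
    new Set((repo.topics ?? []).map(t => t.trim()).filter(t => t !== '')),
  );
  if (topics.length > 0) lines.push(`Topics: ${topics.join(', ')}`);

  return lines.join('\n');
}
```

Any change to the labels or the dedup rule changes the embedded text, which is the situation the `EMBEDDING_FORMAT_VERSION` bump is meant to cover.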
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info: configuration used: defaults | review profile: CHILL | plan: Pro
📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 Walkthrough
Adds HyDE preprocessing and semantic reranking to AI search, introduces embedding format version 2 with forced reindexing support, exposes new vector search controls in settings, and raises the default similarity threshold to 0.35 in the worker and search service.
Changes: Vector Search Enhancements
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request introduces advanced search enhancements, including HyDE (Hypothetical Document Embedding) query preprocessing, lightweight keyword boosting, and LLM-based semantic reranking, along with a structured embedding text format and new search configuration settings. The review feedback highlights three key issues: first, the embeddingFormatVersion is not updated in the store after indexing, which breaks incremental indexing by always forcing a full rebuild; second, a potential unhandled promise rejection could occur if the HyDE query is aborted after timing out; and third, a defensive check is needed for r.metadata during keyword boosting to prevent potential runtime crashes.
    formatVersion: vectorSearchConfig.embeddingFormatVersion,
    currentFormatVersion: EMBEDDING_FORMAT_VERSION,
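Problem: After handleRebuildIndex or handleIncrementalIndex completes successfully, embeddingFormatVersion in the store is not updated to the latest EMBEDDING_FORMAT_VERSION (2).
Consequence: vectorSearchConfig.embeddingFormatVersion therefore stays at the old value (for example, 1). The next time the user runs an incremental index, the system still detects a version mismatch (options.formatVersion < options.currentFormatVersion is always true) and is forced to fully reindex every repository. This completely defeats incremental indexing and wastes a large amount of API tokens and time.
Suggested fix: In the success paths of handleRebuildIndex and handleIncrementalIndex, call setVectorSearchConfig({ embeddingFormatVersion: EMBEDDING_FORMAT_VERSION }) to update the version in the store. Also remember to add setVectorSearchConfig to the dependency arrays of both useCallback hooks.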
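The version check and the "persist after success" step can be sketched as follows. This is a minimal illustration, not the PR's code: `VersionedConfig`, `indexAndPersistVersion`, `runIndex`, and `setConfig` are hypothetical names standing in for the store and handlers, and treating a missing stored version as v1 follows the separate suggestion made for `src/services/vectorSearchService.ts` later in this review.

```typescript
const EMBEDDING_FORMAT_VERSION = 2; // value stated in the review comment

// Hypothetical slice of the stored vector search config.
interface VersionedConfig {
  embeddingFormatVersion?: number; // may be absent in configs saved before versioning
}

// True when the stored vectors were built with an older embedding text format.
function formatVersionChanged(config: VersionedConfig): boolean {
  const stored = config.embeddingFormatVersion ?? 1; // treat a missing version as v1
  return stored < EMBEDDING_FORMAT_VERSION;
}

// Run an indexing job and persist the format version only after it succeeds.
async function indexAndPersistVersion(
  config: VersionedConfig,
  runIndex: (fullRebuild: boolean) => Promise<void>,
  setConfig: (patch: Partial<VersionedConfig>) => void,
): Promise<boolean> {
  const fullRebuild = formatVersionChanged(config);
  await runIndex(fullRebuild); // if this throws, the stored version is left unchanged
  setConfig({ embeddingFormatVersion: EMBEDDING_FORMAT_VERSION });
  return fullRebuild;
}
```

With the version persisted, a second run against the updated config sees `formatVersionChanged` return `false` and can stay incremental.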
    embeddingQuery = await Promise.race([
      hydeService.generateHyDEQuery(searchQuery, hydeAbort.signal),
      new Promise<string>((resolve) => {
        hydeTimer = setTimeout(() => {
          hydeAbort.abort();
          resolve(searchQuery);
        }, 5000);
      }),
    ]);
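If the timeout fires, hydeAbort.abort() is called, which aborts the underlying generateHyDEQuery request and makes it throw an AbortError. Because Promise.race has already resolved via the timeout by then, this asynchronously thrown AbortError cannot be caught by the outer try-catch, and the browser reports an "Unhandled Promise Rejection" warning.
To handle this gracefully, append .catch(() => searchQuery) directly to the generateHyDEQuery call. Whether the call fails normally or is aborted, it then falls back safely to the original searchQuery, and no exception goes uncaught.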
Suggested change:
    embeddingQuery = await Promise.race([
      hydeService.generateHyDEQuery(searchQuery, hydeAbort.signal).catch(() => searchQuery),
      new Promise<string>((resolve) => {
        hydeTimer = setTimeout(() => {
          hydeAbort.abort();
          resolve(searchQuery);
        }, 5000);
      }),
    ]);
    const boostedResults = vectorResults.map(r => {
      let bonus = 0;
      const name = (r.metadata.full_name || '').toLowerCase();
      const desc = (r.metadata.description || '').toLowerCase();
      const tags = (r.metadata.tags || []).map(tag => tag.toLowerCase());
      if (name.includes(queryLower)) bonus += 0.05;
      if (desc.includes(queryLower)) bonus += 0.03;
      if (tags.some(tag => tag.includes(queryLower))) bonus += 0.02;
      return { ...r, score: r.score + bonus };
    });
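The keyword boost over vector search results accesses properties of r.metadata directly (for example, r.metadata.full_name). If r.metadata is null or undefined because of inconsistent historical index data or an abnormal Worker response, this throws a TypeError and crashes the entire search feature.
Add a defensive check here so that properties are only read when r.metadata exists.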
Suggested change:
    const boostedResults = vectorResults.map(r => {
      if (!r.metadata) return r;
      let bonus = 0;
      const name = (r.metadata.full_name || '').toLowerCase();
      const desc = (r.metadata.description || '').toLowerCase();
      const tags = (r.metadata.tags || []).map(tag => tag.toLowerCase());
      if (name.includes(queryLower)) bonus += 0.05;
      if (desc.includes(queryLower)) bonus += 0.03;
      if (tags.some(tag => tag.includes(queryLower))) bonus += 0.02;
      return { ...r, score: r.score + bonus };
    });
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/components/SearchBar.tsx (1)
609-632: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Preserve the LLM rerank order after filtering.
Line 627 calls applyFilters, which sorts by the active UI sort before rerankSucceeded is checked, so successful semantic reranking is effectively overwritten. Reapply an LLM order map after filtering, and cache that order for the replay path that currently sorts only by vector score.
Proposed fix direction
    + const rerankOrder = rerankSucceeded
    +   ? new Map(reranked.map((repo, index) => [String(repo.id), index]))
    +   : null;
      const finalFiltered = applyFilters([...reranked]);
    - if (!rerankSucceeded) {
    + if (rerankOrder) {
    +   finalFiltered.sort(
    +     (a, b) => (rerankOrder.get(String(a.id)) ?? Number.MAX_SAFE_INTEGER)
    +       - (rerankOrder.get(String(b.id)) ?? Number.MAX_SAFE_INTEGER)
    +   );
    + } else {
        finalFiltered.sort((a, b) => (scoreMap.get(String(b.id)) ?? 0) - (scoreMap.get(String(a.id)) ?? 0));
      }
Also extend vectorScoreMapRef to store the rerank order and use it in the cached re-filter path.
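The idea of reapplying the LLM order after a filter step can be shown in isolation. This is a minimal sketch: `restoreRerankOrder` and `RankedRepo` are hypothetical names, and the "filter" in the usage note merely imitates an `applyFilters` call that drops items and re-sorts by another field.

```typescript
// Hypothetical minimal repository shape for the example.
interface RankedRepo {
  id: number | string;
  stars: number;
}

// Re-sort filtered items into the order they had in the reranked list.
// Items that were not part of the reranked list sink to the end.
function restoreRerankOrder<T extends { id: number | string }>(
  reranked: T[],
  filtered: T[],
): T[] {
  const order = new Map<string, number>(
    reranked.map((repo, index): [string, number] => [String(repo.id), index]),
  );
  return [...filtered].sort(
    (a, b) =>
      (order.get(String(a.id)) ?? Number.MAX_SAFE_INTEGER) -
      (order.get(String(b.id)) ?? Number.MAX_SAFE_INTEGER),
  );
}
```

For example, if the LLM order is ids 3, 1, 2 and the filter step removes id 1 and sorts the rest by stars, calling `restoreRerankOrder(reranked, filtered)` brings the survivors back to 3, 2. Caching the same map alongside the vector scores would let the cached re-filter path restore the order without another LLM call.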
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/components/settings/VectorSearchSettings.tsx`:
- Around line 366-367: The reindex flow in VectorSearchSettings is leaving
format-version upgrades unreachable and not updating the stored embedding
version after a successful run. Adjust the incremental/reindex gating logic so a
pure embedding-format change can trigger a one-shot reindex even when
unindexedCount is 0, and in the same success path persist embeddingFormatVersion
to EMBEDDING_FORMAT_VERSION alongside the existing VectorSearchConfig updates.
Focus the fix around the reindex trigger and save logic that uses
vectorSearchConfig and EMBEDDING_FORMAT_VERSION.
- Around line 948-1035: The new vector search controls need proper accessibility
wiring: in VectorSearchSettings, associate the “Similarity Threshold” and “Top
K” inputs with their labels using stable label/input linkage, and make the HyDE
and Reranking toggle buttons expose an accessible name plus switch state. Update
the controls around formSearchThreshold, formSearchTopK, formEnableHyDE, and
formEnableReranking so screen readers can announce each control’s purpose and
on/off status, using appropriate accessible attributes on the toggle buttons and
label associations for the inputs.
In `@src/services/aiService.ts`:
- Around line 656-686: The semantic reranking flow in
searchRepositoriesWithSemanticReranking should accept and forward an AbortSignal
so SearchBar can cancel or time out the LLM request. Update the method signature
to include signal, then pass that signal through to this.requestText alongside
system, user, temperature, and maxTokens. Make sure any call sites of
searchRepositoriesWithSemanticReranking are updated to supply the existing
signal from the search flow.
In `@src/services/vectorSearchService.ts`:
- Around line 510-512: The format version check in vectorSearchService should
treat absent stored versions as v1 so older vectors still get reindexed. Update
the format-change logic around formatVersionChanged to default a missing
options.formatVersion to 1 before comparing against
options.currentFormatVersion, while keeping the existing behavior when both
versions are present. Use the formatVersionChanged condition and the surrounding
reindex decision in vectorSearchService to locate and adjust the comparison.
---
Outside diff comments:
In `@src/components/SearchBar.tsx`:
- Around line 609-632: Preserve the semantic rerank order in SearchBar’s
reranking flow: after `AIService.searchRepositoriesWithSemanticReranking`
succeeds, `applyFilters([...reranked])` is re-sorting the results by the active
UI sort and wiping out the LLM order. Update the `SearchBar` search pipeline to
retain a rerank order map (built from the `reranked` array) and reapply that
order after `applyFilters`, while keeping the existing vector-score fallback
only when `rerankSucceeded` is false. Also extend `vectorScoreMapRef` to cache
the rerank order so the replay/cached re-filter path can restore the same LLM
ordering.
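The AbortSignal forwarding requested for `src/services/aiService.ts` can be sketched as below. This is an illustration only: `rerankWithSignal`, `RequestOptions`, and the prompt strings are assumptions, and the injected `requestText` function merely stands in for the service's own request method, whose real signature is not shown in this PR page.

```typescript
// Hypothetical request options; the real requestText also takes temperature and maxTokens.
interface RequestOptions {
  system: string;
  user: string;
  signal?: AbortSignal;
}
type RequestText = (options: RequestOptions) => Promise<string>;

// Rerank candidate ids with an LLM call that the caller can cancel.
async function rerankWithSignal(
  requestText: RequestText,
  query: string,
  candidateIds: string[],
  signal?: AbortSignal,
): Promise<string[]> {
  if (signal?.aborted) return candidateIds; // already cancelled: keep the input order
  try {
    const raw = await requestText({
      system: 'Return a JSON array of candidate ids, most relevant first.',
      user: JSON.stringify({ query, candidateIds }),
      signal, // forwarded so the underlying request can be aborted
    });
    const parsed: unknown = JSON.parse(raw);
    if (!Array.isArray(parsed)) return candidateIds;
    const known = new Set(candidateIds);
    const ranked = parsed.filter(
      (id): id is string => typeof id === 'string' && known.has(id),
    );
    return ranked.length > 0 ? ranked : candidateIds;
  } catch {
    return candidateIds; // abort, network error, or invalid JSON: fall back to input order
  }
}
```

Passing the search flow's existing signal as the last argument lets a new search or a timeout cancel the rerank request instead of waiting for it to finish.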
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3629195f-1a26-44b1-8c0a-dadcd735bb48
📒 Files selected for processing (7)
- cloudflare-worker/src/index.ts
- src/components/SearchBar.tsx
- src/components/settings/VectorSearchSettings.tsx
- src/services/aiService.ts
- src/services/vectorSearchService.ts
- src/store/useAppStore.ts
- src/types/index.ts
- Fix HyDE AbortError: catch abort on generateHyDEQuery to prevent unhandled rejection
- Fix r.metadata null safety: use optional chaining for keyword boost
- Fix formatVersionChanged: default missing version to v1 so older vectors get reindexed
- Fix applyFilters overwriting LLM rerank order: preserve rerankOrder map and reapply after filtering
- Fix signal forwarding: pass AbortSignal through semantic reranking to requestText
- Fix incremental index button: keep enabled when format version needs upgrade
- Fix embeddingFormatVersion not persisted: update store after both rebuild and incremental index
- Add accessibility: htmlFor/id for inputs, role/aria-checked for toggle switches
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
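Summary
Optimizes 5 key stages of the vector search pipeline to significantly improve search match accuracy.
Core changes
- … Repository:, Description:, Topics:, etc.), which helps the embedding model understand each field's role and automatically deduplicates overlapping description/summary content
Bug fixes
- … (/\[[\s\S]*\]/ → /\[[\s\S]*?\]/), so parsing no longer fails when the LLM output contains several bracketed segments
- … the issue of t shadowing the translation function
- … EMBEDDING_FORMAT_VERSION tracking: when the embedding text format changes, incremental indexing automatically triggers a full rebuild, which avoids an index of mixed-format vectors
Scope of impact
- src/services/aiService.ts: … generateHyDEQuery(), searchRepositoriesWithSemanticReranking()
- src/services/vectorSearchService.ts: … buildEmbeddingText(), adds EMBEDDING_FORMAT_VERSION, adjusts the default threshold
- src/components/SearchBar.tsx
- src/components/settings/VectorSearchSettings.tsx
- src/types/index.ts
- src/store/useAppStore.ts
- cloudflare-worker/src/index.ts
Deployment notes
After the first deployment, it is recommended to trigger "Rebuild vector index" once (not the incremental index) so that all vectors use the new structured text format.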
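The effect of the regex change listed under the bug fixes can be reproduced with a short example. The two patterns are the ones quoted in the summary; the `extractJsonArray` helper and the sample LLM output are made up for illustration.

```typescript
const GREEDY = /\[[\s\S]*\]/; // old pattern: runs to the LAST "]"
const LAZY = /\[[\s\S]*?\]/;  // new pattern: stops at the FIRST "]"

// Extract the first bracketed span matched by the pattern and parse it as a JSON array.
function extractJsonArray(text: string, pattern: RegExp): unknown[] | null {
  const match = text.match(pattern);
  if (!match) return null;
  try {
    const parsed: unknown = JSON.parse(match[0]);
    return Array.isArray(parsed) ? parsed : null;
  } catch {
    return null; // the matched span was not valid JSON
  }
}

// Made-up LLM output with two bracketed segments.
const llmOutput = 'Ranking: [3, 1, 2]. Notes: [repo 2 is only loosely related]';

// Greedy: the span covers both segments and the text between them, so JSON.parse fails.
const greedyResult = extractJsonArray(llmOutput, GREEDY);
// Lazy: the span is just "[3, 1, 2]", which parses.
const lazyResult = extractJsonArray(llmOutput, LAZY);
```

One trade-off worth keeping in mind: the lazy pattern stops at the first closing bracket, so a nested array such as `[[1,2],[3]]` would be cut short and fail to parse. That is fine as long as the reranking prompt asks for a flat array.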
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes