feat: optimize vector search with HyDE, semantic reranking, and structured embeddings #231

Merged
AmintaCCCP merged 2 commits into main from feature/optimize-vector-search on Jun 27, 2026

AmintaCCCP (Owner) commented Jun 27, 2026

Summary

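Optimizes five key stages of the vector search pipeline to significantly improve search match precision.

Core changes

  1. True semantic reranking: replaces the previous "fake AI reranking" (the LLM extracted keywords and results were then filtered by substring matching, which actually degraded vector search quality) with the pattern used by gist reranking: candidate repository summaries are sent to the LLM, which returns a list of IDs ordered by semantic relevance.
  2. HyDE query preprocessing: generates an "ideal repository description" from the user's query and embeds that instead. Recall improves significantly for short queries, Chinese queries, and conceptual queries (such as "something like Notion"). Includes a 5-second timeout and AbortSignal cleanup.
  3. Structured embedding text: adds semantic labels (Repository:, Description:, Topics:, and so on) to buildEmbeddingText so the embedding model can tell each field's role, and automatically deduplicates overlapping description/summary content.
  4. Lightweight keyword boost: after vector search results are returned, exact matches on name/description/tags receive a small score bonus.
  5. Configurable search parameters: the similarity threshold, TopK, HyDE toggle, and reranking toggle can all be adjusted in the UI.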
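
As an illustration of the structured embedding text in item 3, here is a minimal sketch, not the project's actual buildEmbeddingText. The `RepoLike` shape, the `Summary:` label, and the containment-based dedup rule are assumptions made for this example; only the `Repository:`, `Description:`, and `Topics:` labels come from the description above.

```typescript
// Hypothetical repository shape; the real project's type may differ.
interface RepoLike {
  fullName: string;
  description?: string;
  summary?: string;
  topics?: string[];
}

// Builds a labeled embedding text. The summary is skipped when it merely
// repeats the description (assumed rule: one text contains the other).
function buildLabeledEmbeddingText(repo: RepoLike): string {
  const lines: string[] = [`Repository: ${repo.fullName}`];
  const description = repo.description?.trim() ?? "";
  const summary = repo.summary?.trim() ?? "";

  if (description) lines.push(`Description: ${description}`);

  const d = description.toLowerCase();
  const s = summary.toLowerCase();
  const overlaps = d !== "" && s !== "" && (d.includes(s) || s.includes(d));
  if (summary && !overlaps) lines.push(`Summary: ${summary}`); // "Summary:" label is assumed

  if (repo.topics && repo.topics.length > 0) {
    lines.push(`Topics: ${repo.topics.join(", ")}`);
  }
  return lines.join("\n");
}
```

Labeling each field keeps the text sent to the embedding model unambiguous about which part is the name, which is free-form prose, and which is a tag list.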

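Bug fixes

  • Fixed the greedy regex used for JSON parsing in both gist reranking and semantic reranking (/\[[\s\S]*\]/ → /\[[\s\S]*?\]/), so parsing no longer fails when the LLM output contains multiple bracketed segments.
  • Fixed the variable t shadowing the translation function in the keyword boost.
  • Added EMBEDDING_FORMAT_VERSION tracking: when the embedding text format changes, incremental indexing automatically triggers a full rebuild, which avoids a vector index with mixed formats.
  • After a HyDE timeout, the underlying HTTP request is now aborted correctly and the timer is cleared.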
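
To make the regex fix concrete, the sketch below contrasts the greedy and lazy patterns on an LLM reply that contains two bracketed segments. The helper name `extractFirstJsonArray` and the sample reply are hypothetical; only the two regular expressions come from the PR. Note that the lazy pattern stops at the first closing bracket, so it suits flat ID lists and would cut a nested array short.

```typescript
// Extracts the first bracketed JSON array from free-form LLM output.
function extractFirstJsonArray(text: string): unknown[] | null {
  const match = text.match(/\[[\s\S]*?\]/); // lazy: stops at the first "]"
  if (!match) return null;
  try {
    const parsed: unknown = JSON.parse(match[0]);
    return Array.isArray(parsed) ? parsed : null;
  } catch {
    return null; // not valid JSON; a caller could fall back to vector ranking
  }
}

const llmOutput = "Ranked IDs: [3, 1, 2]. Notes: [1] is only loosely related.";

// The greedy pattern runs from the first "[" to the last "]",
// capturing text that is not valid JSON.
const greedy = llmOutput.match(/\[[\s\S]*\]/)?.[0];

console.log(greedy);
console.log(extractFirstJsonArray(llmOutput));
```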

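Scope of impact

| File | Changes |
| --- | --- |
| src/services/aiService.ts | Added generateHyDEQuery() and searchRepositoriesWithSemanticReranking() |
| src/services/vectorSearchService.ts | Refactored buildEmbeddingText(), added EMBEDDING_FORMAT_VERSION, adjusted the default threshold |
| src/components/SearchBar.tsx | Integrated HyDE, switched to true semantic reranking, added keyword boost |
| src/components/settings/VectorSearchSettings.tsx | Search parameter configuration UI |
| src/types/index.ts | New VectorSearchConfig fields |
| src/store/useAppStore.ts | Migration logic |
| cloudflare-worker/src/index.ts | Synced the default threshold |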

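Deployment notes

After the first deployment, it is recommended to trigger a "Rebuild vector index" (non-incremental) once so that all vectors use the new structured text format.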

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added persisted vector search controls, including configurable result threshold, Top-K, optional HyDE preprocessing, and optional semantic reranking.
    • Search relevance is improved with keyword boosting and conditional LLM-based semantic reranking (with safe fallback to vector ranking).
  • Bug Fixes

    • Updated the default similarity threshold to 0.35 and migrated older saved settings to include new search parameters.
    • Improved consistency between stored embeddings and indexing by introducing an embedding format version and triggering reindexing when formats change.

…tured embeddings

- Replace fake 'AI reranking' (keyword substring matching) with true LLM-based semantic reranking following the gist reranking pattern
- Add HyDE (Hypothetical Document Embedding) query preprocessing for better recall on short/Chinese/ambiguous queries, with 5s timeout and AbortSignal cleanup
- Restructure buildEmbeddingText with field labels (Repository:, Description:, Topics:, etc.) and dedup logic
- Add lightweight keyword boost for vector search results (name/description/tag exact match bonus)
- Make search threshold, topK, HyDE toggle, and reranking toggle user-configurable in VectorSearchSettings UI
- Add EMBEDDING_FORMAT_VERSION tracking to force re-index when embedding text format changes
- Fix greedy regex in both gist and semantic reranking JSON parsing (.* → .*?)
- Fix variable shadowing of translation function t in keyword boost lambdas
- Bump default threshold from 0.3 to 0.35

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

coderabbitai Bot (Contributor) commented Jun 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bcb8b310-9e9c-488d-81d6-09ee0980d335

📥 Commits

Reviewing files that changed from the base of the PR and between 219b852 and 6dbfb7f.

📒 Files selected for processing (4)
  • src/components/SearchBar.tsx
  • src/components/settings/VectorSearchSettings.tsx
  • src/services/aiService.ts
  • src/services/vectorSearchService.ts
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/services/aiService.ts
  • src/components/SearchBar.tsx
  • src/services/vectorSearchService.ts
  • src/components/settings/VectorSearchSettings.tsx

📝 Walkthrough

Adds HyDE preprocessing and semantic reranking to AI search, introduces embedding format version 2 with forced reindexing support, exposes new vector search controls in settings, and raises the default similarity threshold to 0.35 in the worker and search service.

Changes

Vector Search Enhancements

| Layer | File(s) | Summary |
| --- | --- | --- |
| VectorSearchConfig type and store migration | src/types/index.ts, src/store/useAppStore.ts | VectorSearchConfig gains searchThreshold, searchTopK, enableHyDE, enableReranking, embeddingFormatVersion; store migration backfills these fields with defaults for existing persisted state. |
| Embedding text format v2 and threshold defaults | src/services/vectorSearchService.ts, cloudflare-worker/src/index.ts | buildEmbeddingText is rewritten to a labeled embedding format with EMBEDDING_FORMAT_VERSION=2; default similarity threshold is raised to 0.35 in both the service and the worker. |
| Incremental indexing with format version detection | src/services/vectorSearchService.ts | indexAllRepos options add formatVersion and currentFormatVersion; incremental mode forces reindexing when the stored format version is older than the current one. |
| New AIService methods: HyDE and semantic reranking | src/services/aiService.ts | Adds generateHyDEQuery and searchRepositoriesWithSemanticReranking, and changes searchGistsWithReranking to extract the first bracketed JSON array non-greedily. |
| SearchBar: HyDE, keyword boosting, semantic reranking | src/components/SearchBar.tsx | Vector search can now generate a HyDE query, applies metadata keyword score boosting, uses configurable topK/threshold, and conditionally calls semantic reranking with fallback behavior. |
| VectorSearchSettings: search parameters UI and format version wiring | src/components/settings/VectorSearchSettings.tsx | Adds the Search Parameters section with threshold, topK, HyDE, and reranking controls; persists embeddingFormatVersion; passes format version data into indexing; updates disabled states and section numbering. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • AmintaCCCP/GithubStarsManager#228: Introduced the Vectorize worker query endpoint and vector search service behavior that this PR adjusts with a higher default threshold.
  • AmintaCCCP/GithubStarsManager#230: Added incremental indexing via vector_indexed_at and indexAllRepos incremental flow that this PR extends with format-version-based reindexing.

Poem

🐇 I hop through queries, bright and new,
With HyDE dreams and reranks too.
Labeled embeddings, version two,
Find the stars that shine right through.
Thresholds hum at thirty-five,
And bunny-search feels quite alive!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main vector search improvements, including HyDE, semantic reranking, and structured embeddings. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist Bot left a comment

Code Review

This pull request introduces advanced search enhancements, including HyDE (Hypothetical Document Embedding) query preprocessing, lightweight keyword boosting, and LLM-based semantic reranking, along with a structured embedding text format and new search configuration settings. The review feedback highlights three key issues: first, the embeddingFormatVersion is not updated in the store after indexing, which breaks incremental indexing by always forcing a full rebuild; second, a potential unhandled promise rejection could occur if the HyDE query is aborted after timing out; and third, a defensive check is needed for r.metadata during keyword boosting to prevent potential runtime crashes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +366 to +367
formatVersion: vectorSearchConfig.embeddingFormatVersion,
currentFormatVersion: EMBEDDING_FORMAT_VERSION,

high

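Problem: After handleRebuildIndex and handleIncrementalIndex complete successfully, embeddingFormatVersion in the store is not updated to the latest EMBEDDING_FORMAT_VERSION (2).

Consequence: vectorSearchConfig.embeddingFormatVersion stays at the old value (for example, 1). The next time the user runs an incremental index, the system still detects a version mismatch (options.formatVersion < options.currentFormatVersion is always true) and is forced to fully reindex every repository. This completely defeats incremental indexing and wastes a large amount of API tokens and time.

Suggested fix: In the success callbacks of handleRebuildIndex and handleIncrementalIndex, call setVectorSearchConfig({ embeddingFormatVersion: EMBEDDING_FORMAT_VERSION }) to update the version number in the store. Also remember to add setVectorSearchConfig to the dependency arrays of both useCallback hooks.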
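
The failure mode described above can be reproduced with a small sketch. `VersionedConfig`, `needsFullReindex`, and `markIndexed` are hypothetical names for this example; only the comparison and the constant's value (2) are taken from the comment.

```typescript
const EMBEDDING_FORMAT_VERSION = 2; // value stated in the review comment

// Hypothetical, minimal stand-in for the persisted config slice.
interface VersionedConfig {
  embeddingFormatVersion: number;
}

// Mirrors the quoted check: an older stored version forces a full reindex.
function needsFullReindex(config: VersionedConfig): boolean {
  return config.embeddingFormatVersion < EMBEDDING_FORMAT_VERSION;
}

// What a success callback would persist (stands in for setVectorSearchConfig).
function markIndexed(config: VersionedConfig): VersionedConfig {
  return { ...config, embeddingFormatVersion: EMBEDDING_FORMAT_VERSION };
}

const beforeIndexing: VersionedConfig = { embeddingFormatVersion: 1 };
const afterIndexing = markIndexed(beforeIndexing);

console.log(needsFullReindex(beforeIndexing)); // stale version: full reindex required
console.log(needsFullReindex(afterIndexing)); // version persisted: incremental path is usable again
```

Without the `markIndexed` step, every later run sees the stale version and takes the full-reindex branch again.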

Comment on lines +552 to +560
embeddingQuery = await Promise.race([
  hydeService.generateHyDEQuery(searchQuery, hydeAbort.signal),
  new Promise<string>((resolve) => {
    hydeTimer = setTimeout(() => {
      hydeAbort.abort();
      resolve(searchQuery);
    }, 5000);
  }),
]);

medium

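If the timeout fires, hydeAbort.abort() is called, which aborts the underlying generateHyDEQuery request and makes it throw an AbortError. Because Promise.race has already resolved via the timeout by then, this asynchronously thrown AbortError cannot be caught by the outer try-catch, so the browser reports an "Unhandled Promise Rejection" warning.

To handle this gracefully, append .catch(() => searchQuery) directly to the generateHyDEQuery call. Whether the call fails normally or is aborted, it then falls back safely to the original searchQuery and avoids any uncaught exception.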

Suggested change
 embeddingQuery = await Promise.race([
-  hydeService.generateHyDEQuery(searchQuery, hydeAbort.signal),
+  hydeService.generateHyDEQuery(searchQuery, hydeAbort.signal).catch(() => searchQuery),
   new Promise<string>((resolve) => {
     hydeTimer = setTimeout(() => {
       hydeAbort.abort();
       resolve(searchQuery);
     }, 5000);
   }),
 ]);
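
The timeout-with-fallback pattern in this suggestion can be sketched as a standalone helper. `withTimeoutFallback` is a hypothetical name and not part of the PR; the task callback stands in for generateHyDEQuery and the fallback for the original searchQuery. It assumes a runtime with a global AbortController (modern browsers or recent Node.js).

```typescript
// Resolves with the task's result, or with `fallback` if the task fails
// or takes longer than `timeoutMs`.
async function withTimeoutFallback<T>(
  task: (signal: AbortSignal) => Promise<T>,
  fallback: T,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    return await Promise.race([
      // Any failure, including a late AbortError, becomes the fallback value.
      task(controller.signal).catch(() => fallback),
      new Promise<T>((resolve) => {
        timer = setTimeout(() => {
          controller.abort(); // cancel the underlying request
          resolve(fallback);
        }, timeoutMs);
      }),
    ]);
  } finally {
    if (timer !== undefined) clearTimeout(timer); // clear the timer on every path
  }
}
```

A call such as `withTimeoutFallback((signal) => generate(query, signal), query, 5000)` would then never reject, where `generate` is whatever function produces the HyDE text.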

Comment on lines +586 to +595
const boostedResults = vectorResults.map(r => {
  let bonus = 0;
  const name = (r.metadata.full_name || '').toLowerCase();
  const desc = (r.metadata.description || '').toLowerCase();
  const tags = (r.metadata.tags || []).map(tag => tag.toLowerCase());
  if (name.includes(queryLower)) bonus += 0.05;
  if (desc.includes(queryLower)) bonus += 0.03;
  if (tags.some(tag => tag.includes(queryLower))) bonus += 0.02;
  return { ...r, score: r.score + bonus };
});

medium

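The keyword boost accesses properties of r.metadata directly (for example r.metadata.full_name). If r.metadata is null or undefined, because of inconsistent legacy index data or an abnormal Worker response, this throws a TypeError and crashes the entire search.

Add a defensive check here so that properties are only accessed when r.metadata exists.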

Suggested change
 const boostedResults = vectorResults.map(r => {
+  if (!r.metadata) return r;
   let bonus = 0;
   const name = (r.metadata.full_name || '').toLowerCase();
   const desc = (r.metadata.description || '').toLowerCase();
   const tags = (r.metadata.tags || []).map(tag => tag.toLowerCase());
   if (name.includes(queryLower)) bonus += 0.05;
   if (desc.includes(queryLower)) bonus += 0.03;
   if (tags.some(tag => tag.includes(queryLower))) bonus += 0.02;
   return { ...r, score: r.score + bonus };
 });
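
A self-contained variant of the null-safe boost is sketched below. The `VectorHit` type and the function name `applyKeywordBoost` are assumptions for illustration; the bonus weights (0.05, 0.03, 0.02) are taken from the snippet above. It uses optional chaining rather than an early return, which has the same effect for hits without metadata.

```typescript
// Hypothetical result shape modeled on the snippet above.
interface VectorHit {
  id: string;
  score: number;
  metadata?: { full_name?: string; description?: string; tags?: string[] } | null;
}

// Adds a small bonus for substring matches; hits without metadata pass through unchanged.
function applyKeywordBoost(hits: VectorHit[], query: string): VectorHit[] {
  const queryLower = query.trim().toLowerCase();
  if (!queryLower) return hits; // nothing to match against
  return hits.map((hit) => {
    const name = (hit.metadata?.full_name ?? "").toLowerCase();
    const desc = (hit.metadata?.description ?? "").toLowerCase();
    const tags = (hit.metadata?.tags ?? []).map((tag) => tag.toLowerCase());
    let bonus = 0;
    if (name.includes(queryLower)) bonus += 0.05;
    if (desc.includes(queryLower)) bonus += 0.03;
    if (tags.some((tag) => tag.includes(queryLower))) bonus += 0.02;
    return { ...hit, score: hit.score + bonus };
  });
}
```

Because the bonuses are small relative to typical similarity scores, they act as a tie-breaker among close matches and do not override the vector ranking.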

coderabbitai Bot (Contributor) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/components/SearchBar.tsx (1)

609-632: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Preserve the LLM rerank order after filtering.

Line 627 calls applyFilters, which sorts by the active UI sort before rerankSucceeded is checked, so successful semantic reranking is effectively overwritten. Reapply an LLM order map after filtering, and cache that order for the replay path that currently sorts only by vector score.

Proposed fix direction
+                const rerankOrder = rerankSucceeded
+                  ? new Map(reranked.map((repo, index) => [String(repo.id), index]))
+                  : null;
                 const finalFiltered = applyFilters([...reranked]);
-                if (!rerankSucceeded) {
+                if (rerankOrder) {
+                  finalFiltered.sort(
+                    (a, b) => (rerankOrder.get(String(a.id)) ?? Number.MAX_SAFE_INTEGER)
+                      - (rerankOrder.get(String(b.id)) ?? Number.MAX_SAFE_INTEGER)
+                  );
+                } else {
                   finalFiltered.sort((a, b) => (scoreMap.get(String(b.id)) ?? 0) - (scoreMap.get(String(a.id)) ?? 0));
                 }

Also extend vectorScoreMapRef to store the rerank order and use it in the cached re-filter path.
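
The ordering problem can be shown in isolation. In this sketch `filterAndSortByStars` is a hypothetical stand-in for applyFilters (the real function's sort keys are not shown in this thread), and `restoreRerankOrder` is an illustrative helper that reapplies the LLM order with the same Map-based approach as the proposed fix.

```typescript
// Reorders `items` to follow `orderedIds`; items absent from the list
// keep their relative order at the end.
function restoreRerankOrder<T extends { id: number | string }>(
  items: T[],
  orderedIds: Array<number | string>,
): T[] {
  const rank = new Map<string, number>();
  orderedIds.forEach((id, index) => rank.set(String(id), index));
  const rankOf = (item: T): number => rank.get(String(item.id)) ?? Number.MAX_SAFE_INTEGER;
  return [...items].sort((a, b) => rankOf(a) - rankOf(b));
}

// Hypothetical stand-in for applyFilters: filters, then sorts by the active UI sort.
interface StarredRepo {
  id: number;
  stars: number;
  archived: boolean;
}
function filterAndSortByStars(repos: StarredRepo[]): StarredRepo[] {
  return repos.filter((r) => !r.archived).sort((a, b) => b.stars - a.stars);
}

const rerankedByLlm: StarredRepo[] = [
  { id: 3, stars: 10, archived: false },
  { id: 1, stars: 500, archived: false },
  { id: 4, stars: 50, archived: true },
  { id: 2, stars: 900, archived: false },
];

const filtered = filterAndSortByStars(rerankedByLlm); // LLM order is lost here
const finalOrder = restoreRerankOrder(filtered, rerankedByLlm.map((r) => r.id));

console.log(filtered.map((r) => r.id));
console.log(finalOrder.map((r) => r.id));
```

Caching the `orderedIds` list next to the vector score map would let a cached re-filter path apply the same restore step.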

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/components/settings/VectorSearchSettings.tsx`:
- Around line 366-367: The reindex flow in VectorSearchSettings is leaving
format-version upgrades unreachable and not updating the stored embedding
version after a successful run. Adjust the incremental/reindex gating logic so a
pure embedding-format change can trigger a one-shot reindex even when
unindexedCount is 0, and in the same success path persist embeddingFormatVersion
to EMBEDDING_FORMAT_VERSION alongside the existing VectorSearchConfig updates.
Focus the fix around the reindex trigger and save logic that uses
vectorSearchConfig and EMBEDDING_FORMAT_VERSION.
- Around line 948-1035: The new vector search controls need proper accessibility
wiring: in VectorSearchSettings, associate the “Similarity Threshold” and “Top
K” inputs with their labels using stable label/input linkage, and make the HyDE
and Reranking toggle buttons expose an accessible name plus switch state. Update
the controls around formSearchThreshold, formSearchTopK, formEnableHyDE, and
formEnableReranking so screen readers can announce each control’s purpose and
on/off status, using appropriate accessible attributes on the toggle buttons and
label associations for the inputs.

In `@src/services/aiService.ts`:
- Around line 656-686: The semantic reranking flow in
searchRepositoriesWithSemanticReranking should accept and forward an AbortSignal
so SearchBar can cancel or time out the LLM request. Update the method signature
to include signal, then pass that signal through to this.requestText alongside
system, user, temperature, and maxTokens. Make sure any call sites of
searchRepositoriesWithSemanticReranking are updated to supply the existing
signal from the search flow.

In `@src/services/vectorSearchService.ts`:
- Around line 510-512: The format version check in vectorSearchService should
treat absent stored versions as v1 so older vectors still get reindexed. Update
the format-change logic around formatVersionChanged to default a missing
options.formatVersion to 1 before comparing against
options.currentFormatVersion, while keeping the existing behavior when both
versions are present. Use the formatVersionChanged condition and the surrounding
reindex decision in vectorSearchService to locate and adjust the comparison.

---

Outside diff comments:
In `@src/components/SearchBar.tsx`:
- Around line 609-632: Preserve the semantic rerank order in SearchBar’s
reranking flow: after `AIService.searchRepositoriesWithSemanticReranking`
succeeds, `applyFilters([...reranked])` is re-sorting the results by the active
UI sort and wiping out the LLM order. Update the `SearchBar` search pipeline to
retain a rerank order map (built from the `reranked` array) and reapply that
order after `applyFilters`, while keeping the existing vector-score fallback
only when `rerankSucceeded` is false. Also extend `vectorScoreMapRef` to cache
the rerank order so the replay/cached re-filter path can restore the same LLM
ordering.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3629195f-1a26-44b1-8c0a-dadcd735bb48

📥 Commits

Reviewing files that changed from the base of the PR and between e937158 and 219b852.

📒 Files selected for processing (7)
  • cloudflare-worker/src/index.ts
  • src/components/SearchBar.tsx
  • src/components/settings/VectorSearchSettings.tsx
  • src/services/aiService.ts
  • src/services/vectorSearchService.ts
  • src/store/useAppStore.ts
  • src/types/index.ts

Comment thread src/components/settings/VectorSearchSettings.tsx
Comment thread src/components/settings/VectorSearchSettings.tsx
Comment thread src/services/aiService.ts Outdated
Comment thread src/services/vectorSearchService.ts Outdated
- Fix HyDE AbortError: catch abort on generateHyDEQuery to prevent unhandled rejection
- Fix r.metadata null safety: use optional chaining for keyword boost
- Fix formatVersionChanged: default missing version to v1 so older vectors get reindexed
- Fix applyFilters overwriting LLM rerank order: preserve rerankOrder map and reapply after filtering
- Fix signal forwarding: pass AbortSignal through semantic reranking to requestText
- Fix incremental index button: keep enabled when format version needs upgrade
- Fix embeddingFormatVersion not persisted: update store after both rebuild and incremental index
- Add accessibility: htmlFor/id for inputs, role/aria-checked for toggle switches

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@AmintaCCCP AmintaCCCP merged commit b9529a6 into main Jun 27, 2026
5 checks passed
@AmintaCCCP AmintaCCCP deleted the feature/optimize-vector-search branch June 27, 2026 15:30