Skip to content

feat(tags): restore cross-subtree retrieval with legacy schema compatibility#1205

Open
13ernkastel wants to merge 4 commits intovolcengine:mainfrom
13ernkastel:codex/fix-tags-write-type
Open

feat(tags): restore cross-subtree retrieval with legacy schema compatibility#1205
13ernkastel wants to merge 4 commits intovolcengine:mainfrom
13ernkastel:codex/fix-tags-write-type

Conversation

@13ernkastel
Copy link
Copy Markdown
Contributor

@13ernkastel 13ernkastel commented Apr 3, 2026

Summary

  • restore tag-based cross-subtree retrieval on top of the latest main
  • reintroduce canonical tag parsing/normalization for resource indexing and query expansion
  • preserve compatibility with existing deployments whose context collections still store tags as string

Why

This PR re-lands the reverted retrieval feature against the latest codebase, following zhoujh01's recommendation here:
#1162 (comment)

The regression after #1162 came from a type mismatch on the write path: tags=[] / list[str] could reach a collection schema that still validated tags as a scalar string.

Compatibility / migration

  • newly created context collections now define tags as list<string>
  • existing live collections that still define tags as string remain writable because the upsert path serializes list tags before persistence
  • no manual migration is required for existing deployments

Checks

  • tests/storage/test_collection_schemas.py covers new-schema writes, legacy-string compatibility, empty-tag handling, and queue-handler-to-legacy-schema persistence
  • tests/storage/test_semantic_processor_lifecycle_lock.py
  • tests/utils/test_tag_utils.py
  • tests/retrieve/test_hierarchical_retriever_tags.py
  • tests/retrieve/test_hierarchical_retriever_rerank.py
  • tests/retrieve/test_hierarchical_retriever_target_dirs.py
  • tests/server/test_api_resources.py
  • tests/server/test_api_search.py -k forwards_tags
  • tests/unit/test_vectorize_file_strategy.py
  • tests/client/test_search.py -k forward

Notes

I could not run the full environment-dependent search execution tests locally because this machine's native vector engine build is missing PersistStore, and the broader client/integration flows also require extra local config artifacts such as ov.conf. To compensate, this PR adds regression coverage around the storage, retrieval, API forwarding, and client-wrapper boundaries that exercise the feature paths available in this environment.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 3, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 90
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add tag parsing/normalization utilities

Relevant files:

  • openviking/utils/tag_utils.py
  • tests/utils/test_tag_utils.py

Sub-PR theme: Update collection schema to list tags with legacy compatibility

Relevant files:

  • openviking/storage/collection_schemas.py
  • openviking/storage/viking_vector_index_backend.py
  • tests/storage/test_collection_schemas.py

Sub-PR theme: Add tag-based cross-subtree retrieval expansion

Relevant files:

  • openviking/retrieve/hierarchical_retriever.py
  • tests/retrieve/test_hierarchical_retriever_tags.py

⚡ Recommended focus areas for review

Missing License Header

New Python file in openviking/ directory lacks the required AGPL-3.0 license header.

"""Tag parsing and lightweight extraction helpers."""
Type Safety Issue

Dataclass field tags is annotated as List[str] but defaults to None, which violates type safety. parse_tags() always returns a List[str], so use a default empty list or Optional[List[str]].

tags: List[str] = None

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 3, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

@13ernkastel
Copy link
Copy Markdown
Contributor Author

Addressed the PR Reviewer Guide follow-ups in 4274fea:

  • added the required AGPL header to openviking/utils/tag_utils.py
  • fixed SemanticMsg.tags to use a type-safe field(default_factory=list) instead of List[str] = None
  • added a regression test to keep the missing-tags path normalized to []

Re-ran the focused tag/storage checks after the patch.

@13ernkastel 13ernkastel changed the title feat(retrieve): restore tags metadata for cross-subtree retrieval feat(tags): restore cross-subtree retrieval with legacy schema compatibility Apr 3, 2026
@13ernkastel 13ernkastel force-pushed the codex/fix-tags-write-type branch from 4274fea to 95b8daa Compare April 3, 2026 08:03
@13ernkastel
Copy link
Copy Markdown
Contributor Author

Latest update on the branch:

  • cleaned the PR history down to 3 logical commits
  • added a queue-handler -> legacy-string-schema regression in tests/storage/test_collection_schemas.py
  • re-ran the tag-focused feature suite listed in the PR body

Current branch head: 95b8daa

Local limitation remains the same: full search execution cases in tests/server/test_api_search.py still depend on a native vector engine build that exposes PersistStore, which is not available in this environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

1 participant