feat(metadata): support server-side metadata search evals #10
Merged
Pull request overview
Adds a new “server-metadata-search” real-data metadata eval mode that compares baseline vs candidate AutoMem server code directly (without transforming restored data), while improving run metadata capture and reproducibility for worktree-based runs.
Changes:
- Extend `scripts/real_data_metadata_eval.sh` to support separate baseline/candidate AutoMem checkouts + Python envs, add runtime env-file allowlist loading, and introduce a server-side vector identity preflight path.
- Add/extend unit + shell-level tests covering runtime env-file redaction, restore-plan rendering for separate checkouts, and report run-label propagation.
- Update the eval report generator to accept and emit a `run_label` in both markdown output and metrics JSON; update README workflow notes for worktrees.
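The allowlisted env-file loading and redaction described above could look roughly like the following. This is a minimal sketch, not the script's actual logic: the key names in `ALLOWED_KEYS` and the helper names `load_allowlisted_env` and `redact` are hypothetical, and the real implementation lives in a shell script whose parsing rules are not shown here.

```python
from typing import Dict, Iterable

# Hypothetical allowlist: the real key names are not listed in this PR page.
ALLOWED_KEYS = frozenset({"EMBEDDING_PROVIDER", "EMBEDDING_MODEL", "OPENAI_API_KEY"})


def load_allowlisted_env(lines: Iterable[str], allowed=ALLOWED_KEYS) -> Dict[str, str]:
    """Parse KEY=VALUE lines and keep only keys that appear in the allowlist."""
    env: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        if key in allowed:
            # Drop surrounding quotes, if any.
            env[key] = value.strip().strip('"').strip("'")
    return env


def redact(env: Dict[str, str]) -> Dict[str, str]:
    """Return a copy that is safe to log: non-empty values are masked."""
    return {key: "***" if value else "" for key, value in env.items()}
```

Only the filtered mapping would be handed to the Docker stacks, and only the redacted copy would be written to logs or run metadata, so unrelated variables in the file never reach Compose.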
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/test_real_data_metadata_eval_shell.py | Adds integration-style shell tests for runtime env allowlist loading/redaction and server-variant restore-plan behavior. |
| scripts/real_data_metadata_eval.sh | Adds server-side eval variant support, baseline/candidate checkout + Python controls, allowlisted runtime env-file loading, and run labeling. |
| runners/test_run_metadata_ab_eval.py | Updates report-writing test coverage to validate run-label rendering. |
| runners/test_check_vector_identity.py | New unit tests for vector identity comparison + artifact writing. |
| runners/run_metadata_ab_eval.py | Adds --run-label CLI option and includes run label in report + metrics output. |
| runners/check_vector_identity.py | New runner that compares baseline/candidate Qdrant vectors and writes preflight/summary artifacts. |
| README.md | Updates worktree guidance to cover server-code variants and separate baseline/candidate AutoMem directories. |
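As an illustration of what a vector identity check like the one in `runners/check_vector_identity.py` might do, here is a minimal sketch that compares two in-memory mappings of point ID to vector. It assumes float32 vectors and string IDs; the function names and the shape of the result are hypothetical, and how the real runner reads vectors from Qdrant is not shown in this PR page.

```python
import hashlib
import struct
from typing import Dict, Sequence


def vector_digest(vector: Sequence[float]) -> str:
    """Hash the little-endian float32 encoding of a vector (an assumed encoding)."""
    packed = struct.pack(f"<{len(vector)}f", *vector)
    return hashlib.sha256(packed).hexdigest()


def compare_vectors(
    baseline: Dict[str, Sequence[float]],
    candidate: Dict[str, Sequence[float]],
) -> dict:
    """Report IDs that are missing, extra, or changed between the two sets."""
    base_ids, cand_ids = set(baseline), set(candidate)
    missing = sorted(base_ids - cand_ids)
    extra = sorted(cand_ids - base_ids)
    changed = sorted(
        pid
        for pid in base_ids & cand_ids
        if vector_digest(baseline[pid]) != vector_digest(candidate[pid])
    )
    return {
        "baseline_count": len(baseline),
        "candidate_count": len(candidate),
        "missing": missing,
        "extra": extra,
        "changed": changed,
        "identical": not (missing or extra or changed),
    }
```

A preflight step could refuse to run the eval unless `identical` is true, which is what makes a recall difference attributable to server code rather than to changed embeddings.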
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
jack-arturo added a commit to verygoodplugins/automem that referenced this pull request Jun 10, 2026
## Summary
- Add a bounded evidence-based metadata sidecar candidate channel to `/recall`, controlled by `RECALL_METADATA_SEARCH_ENABLED`.
- Search whitelisted metadata values without replacing content vectors, generating production tags, mutating graph content, or changing the HTTP API.
- Use field words only for scoring/disambiguation; this is not keyword-intent routing.
- Document metadata storage, recall response shape, update/enrichment behavior, consolidation notes, and search behavior for issue #110.

## Breaking Changes
None.

## Validation
- `.venv/bin/python -m pytest -q` -> 465 passed, 12 skipped.
- Real-data eval: `data/sweep_runs/20260610T174320Z-metadata-server-metadata-search`.
- Metadata hit@5: 0.140 -> 0.770; MRR: 0.126 -> 0.672.
- Qdrant vectors stayed byte-identical: 9,921 baseline and 9,921 candidate vectors.
- Preserve/mixed suite had zero non-OK statuses.

Eval harness PR: verygoodplugins/automem-evals#10

Closes #110

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
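For readers unfamiliar with the hit@5 and MRR figures quoted in the validation notes, the sketch below shows the textbook definitions of both metrics. It is an assumption that the eval harness uses these exact definitions (for example, it may truncate MRR at a cutoff); the function names and the input shape are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def hit_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """1.0 if any relevant ID appears in the top k results, else 0.0."""
    return 1.0 if any(rid in relevant for rid in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid in relevant:
            return 1.0 / rank
    return 0.0


def summarize(queries: Iterable[Tuple[List[str], Set[str]]], k: int = 5) -> dict:
    """Average both metrics over (ranked_ids, relevant_ids) pairs."""
    pairs = list(queries)
    if not pairs:
        return {f"hit@{k}": 0.0, "mrr": 0.0}
    return {
        f"hit@{k}": sum(hit_at_k(r, rel, k) for r, rel in pairs) / len(pairs),
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in pairs) / len(pairs),
    }
```

Running `summarize` once on baseline results and once on candidate results for the same query set yields the kind of before/after pair reported above.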
Summary
- `server-metadata-search` eval support for comparing two AutoMem code checkouts without transforming restored data.
- `AUTOMEM_RUNTIME_ENV_FILE` loading so both Docker stacks share embedding runtime config without leaking unrelated env into Compose.

Breaking Changes
None.

Validation
- `python3 -m unittest discover -s runners -p 'test_*.py'` -> 186 tests OK, 1 skipped.
- `python3 -m unittest scripts/test_real_data_metadata_eval_shell.py` -> 6 tests OK.
- `data/sweep_runs/20260610T174320Z-metadata-server-metadata-search`.

Related AutoMem PR: verygoodplugins/automem#177