
feat(metadata): support server-side metadata search evals #10

Merged
jack-arturo merged 3 commits into main from feat/metadata-signal-diagnosis on Jun 10, 2026

@jack-arturo

Member

Summary

  • Add server-metadata-search eval support for comparing two AutoMem code checkouts without transforming restored data.
  • Add baseline/candidate AutoMem directory and Python environment controls, vector identity checks, and a metadata sidecar run label.
  • Add allowlist-filtered loading of AUTOMEM_RUNTIME_ENV_FILE so both Docker stacks share the embedding runtime config without leaking unrelated environment variables into Compose.
  • Update tests and README workflow for server-side metadata search evals.
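The vector identity check can be thought of as a comparison keyed by point ID. The sketch below is minimal and rests on two assumptions. First, the baseline and candidate vectors have already been fetched, for example by scrolling each Qdrant collection with vectors included. Second, they are held as dicts that map a point ID to a list of floats. The function and key names are hypothetical and do not reflect the interface of `runners/check_vector_identity.py`.

```python
import struct
from typing import Dict, List, Sequence


def vector_bytes(vec: Sequence[float]) -> bytes:
    """Pack a vector as little-endian float64 so equality is exact, byte for byte."""
    return struct.pack(f"<{len(vec)}d", *vec)


def compare_vectors(
    baseline: Dict[str, List[float]], candidate: Dict[str, List[float]]
) -> dict:
    """Compare two point-ID -> vector maps and summarize any differences."""
    missing = sorted(set(baseline) - set(candidate))
    extra = sorted(set(candidate) - set(baseline))
    # Vectors of different length pack to different byte strings, so they mismatch too.
    mismatched = sorted(
        pid
        for pid in set(baseline) & set(candidate)
        if vector_bytes(baseline[pid]) != vector_bytes(candidate[pid])
    )
    return {
        "baseline_count": len(baseline),
        "candidate_count": len(candidate),
        "missing_in_candidate": missing,
        "extra_in_candidate": extra,
        "mismatched": mismatched,
        "identical": not (missing or extra or mismatched),
    }
```

The sketch compares packed bytes instead of applying a float tolerance, which matches the "byte-identical" criterion. Any change to a stored vector, however small, is reported as a mismatch.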

Breaking Changes

None.

Validation

  • python3 -m unittest discover -s runners -p 'test_*.py' -> 186 tests OK, 1 skipped.
  • python3 -m unittest scripts/test_real_data_metadata_eval_shell.py -> 6 tests OK.
  • Real-data eval: data/sweep_runs/20260610T174320Z-metadata-server-metadata-search.
  • Metadata hit@5: 0.140 -> 0.770; MRR: 0.126 -> 0.672.
  • Vector identity passed: 9,921 baseline and 9,921 candidate vectors were byte-identical.
  • Preserve/mixed suite had zero non-OK statuses.
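The two headline metrics use standard definitions. Hit@5 is the fraction of queries that have at least one relevant memory in the top five results. MRR is the mean of the reciprocal rank of the first relevant result. The minimal sketch below assumes each query is a (ranked IDs, relevant IDs) pair and applies no rank cutoff to MRR. The harness may score runs differently.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """1.0 if any relevant ID appears in the top k results, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant result (1-based), or 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def summarize(queries: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> dict:
    """Average hit@k and MRR over (ranked_ids, relevant_ids) pairs."""
    n = len(queries)
    if n == 0:
        return {f"hit@{k}": 0.0, "mrr": 0.0}
    return {
        f"hit@{k}": sum(hit_at_k(r, rel, k) for r, rel in queries) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in queries) / n,
    }
```

Under these definitions, a query whose first relevant hit is at rank 6 adds 1/6 to the MRR numerator but nothing to hit@5.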

Related AutoMem PR: verygoodplugins/automem#177

Copilot AI left a comment

Contributor


Pull request overview

Adds a new “server-metadata-search” real-data metadata eval mode that compares baseline vs candidate AutoMem server code directly (without transforming restored data), while improving run metadata capture and reproducibility for worktree-based runs.

Changes:

  • Extend scripts/real_data_metadata_eval.sh to support separate baseline/candidate AutoMem checkouts and Python environments, add allowlist-based runtime env-file loading, and introduce a server-side vector identity preflight path.
  • Add and extend unit and shell-level tests covering runtime env-file redaction, restore-plan rendering for separate checkouts, and report run-label propagation.
  • Update the eval report generator to accept and emit a run_label in both markdown output and metrics JSON; update README workflow notes for worktrees.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/test_real_data_metadata_eval_shell.py | Adds integration-style shell tests for runtime env allowlist loading/redaction and server-variant restore-plan behavior. |
| scripts/real_data_metadata_eval.sh | Adds server-side eval variant support, baseline/candidate checkout and Python controls, allowlisted runtime env-file loading, and run labeling. |
| runners/test_run_metadata_ab_eval.py | Updates report-writing test coverage to validate run-label rendering. |
| runners/test_check_vector_identity.py | New unit tests for vector identity comparison and artifact writing. |
| runners/run_metadata_ab_eval.py | Adds a --run-label CLI option and includes the run label in report and metrics output. |
| runners/check_vector_identity.py | New runner that compares baseline/candidate Qdrant vectors and writes preflight/summary artifacts. |
| README.md | Updates worktree guidance to cover server-code variants and separate baseline/candidate AutoMem directories. |
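The `--run-label` option in `runners/run_metadata_ab_eval.py` carries a single label into both the markdown report and the metrics JSON. The minimal `argparse` sketch below shows that pattern. The `render` helper, the report layout, and the exact JSON key are assumptions for illustration, not the runner's real code.

```python
import argparse
import json
from typing import List, Optional, Tuple


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Metadata A/B eval report (sketch)")
    # Free-form label identifying the run, e.g. the eval variant name.
    parser.add_argument("--run-label", default=None)
    return parser.parse_args(argv)


def render(metrics: dict, run_label: Optional[str]) -> Tuple[str, str]:
    """Return (markdown report, metrics JSON), both carrying the run label if set."""
    payload = dict(metrics)
    lines = ["# Metadata eval report"]
    if run_label:
        payload["run_label"] = run_label
        lines.append(f"Run label: {run_label}")
    for name in sorted(metrics):
        lines.append(f"- {name}: {metrics[name]:.3f}")
    return "\n".join(lines), json.dumps(payload, sort_keys=True)
```

For example, `render({"hit@5": 0.77, "mrr": 0.672}, parse_args(["--run-label", "server-metadata-search"]).run_label)` produces a report and a metrics payload that carry the same label. This makes sidecar runs easy to tell apart when artifacts are compared later.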


Comment thread on scripts/real_data_metadata_eval.sh (outdated)
Comment thread on README.md (outdated)
jack-arturo and others added 2 commits June 10, 2026 20:47
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@jack-arturo jack-arturo merged commit c56fda2 into main Jun 10, 2026
@jack-arturo jack-arturo deleted the feat/metadata-signal-diagnosis branch June 10, 2026 18:55
jack-arturo added a commit to verygoodplugins/automem that referenced this pull request Jun 10, 2026
## Summary
- Add a bounded evidence-based metadata sidecar candidate channel to
`/recall`, controlled by `RECALL_METADATA_SEARCH_ENABLED`.
- Search whitelisted metadata values without replacing content vectors,
generating production tags, mutating graph content, or changing the HTTP
API.
- Use field words only for scoring/disambiguation; this is not
keyword-intent routing.
- Document metadata storage, recall response shape, update/enrichment
behavior, consolidation notes, and search behavior for issue #110.

## Breaking Changes
None.

## Validation
- `.venv/bin/python -m pytest -q` -> 465 passed, 12 skipped.
- Real-data eval:
`data/sweep_runs/20260610T174320Z-metadata-server-metadata-search`.
- Metadata hit@5: 0.140 -> 0.770; MRR: 0.126 -> 0.672.
- Qdrant vectors stayed byte-identical: 9,921 baseline and 9,921
candidate vectors.
- Preserve/mixed suite had zero non-OK statuses.

Eval harness PR:
verygoodplugins/automem-evals#10

Closes #110

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
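The AutoMem commit above describes a metadata sidecar channel that adds recall candidates alongside the content-vector results instead of replacing them. The sketch below shows only that general shape. It does nothing unless `RECALL_METADATA_SEARCH_ENABLED` is set, it scores word overlap against a whitelist of metadata fields, and it returns a bounded candidate list. The field names, the accepted truthy values, and the overlap scoring are all assumptions, not AutoMem's implementation.

```python
import os
from typing import Dict, List, Mapping, Tuple

# Hypothetical whitelist: only these metadata fields are searched.
WHITELISTED_FIELDS = ("project", "source", "component")


def metadata_candidates(
    query: str,
    memories: Mapping[str, Dict[str, str]],
    limit: int = 5,
    env: Mapping[str, str] = os.environ,
) -> List[Tuple[str, int]]:
    """Return up to `limit` (memory_id, score) pairs; empty when the flag is off."""
    if env.get("RECALL_METADATA_SEARCH_ENABLED", "").lower() not in {"1", "true", "yes"}:
        return []
    terms = set(query.lower().split())
    scored = []
    for mem_id, meta in memories.items():
        score = 0
        for field in WHITELISTED_FIELDS:
            value = meta.get(field)
            if isinstance(value, str):
                # Count query words that also appear in this field's value.
                score += len(terms & set(value.lower().split()))
        if score > 0:
            scored.append((mem_id, score))
    # Highest score first, ties broken by ID for a stable order; then bound the list.
    scored.sort(key=lambda t: (-t[1], t[0]))
    return scored[:limit]
```

These candidates would then be merged with the vector results, so the content vectors and the stored memories are never modified.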