Entity-node migration: gate people-entity word-pair noise before creating Entity nodes #181

@jack-arturo

Context

During the #72 production repair rollout (2026-06-10), running migrate_entity_nodes.py --dry-run against the repaired corpus accepted 1,591 entities: people 1,020, organizations 400, tools 118, projects 26, concepts 22, opportunities 5. The existing migration-review heuristics held back another 5,535, and the run reported 3,026 edges.

Migration was deferred at the rollout's final gate because the people category is still noisy:

  • 734 of 1,020 accepted people have exactly one reference, and a random sample of that bucket is ~80% word-pair noise with person shape: anker-power, cache-performance, tcp-proxy, framer-motion, keep-stripe, server-vad, mac-catalyst, …
  • Even the refs ≥ 2 bucket leaks (reference counts in parentheses): google-calendar (8), sri-lanka (9), raspberry-pi (7), mile-method (7), loading-suredash (7), jack-s (8).

Proposed work

  1. Migration gates: add --min-references N and --categories a,b,c to scripts/migrate_entity_nodes.py (it currently has only --dry-run) so a clean slice (non-people + high-reference people) can migrate first.
  2. Word-pair people validation pass: structural detection of common-noun pairs that pass _has_person_name_shape, or the cheap-LLM review path for ambiguous people slugs (per the no-corpus-specific-fixtures direction).
  3. Verb-person fragments: the needs-jack, jack-mentions, demonstrates-steve class survives repair as tags because the leading verbs are missing from _ACTION_PREFIXES; the same pass should handle them.
  4. Noise canonical targets: canonicalize-safe consolidated 8 of them (solana-agent, locomo-benchmark, rag-implementation, decap-cms, loading-suredash, mile-method, watapana-tattoo, +1). Each is now a single slug, so they are easy to remove once classified.
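
A minimal sketch of the gates in item 1, assuming the script uses argparse and that each candidate carries a slug, a category, and a reference count. Candidate, parse_categories, build_parser, and apply_gates are hypothetical names for illustration, not the script's actual internals; only the flag names come from this issue.

```python
from __future__ import annotations

import argparse
from typing import Iterable, NamedTuple, Optional


class Candidate(NamedTuple):
    """Hypothetical shape of one migration candidate."""
    slug: str
    category: str
    references: int


def parse_categories(value: str) -> frozenset[str]:
    """Turn 'a,b,c' into a normalized set; reject an empty list."""
    cats = frozenset(p.strip().lower() for p in value.split(",") if p.strip())
    if not cats:
        raise argparse.ArgumentTypeError("expected a comma-separated category list")
    return cats


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="migrate_entity_nodes.py")
    parser.add_argument("--dry-run", action="store_true")
    # Proposed gates; the defaults keep today's behaviour (no filtering).
    parser.add_argument("--min-references", type=int, default=1, metavar="N")
    parser.add_argument("--categories", type=parse_categories, default=None,
                        metavar="a,b,c")
    return parser


def apply_gates(candidates: Iterable[Candidate],
                min_references: int,
                categories: Optional[frozenset[str]]) -> list[Candidate]:
    """Keep candidates that meet the reference floor and category allow-list."""
    return [
        c for c in candidates
        if c.references >= min_references
        and (categories is None or c.category in categories)
    ]
```

With only these two flags, the clean slice would take two invocations: one with --categories listing the non-people categories, and one with --categories people --min-references N. Since the refs ≥ 2 bucket still leaks, the second run would presumably wait on the validation pass in item 2.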

Full dry-run log: automem-evals data/sweep_runs/prod-rollout-20260610/migration-dry-run.log.

Refs #72, #124, #176, #178, #179.

🤖 Generated with Claude Code
