Skip to content

Conversation

@anantham
Copy link
Owner

Summary

Fixes the xfail test test_shadow_store_upsert_is_idempotent by using consistent canonical account IDs.

Root Cause

The test was failing because:

  • Legacy database contains duplicate usernames with different user_ids (e.g., user_id=8500962 and user_id="vgr" both have username="vgr")
  • Test created edges with inconsistent IDs (numeric user_id vs username)
  • Shadow store's _merge_duplicate_accounts correctly deduplicated but mutated edge source/target IDs
  • Second upsert saw different primary keys and created duplicate edges (19 instead of 10)

Solution

  • Added _canonical_account_id() helper: returns username if available, otherwise user_id
  • Built id_mapping from legacy user_id to canonical account_id
  • Applied mapping consistently when creating both account and edge records
  • Updated test assertions to expect deduplicated counts (accounting for duplicate usernames in legacy data)

Type of Change

  • Bug fix (test fix)
  • New feature
  • Breaking change
  • Technical debt reduction

Testing

  • All 4 tests now pass (previously 1 xfail)
  • Verified with debug scripts that deduplication logic works correctly
  • Confirmed legacy DB has 3 duplicate usernames (vgr, p_millerd, tkstanczak)
  • Both migration tests validate edge upsert idempotency correctly

Review Checklist

  • Code follows commit message format requirements
  • No direct commits to main - all changes via PR
  • Ready for Codex automated review

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

MOTIVATION:
- test_shadow_store_upsert_is_idempotent was marked as xfail
- Test was creating edges with inconsistent IDs (numeric vs username)
- Shadow store's _merge_duplicate_accounts was correctly deduplicating
  but mutating edge source/target IDs, breaking test assumptions
- Legacy database contains duplicate usernames with different user_ids
  (e.g., user_id=8500962 and user_id="vgr" both have username="vgr")

APPROACH:
- Use consistent canonical IDs: username if available, otherwise user_id
- Build id_mapping from legacy user_id to canonical account_id
- Apply mapping when creating both account and edge records
- Update test assertions to expect deduplicated counts

CHANGES:
- tests/test_shadow_store_migration.py: Add _canonical_account_id helper
- tests/test_shadow_store_migration.py: Update both tests to use id_mapping
- tests/test_shadow_store_migration.py: Fix assertions to expect unique counts

IMPACT:
- All tests now pass (4 passed, no xfail)
- Tests correctly validate edge upsert idempotency
- Tests work with legacy data containing duplicate usernames
- Removed xfail marker - issue was test expectations, not code

TESTING:
- Verified with debug scripts that deduplication logic works correctly
- Confirmed legacy DB has 3 duplicate usernames (vgr, p_millerd, tkstanczak)
- Both migration tests pass with consistent ID usage
- All other tests still pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +189 to +192
# Idempotency check: second upsert should not change counts
assert len(accounts_after_first) == expected_account_count
assert len(accounts_after_second) == expected_account_count
assert len(edges_after_first) == len(edges_after_second)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Idempotency test no longer validates edge deduplication

The revised test_shadow_store_upsert_is_idempotent only compares the edge count between the first and second fetch. If the very first upsert_edges already creates duplicates (e.g., inserting 19 edges when only 10 unique edges exist), both edges_after_first and edges_after_second will have the same inflated length and the test will still pass. This means the regression it was written to catch—duplicate edges in the shadow store—can slip through as long as the duplication happens on the initial insert instead of the second. Please assert the count after the first upsert matches the deduplicated expectation (for example, the number of unique (source_id, target_id, direction) records) so the test fails whenever duplicates exist at all.

Useful? React with 👍 / 👎.

MOTIVATION:
- Codex review identified incomplete validation in test_shadow_store_upsert_is_idempotent
- Test only compared edge count between first and second upsert
- If first upsert creates duplicates (19 edges instead of 10), both counts are 19 and test passes
- Regression (duplicate edges in shadow store) can slip through undetected

APPROACH:
- Calculate expected_edge_count from unique (source_id, target_id, direction) tuples
- Use same id_mapping logic as edge_records creation for consistency
- Add assertion after first upsert: len(edges_after_first) == expected_edge_count
- Include descriptive error message showing expected vs actual counts
- This catches duplicates immediately, whether on first or second insert

CHANGES:
- tests/test_shadow_store_migration.py:142-149:
  - Added expected_edge_count calculation using set of unique edge tuples
  - Iterates through legacy_edges with same transformations as edge_records

- tests/test_shadow_store_migration.py:198-202:
  - Added deduplication assertion before idempotency check
  - Validates first upsert creates exactly expected_edge_count edges
  - Descriptive error message for debugging if duplicates exist

IMPACT:
- Fixes P1 Codex review issue: "Idempotency test no longer validates edge deduplication"
- Test now catches duplicate edges regardless of when they're created
- No breaking changes - strengthens existing test coverage
- Provides clear error messages for debugging deduplication failures

TESTING:
- Syntax check passes: python3 -m py_compile
- Logic verified: uses same id_mapping and direction extraction as edge_records
- Error message includes both expected and actual counts for debugging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants