docs(examples): add AP inference validation demo #2

Closed
khan-u wants to merge 6 commits into body-axis-ap-inference from ap-inference-validation-demo

khan-u commented Apr 2, 2026

Description

AP Inference Validation Script

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?

neuroinformatics-unit#945 introduced the AP inference pipeline (_validate_ap) inside collective.py. This PR adds a companion script that provides empirical evidence for when the prior-free inference succeeds and where it has limitations, tested against hand-curated ground truth across 5 datasets (4 species).

What does this PR do?

compute_polarization_AP_inference.py - runs _validate_ap on 5 multi-animal SLEAP datasets (2 flies, 2 mice, 4 gerbils, 5 mice, 2 bees), compares inferred AP ordering against ground truth, stores results in HDF5, and generates figures.


Workflow

The script operates in three passes:

Pass 1 - R×M Selection. Run _validate_ap once per individual per file. R×M is independent of the input node pair (it uses all-keypoint bbox centroids, centered skeletons, SVD, and velocity projections). Select the best individual per file by max R×M.
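The selection step at the end of Pass 1 can be sketched as below. This is a minimal illustration, not the script's code: `select_best_individual` and the `scores` dict are hypothetical names, and the R and M values are the 4Gerbils rows from the results table in this PR. How `_validate_ap` computes R and M is not reproduced here.

```python
# Hypothetical per-individual scores for one file: name -> (R, M).
# Values are the 4Gerbils rows reported in this PR's results table.
scores = {
    "female": (0.05, 0.08),
    "male": (0.17, 0.10),
    "pup shaved": (0.30, 0.06),
    "pup unshaved": (0.39, 0.62),
}


def select_best_individual(scores):
    """Return (name, R*M) for the individual with the largest R*M product."""
    products = {name: r * m for name, (r, m) in scores.items()}
    best = max(products, key=products.get)
    return best, products[best]


best_name, best_rxm = select_best_individual(scores)
print(best_name, round(best_rxm, 2))  # pup unshaved 0.24
```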

Pass 2 - Cross-Individual Ordering Consistency. Before this pass runs, compute_pc1_orderings projects every individual's GT nodes onto that individual's raw PC1 vector (without anterior_sign correction) and ranks them by descending projection. These raw orderings are then compared against the best individual's ordering (strict list equality). This is a consistency check, not a correctness check - it confirms body shape stability across individuals but says nothing about whether the ordering is anatomically correct.
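A rough sketch of the Pass 2 ordering check follows. It assumes a 2D average skeleton per individual and centers on the keypoint mean for simplicity (the pipeline itself is described as using bbox centroids). `pc1_ordering` is a hypothetical stand-in for `compute_pc1_orderings`, and the toy skeletons are made up.

```python
import numpy as np


def pc1_ordering(skeleton, gt_nodes):
    """Rank GT nodes by descending projection onto the raw (unsigned) PC1.

    skeleton: (n_keypoints, 2) array of average keypoint positions.
    gt_nodes: keypoint indices in the ground-truth subset.
    """
    centered = skeleton - skeleton.mean(axis=0)
    # Rows of vt are the principal axes; vt[0] is PC1 (its sign is arbitrary).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    return sorted(gt_nodes, key=lambda i: proj[i], reverse=True)


# Toy skeletons: individual b has the same shape as a, translated.
skel_a = np.array([[2.0, 0.1], [1.0, -0.1], [0.0, 0.0], [-1.5, 0.2], [-3.0, -0.1]])
skel_b = skel_a + np.array([5.0, 3.0])
gt_nodes = [0, 2, 4]

order_a = pc1_ordering(skel_a, gt_nodes)
order_b = pc1_ordering(skel_b, gt_nodes)
consistent = order_a == order_b  # strict list equality, as in Pass 2
print(order_a, order_b, consistent)
```

Because no anterior_sign correction is applied, an ordering and its exact reverse describe the same geometry. This is why Pass 2 can only speak to consistency, not anatomical correctness.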

Pass 3 - Inferred AP Concordance. For each individual, project GT nodes onto the inferred AP axis (anterior_sign × PC1). Test all C(n,2) GT node pairs: concordant if relative ordering matches GT. This tests the full pipeline per individual.
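The pairwise concordance test in Pass 3 might look like the sketch below. The function name `ap_concordance` and the toy inputs are illustrative assumptions; only the rule itself (project onto anterior_sign × PC1, compare every GT pair against the GT ranking, higher rank = more anterior) comes from the description above.

```python
from itertools import combinations

import numpy as np


def ap_concordance(positions, pc1, anterior_sign, gt_rank):
    """Fraction of GT node pairs whose order along the inferred AP axis matches GT.

    positions: (n_keypoints, 2) centered average skeleton.
    pc1: unit vector of shape (2,), the raw first principal axis.
    anterior_sign: +1 or -1, inferred direction of "anterior" along PC1.
    gt_rank: dict node_idx -> rank, where a higher rank is more anterior.
    """
    ap_proj = positions @ (anterior_sign * pc1)
    pairs = list(combinations(gt_rank, 2))
    concordant = sum(
        bool(ap_proj[i] > ap_proj[j]) == (gt_rank[i] > gt_rank[j])
        for i, j in pairs
    )
    return concordant / len(pairs)


# Toy skeleton lying along the x-axis, node 0 most anterior.
positions = np.array([[1.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
pc1 = np.array([1.0, 0.0])
gt_rank = {0: 3, 1: 2, 2: 1}

print(ap_concordance(positions, pc1, +1, gt_rank))  # 1.0
print(ap_concordance(positions, pc1, -1, gt_rank))  # 0.0
```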

Reporting. Read back the H5 file to log the 3-step filter cascade progression (nodes/pairs surviving each step), GT coverage (how many GT nodes survive the lateral filter), and suggested pair analysis (which pair the cascade auto-selects). Since the GT dict contains only a small manually curated subset of nodes, the pipeline often selects nodes outside it - this is expected, not a failure. Cascade stats feed Figure 2.

All results are saved to a timestamped HDF5 file with per-individual nested groups for skeletons, PC1 vectors, velocity projections, concordance, and R×M values.


Figures

Cross-Dataset Comparison

Cross-dataset validation summary

Top row: Average skeletons (best individual per file) with PC1 axis, AP midline, and suggested pair arrow. Arrow color indicates GT membership. Shows whether the pipeline's auto-selected pair makes anatomical sense for each skeleton topology.
Bottom-left: GT node rankings per dataset - the reference standard against which Pass 3 concordance is measured.
Bottom-right: 3-step filter cascade - candidate pair counts at each step, monotonically decreasing. Reveals where each skeleton topology loses candidates (laterally spread skeletons lose at Step 1, one-sided skeletons at Step 2).

Per-File Detail (Best R×M Individual)

A 2×2 detail view for each dataset's best individual. Each tile shows the same average centered skeleton; what differs is the geometric annotation overlaid.

  • Tile 1 - Longitudinal Spread (PC1 Projections): Each keypoint's projection onto PC1. Shows whether GT nodes are well-separated along the body axis - poor separation here is the root cause of Factor 1 failures.
  • Tile 2 - Lateral Spread (PC2 Projections): Each keypoint's projection onto PC2. Shows which nodes sit far off-axis and will be removed by Step 1's lateral filter.
  • Tile 3 - Inferred AP Direction (Velocity Voting): Velocity-projection histogram (blue for +PC1, red for -PC1) with circular mean arrow V_c and bidirectional A↔P arrow. Shows the strength and direction of the velocity signal driving anterior_sign - the Factor 2 diagnostic.
  • Tile 4 - R×M vs Inferred AP Concordance: Scatter of R×M (x) against GT concordance (y) across all individuals in the file. The key diagnostic: shows whether R×M predicts concordance and exposes individuals where it doesn't (Factor 1 failures). Best individual marked with ★.
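To make the idea behind Tile 3 concrete, here is a simplified sketch of sign voting from velocity projections. It is an assumption-laden illustration, not the `_validate_ap` logic. It assumes the animal moves head-first more often than not and takes a per-frame unit body axis as given. It does not compute R or M, and `vote_anterior_sign` is a hypothetical name.

```python
import numpy as np


def vote_anterior_sign(centroids, body_axes):
    """Majority vote on whether motion is mostly along +PC1 or -PC1.

    centroids: (n_frames, 2) centroid positions.
    body_axes: (n_frames, 2) unit PC1 vector per frame (sign arbitrary but fixed).
    """
    vel = np.diff(centroids, axis=0)  # frame-to-frame displacement
    # Row-wise dot product of each velocity with that frame's body axis.
    proj = np.einsum("ij,ij->i", vel, body_axes[:-1])
    n_pos = int(np.sum(proj > 0))  # votes for +PC1 (blue in Tile 3)
    n_neg = int(np.sum(proj < 0))  # votes for -PC1 (red in Tile 3)
    sign = 1 if n_pos >= n_neg else -1
    return sign, n_pos, n_neg


# Toy track: 8 steps along +x, then 2 steps back along -x.
steps = np.array([[1.0, 0.0]] * 8 + [[-1.0, 0.0]] * 2)
centroids = np.vstack([[0.0, 0.0], np.cumsum(steps, axis=0)])
body_axes = np.tile([1.0, 0.0], (len(centroids), 1))

sign, n_pos, n_neg = vote_anterior_sign(centroids, body_axes)
print(sign, n_pos, n_neg)  # 1 8 2
```

A near-even split between the two vote counts corresponds to the weak-signal case where the inferred sign is least trustworthy.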

4Gerbils - pup unshaved (R×M=0.24, 100%). Most informative: 4 individuals with mixed accuracy (100%, 73.3%, 73.3%, 100%).

4Gerbils detail

2Bees - track_1 (R×M=0.21, 100%)

track_0 scores 0% while track_1 scores 100%.

2Bees detail

5Mice - track_0 (R×M=0.84, 100%)

Strongest velocity signal. All 5 individuals achieve 100%.

5Mice detail

2Mice - track_0 (R×M=0.04, 100%)

2Mice detail

2Flies - track_0 (R×M=0.02, 100%)

2Flies detail


Usage

python compute_polarization_AP_inference.py
# Output structure:
# datasets/multi-animal/exports/AP-inference-demo/
# ├── h5/
# │   └── ap_validation_20260402_053520.h5
# ├── figures/
# │   ├── ap_validation_results_20260402_053520.svg
# │   ├── free-moving-2mice..._track_0_20260402_053520.svg
# │   └── ...
# └── logs/
#     └── ap_validation_20260402_053506.log

Full validation log

Key configuration:

GROUND_TRUTH = {          # hand-curated GT rankings per dataset
    "free-moving-2flies...": {0: 3, 1: 2, 2: 1},  # node_idx: rank
    ...                   # higher rank = more anterior
}

How has this PR been tested?

Tested on 5 datasets (4 species, 5–21 keypoints, 2–5 individuals per file), ~1–2 minute recordings at 25 fps.

Pass 2: Cross-individual agreement ranges from 50% (4Gerbils: 2/4) to 100% (2Flies, 2Bees, 2Mice: 2/2). Minor run-to-run variation for borderline individuals (e.g. 5Mice: 3–4/5, likely k-medoids non-determinism).

Pass 3: Best individual (max R×M) achieves 100% in every file. Sub-100% individuals reveal two independent failure factors:

| Dataset | Individual | Accuracy | R | M | R×M |
|---|---|---|---|---|---|
| 4Gerbils | female | 100% | 0.05 | 0.08 | 0.00 |
| 4Gerbils | male | 73.3% | 0.17 | 0.10 | 0.02 |
| 4Gerbils | pup shaved | 73.3% | 0.30 | 0.06 | 0.02 |
| 4Gerbils | pup unshaved | 100% | 0.39 | 0.62 | 0.24 |
| 2Bees | track_0 | 0% | 0.07 | 0.06 | 0.00 |
| 2Bees | track_1 | 100% | 0.41 | 0.50 | 0.21 |

Factor 1 - Skeleton geometry. Some GT node pairs don't separate cleanly along PC1 for a given individual's body shape. The 4Gerbils male and pup shaved each score 73.3% (11/15 pairs correct: the sign is right, but 4 pairs are geometrically scrambled). The female scores 100% despite R×M ≈ 0: her geometry is clean, and her weak velocity vote still landed on the correct sign. This is a fundamental limitation: no amount of better velocity data can fix pairs that don't separate along the principal axis.
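The 73.3% figure follows directly from the pair count. Since Pass 3 tests all C(n,2) GT pairs, 15 pairs implies 6 GT nodes for 4Gerbils. That node count is inferred here from the arithmetic, not stated in the PR.

```python
from math import comb

n_gt_nodes = 6  # assumption: inferred from the 15 pairs, since C(6, 2) = 15
n_pairs = comb(n_gt_nodes, 2)
n_correct = 11  # pairs reported as concordant for male and pup shaved

accuracy = n_correct / n_pairs
print(n_pairs, f"{100 * accuracy:.1f}%")  # 15 73.3%
```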

Factor 2 - Velocity vote correctness. A wrong anterior_sign reverses every pairwise comparison, so an individual whose geometry would give 100% scores 0% instead. The 2Bees track_0 (0%, R×M ≈ 0) shows this failure mode.

R×M predicts Factor 2 only. It selects the individual most likely to have a correct vote, but is blind to Factor 1. In every file, the max-R×M individual achieves 100%.

Next steps:

  • Longer videos would provide more high-motion frames and a more demanding test (the current recordings are only ~1–2 min long).
  • A Factor 1/2 decomposition (concordance on unsigned PC1, take max of both signs) would distinguish geometry failures from sign-flip failures programmatically. This could feed into Tile 4 as distinct marker shapes - making it visually clear why two individuals at similar R×M can have different concordance.
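One possible shape for the proposed Factor 1/2 decomposition is sketched below. It is not implemented in this PR; `concordance`, `decompose`, and the toy projections are hypothetical. The idea is to score both signs of PC1, treat the better one as the geometry ceiling (Factor 1), and check whether the inferred sign reaches it (Factor 2).

```python
from itertools import combinations

import numpy as np


def concordance(proj, gt_rank):
    """Fraction of GT pairs ordered along proj the same way as in gt_rank."""
    pairs = list(combinations(gt_rank, 2))
    hits = sum(
        bool(proj[i] > proj[j]) == (gt_rank[i] > gt_rank[j]) for i, j in pairs
    )
    return hits / len(pairs)


def decompose(proj_pc1, anterior_sign, gt_rank):
    """Separate the geometry ceiling (Factor 1) from sign correctness (Factor 2)."""
    plus = concordance(proj_pc1, gt_rank)
    minus = concordance(-proj_pc1, gt_rank)
    ceiling = max(plus, minus)  # best score reachable with either sign
    observed = plus if anterior_sign > 0 else minus
    return {
        "observed": observed,
        "geometry_ceiling": ceiling,
        "sign_correct": observed == ceiling,
    }


# Toy case: nodes 1 and 2 are swapped along PC1, so one pair is scrambled.
gt_rank = {0: 4, 1: 3, 2: 2, 3: 1}
proj_pc1 = np.array([3.0, 1.0, 2.0, 0.0])

geometry_limited = decompose(proj_pc1, +1, gt_rank)  # right sign, 5/6 pairs
sign_flipped = decompose(proj_pc1, -1, gt_rank)      # wrong sign, 1/6 pairs
print(geometry_limited)
print(sign_flipped)
```

When both signs score the same (for example 50% each), `sign_correct` is trivially true, so a real implementation would likely need to flag that case separately.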

Summary. On these 5 short datasets, R×M selection finds an individual with 100% GT concordance in every file. Sub-100% cases follow two patterns - geometry-limited (73.3%, sign correct but some node pairs don't separate along PC1) and sign-flip (0%, insufficient velocity signal). The two-factor interpretation is consistent with the data but has not been verified programmatically. The validation script makes both the successes and the failure modes empirically visible.

References

Is this a breaking change?

No - standalone script, not a library API change.

Does this PR require an update to the documentation?

No - self-contained with inline documentation and console logs.

Checklist

  • Code tested locally
  • Empirical validation against 5-dataset (4-species) ground truth
  • Formatted with pre-commit

khan-u force-pushed the body-axis-ap-inference branch from 01d16a8 to 51866c9 on April 4, 2026 05:21
khan-u force-pushed the body-axis-ap-inference branch 2 times, most recently from 358e817 to b1df3b9 on April 5, 2026 09:55
khan-u closed this on Apr 5, 2026
khan-u deleted the ap-inference-validation-demo branch on April 5, 2026 20:17