docs(examples): add AP inference validation demo by khan-u · Pull Request #2 · khan-u/movement

khan-u · 2026-04-02T13:39:20Z

Description

AP Inference Validation Script

What is this PR

Bug fix
Addition of a new feature
Other

Why is this PR needed?

neuroinformatics-unit#945 introduced the AP inference pipeline (_validate_ap) inside collective.py. This PR adds a companion script that provides empirical evidence for when the prior-free inference succeeds and where it has limitations, tested against hand-curated ground truth across 5 datasets (4 species).

What does this PR do?

compute_polarization_AP_inference.py - runs _validate_ap on 5 multi-animal SLEAP datasets (2 flies, 2 mice, 4 gerbils, 5 mice, 2 bees), compares inferred AP ordering against ground truth, stores results in HDF5, and generates figures.

Workflow

The script operates in three passes:

Pass 1 - R×M Selection. Run _validate_ap once per individual per file. R×M is independent of the input node pair (it uses all-keypoint bbox centroids, centered skeletons, SVD, and velocity projections). Select the best individual per file by max R×M.
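
The selection rule in Pass 1 reduces to an argmax over per-individual scores. The following is a minimal sketch, assuming R×M is the plain product of the R and M columns reported in the results table of this PR. The helper name `select_best_individual` is hypothetical and not part of the script.

```python
# Hypothetical helper: pick the individual with the largest R×M in one file.
# Assumption: R×M is the product of the reported R and M values.


def select_best_individual(scores):
    """scores maps individual name -> (R, M); returns (name, R*M) of the max."""
    products = {name: r * m for name, (r, m) in scores.items()}
    best = max(products, key=products.get)
    return best, products[best]


# R and M values copied from the 4Gerbils rows of the results table.
gerbils = {
    "female": (0.05, 0.08),
    "male": (0.17, 0.10),
    "pup shaved": (0.30, 0.06),
    "pup unshaved": (0.39, 0.62),
}

best_name, best_rxm = select_best_individual(gerbils)
print(best_name, round(best_rxm, 2))  # pup unshaved 0.24
```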

Pass 2 - Cross-Individual Ordering Consistency. Before this pass runs, compute_pc1_orderings projects every individual's GT nodes onto that individual's raw PC1 vector (without anterior_sign correction) and ranks them by descending projection. These raw orderings are then compared against the best individual's ordering (strict list equality). This is a consistency check, not a correctness check - it confirms body shape stability across individuals but says nothing about whether the ordering is anatomically correct.
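
The ordering check in Pass 2 can be sketched as a projection followed by a sort and a list comparison. This is an illustration only: the function `pc1_ordering`, the dict-based skeleton layout, and the 3-node coordinates are assumptions made for the example, not the actual `compute_pc1_orderings` implementation or dataset values.

```python
# Hypothetical sketch of the Pass 2 check: rank GT nodes by their projection
# onto an individual's raw PC1 (no anterior_sign correction), then compare
# orderings between individuals with strict list equality.


def pc1_ordering(skeleton, pc1, gt_nodes):
    """Return GT node indices sorted by descending projection onto pc1.

    skeleton: dict node_idx -> (x, y) centered coordinates (assumed layout).
    pc1: (x, y) unit vector.
    """

    def projection(node):
        x, y = skeleton[node]
        return x * pc1[0] + y * pc1[1]

    return sorted(gt_nodes, key=projection, reverse=True)


# Synthetic 3-node skeletons, not taken from the datasets.
best = {0: (2.0, 0.1), 1: (0.0, 0.0), 2: (-2.0, -0.1)}
other = {0: (1.8, 0.3), 1: (0.1, -0.1), 2: (-1.9, 0.0)}

ref = pc1_ordering(best, (1.0, 0.0), [0, 1, 2])
same = pc1_ordering(other, (1.0, 0.0), [0, 1, 2])
rev = pc1_ordering(other, (-1.0, 0.0), [0, 1, 2])  # PC1 pointing the other way

print(ref == same)  # True: orderings agree
print(ref == rev)   # False: a flipped raw PC1 reverses the ordering
```

Because the comparison uses the raw PC1, a sign difference alone is enough to break strict equality in this sketch, which is one reason the check speaks to consistency rather than anatomical correctness.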

Pass 3 - Inferred AP Concordance. For each individual, project GT nodes onto the inferred AP axis (anterior_sign × PC1). Test all C(n,2) GT node pairs: concordant if relative ordering matches GT. This tests the full pipeline per individual.
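
The pairwise test in Pass 3 can be sketched as follows. The function name `concordance`, the strict `>` comparison (so ties count as discordant unless GT also ties), and the projection values are assumptions for illustration. The rank convention (higher rank = more anterior) follows the GROUND_TRUTH configuration in this PR.

```python
from itertools import combinations


def concordance(gt_rank, projections):
    """Fraction of GT node pairs whose order along the inferred axis matches GT.

    gt_rank: node_idx -> rank (higher = more anterior).
    projections: node_idx -> projection onto anterior_sign * PC1.
    """
    pairs = list(combinations(sorted(gt_rank), 2))  # all C(n, 2) pairs
    hits = 0
    for a, b in pairs:
        gt_says_a_front = gt_rank[a] > gt_rank[b]
        axis_says_a_front = projections[a] > projections[b]
        hits += gt_says_a_front == axis_says_a_front
    return hits / len(pairs)


gt = {0: 3, 1: 2, 2: 1}             # same shape as a GROUND_TRUTH entry
proj = {0: 1.5, 1: 0.2, 2: -1.7}    # synthetic projections
flipped = {node: -p for node, p in proj.items()}

print(concordance(gt, proj))     # 1.0
print(concordance(gt, flipped))  # 0.0
```

Negating every projection, which is what a wrong anterior_sign does, turns every concordant pair into a discordant one.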

Reporting. Read back the H5 file to log the 3-step filter cascade progression (nodes/pairs surviving each step), GT coverage (how many GT nodes survive the lateral filter), and suggested pair analysis (which pair the cascade auto-selects). Since the GT dict contains only a small manually curated subset of nodes, the pipeline often selects nodes outside it - this is expected, not a failure. Cascade stats feed Figure 2.

All results are saved to a timestamped HDF5 file with per-individual nested groups for skeletons, PC1 vectors, velocity projections, concordance, and R×M values.

Figures

Cross-Dataset Comparison

Top row: Average skeletons (best individual per file) with PC1 axis, AP midline, and suggested pair arrow. Arrow color indicates GT membership. Shows whether the pipeline's auto-selected pair makes anatomical sense for each skeleton topology.
Bottom-left: GT node rankings per dataset - the reference standard against which Pass 3 concordance is measured.
Bottom-right: 3-step filter cascade - candidate pair counts at each step, monotonically decreasing. Reveals where each skeleton topology loses candidates (laterally spread skeletons lose at Step 1, one-sided skeletons at Step 2).

Per-File Detail (Best R×M Individual)

A 2×2 detail view for each dataset's best individual. Each tile shows the same average centered skeleton; what differs is the geometric annotation overlaid.

Tile 1 - Longitudinal Spread (PC1 Projections): Each keypoint's projection onto PC1. Shows whether GT nodes are well-separated along the body axis - poor separation here is the root cause of Factor 1 failures.
Tile 2 - Lateral Spread (PC2 Projections): Each keypoint's projection onto PC2. Shows which nodes sit far off-axis and will be removed by Step 1's lateral filter.
Tile 3 - Inferred AP Direction (Velocity Voting): Velocity-projection histogram (blue for +PC1, red for -PC1) with circular mean arrow V_c and bidirectional A↔P arrow. Shows the strength and direction of the velocity signal driving anterior_sign - the Factor 2 diagnostic.
Tile 4 - R×M vs Inferred AP Concordance: Scatter of R×M (x) against GT concordance (y) across all individuals in the file. The key diagnostic: shows whether R×M predicts concordance and exposes individuals where it doesn't (Factor 1 failures). Best individual marked with ★.

4Gerbils - pup unshaved (R×M=0.24, 100%). Most informative: 4 individuals with mixed accuracy (100%, 73.3%, 73.3%, 100%).

2Bees - track_1 (R×M=0.21, 100%)

track_0 scores 0% while track_1 scores 100%.

5Mice - track_0 (R×M=0.84, 100%)

Strongest velocity signal. All 5 individuals achieve 100%.

2Mice - track_0 (R×M=0.04, 100%)

2Flies - track_0 (R×M=0.02, 100%)

Usage

python compute_polarization_AP_inference.py

# Output structure:
# datasets/multi-animal/exports/AP-inference-demo/
# ├── h5/
# │   └── ap_validation_20260402_053520.h5
# ├── figures/
# │   ├── ap_validation_results_20260402_053520.svg
# │   ├── free-moving-2mice..._track_0_20260402_053520.svg
# │   └── ...
# └── logs/
#     └── ap_validation_20260402_053506.log

Key configuration:

GROUND_TRUTH = {          # hand-curated GT rankings per dataset
    "free-moving-2flies...": {0: 3, 1: 2, 2: 1},  # node_idx: rank
    ...                   # higher rank = more anterior
}
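
To make the rank convention concrete, the sketch below turns one GT entry into an anterior-to-posterior node order and counts the pairs that Pass 3 would test. The helper `anterior_to_posterior` is hypothetical, and the requirement that ranks be unique within a dataset is an assumption that the PR does not state.

```python
from math import comb


def anterior_to_posterior(ranks):
    """Return node indices ordered from most anterior to most posterior."""
    # Assumption: ranks are unique within a dataset.
    if len(set(ranks.values())) != len(ranks):
        raise ValueError("GT ranks must be unique")
    return sorted(ranks, key=ranks.get, reverse=True)


example = {0: 3, 1: 2, 2: 1}  # node_idx: rank, higher rank = more anterior
order = anterior_to_posterior(example)
n_pairs = comb(len(example), 2)

print(order, n_pairs)  # [0, 1, 2] 3
```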

How has this PR been tested?

Tested on 5 datasets (4 species, 5–21 keypoints, 2–5 individuals per file), ~1–2 minute recordings at 25 fps.

Pass 2: Cross-individual agreement ranges from 50% (4Gerbils: 2/4) to 100% (2Flies, 2Bees, 2Mice: 2/2). Minor run-to-run variation for borderline individuals (e.g. 5Mice: 3–4/5, likely k-medoids non-determinism).

Pass 3: Best individual (max R×M) achieves 100% in every file. Sub-100% individuals reveal two independent failure factors:

Dataset	Individual	Accuracy	R	M	R×M
4Gerbils	female	100%	0.05	0.08	0.00
4Gerbils	male	73.3%	0.17	0.10	0.02
4Gerbils	pup shaved	73.3%	0.30	0.06	0.02
4Gerbils	pup unshaved	100%	0.39	0.62	0.24
2Bees	track_0	0%	0.07	0.06	0.00
2Bees	track_1	100%	0.41	0.50	0.21

Factor 1 - Skeleton geometry. Some GT node pairs don't separate cleanly along PC1 for a given individual's body shape. The 4Gerbils male and pup shaved each score 73.3% (11/15 pairs correct: the sign is right, but 4 pairs are geometrically scrambled). The female scores 100% despite R×M ≈ 0: her geometry is clean and her anterior_sign, although weakly supported, happens to be correct. This is a fundamental limitation: no amount of better velocity data can fix pairs that don't separate along the principal axis.

Factor 2 - Velocity vote correctness. A wrong anterior_sign flips all pairs (100% → 0%). The 2Bees track_0 (0%, R×M ≈ 0) is this failure mode.

R×M predicts Factor 2 only. It selects the individual most likely to have a correct vote, but is blind to Factor 1. In every file, the max-R×M individual achieves 100%.

Next steps:

Longer videos would provide more high-motion frames and a more demanding test; the current test recordings are only ~1–2 min long.
A Factor 1/2 decomposition (concordance on unsigned PC1, take max of both signs) would distinguish geometry failures from sign-flip failures programmatically. This could feed into Tile 4 as distinct marker shapes - making it visually clear why two individuals at similar R×M can have different concordance.
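
The proposed decomposition could look like the sketch below. It is an illustration under stated assumptions, not an implemented part of the script: the function names, the labels, and the 6-node projections are all made up for the example. The synthetic data is chosen so that 11 of 15 pairs are concordant, matching the 73.3% pattern reported above, but it is not the gerbil data.

```python
from itertools import combinations


def concordance(gt_rank, projections, sign=1):
    """Fraction of GT pairs ordered correctly along sign * PC1."""
    pairs = list(combinations(sorted(gt_rank), 2))
    hits = sum(
        (gt_rank[a] > gt_rank[b])
        == (sign * projections[a] > sign * projections[b])
        for a, b in pairs
    )
    return hits / len(pairs)


def decompose(gt_rank, raw_projections, anterior_sign):
    """Return (signed, unsigned, label) for one individual."""
    signed = concordance(gt_rank, raw_projections, anterior_sign)
    # Unsigned concordance: best achievable with either sign of PC1.
    unsigned = max(concordance(gt_rank, raw_projections, s) for s in (1, -1))
    if signed == 1.0:
        label = "ok"
    elif unsigned == 1.0:
        label = "sign-flip"   # geometry is clean, the vote picked the wrong sign
    elif signed == unsigned:
        label = "geometry"    # sign is right, some pairs do not separate
    else:
        label = "both"
    return signed, unsigned, label


gt = {0: 6, 1: 5, 2: 4, 3: 3, 4: 2, 5: 1}  # higher rank = more anterior

# Synthetic raw-PC1 projections with 4 swapped pairs out of 15.
scrambled = {0: 3.0, 1: 4.0, 2: 5.0, 3: 1.0, 4: 2.0, 5: 0.0}
clean = {0: 5.0, 1: 4.0, 2: 3.0, 3: 2.0, 4: 1.0, 5: 0.0}

geom = decompose(gt, scrambled, anterior_sign=1)
flip = decompose(gt, clean, anterior_sign=-1)

print(round(geom[0] * 100, 1), geom[2])  # 73.3 geometry
print(round(flip[0] * 100, 1), flip[2])  # 0.0 sign-flip
```

With strict comparisons and no tied projections, the concordance under the opposite sign is one minus the concordance under the chosen sign, so the unsigned score isolates what geometry alone allows.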

Summary. On these 5 short datasets, R×M selection finds an individual with 100% GT concordance in every file. Sub-100% cases follow two patterns - geometry-limited (73.3%, sign correct but some node pairs don't separate along PC1) and sign-flip (0%, insufficient velocity signal). The two-factor interpretation is consistent with the data but has not been verified programmatically. The validation script makes both the successes and the failure modes empirically visible.

References

Is this a breaking change?

No - standalone script, not a library API change.

Does this PR require an update to the documentation?

No - self-contained with inline documentation and console logs.

Checklist

Code tested locally
Empirical validation against 5-dataset (4-species) ground truth
Formatted with pre-commit


khan-u and others added 4 commits April 2, 2026 03:01

feat(collective): add prior-free body-axis inference

01d16a8

docs(examples): add AP inference validation demo

2ca0464

[pre-commit.ci] auto fixes from pre-commit.com hooks

8f8d4d0

for more information, see https://pre-commit.ci

docs(examples): add AP inference validation demo

6b2f161

khan-u force-pushed the body-axis-ap-inference branch from 01d16a8 to 51866c9 on April 4, 2026 05:21

khan-u and others added 2 commits April 4, 2026 02:05

fix(examples): update demo post-refactor of AP validation to body_axis.py

8ae0623

[pre-commit.ci] auto fixes from pre-commit.com hooks

0177432

for more information, see https://pre-commit.ci

khan-u force-pushed the body-axis-ap-inference branch 2 times, most recently from 358e817 to b1df3b9 on April 5, 2026 09:55

khan-u closed this Apr 5, 2026

khan-u deleted the ap-inference-validation-demo branch April 5, 2026 20:17


khan-u wants to merge 6 commits into body-axis-ap-inference from ap-inference-validation-demo
