docs(examples): add AP inference validation demo #2
Closed
khan-u wants to merge 6 commits into
Description
AP Inference Validation Script
What is this PR
Why is this PR needed?
neuroinformatics-unit#945 introduced the AP inference pipeline (_validate_ap) inside collective.py. This PR adds a companion script that provides empirical evidence for when the prior-free inference succeeds and where it has limitations, tested against hand-curated ground truth across 5 datasets (4 species).
What does this PR do?
compute_polarization_AP_inference.py runs _validate_ap on 5 multi-animal SLEAP datasets (2 flies, 2 mice, 4 gerbils, 5 mice, 2 bees), compares inferred AP ordering against ground truth, stores results in HDF5, and generates figures.
Workflow
The script operates in three passes:
Pass 1 - R×M Selection. Run _validate_ap once per individual per file. R×M is independent of the input node pair (it uses all-keypoint bbox centroids, centered skeletons, SVD, and velocity projections). Select the best individual per file by max R×M.
Pass 2 - Cross-Individual Ordering Consistency. Before this pass runs, compute_pc1_orderings projects every individual's GT nodes onto that individual's raw PC1 vector (without anterior_sign correction) and ranks them by descending projection. These raw orderings are then compared against the best individual's ordering (strict list equality). This is a consistency check, not a correctness check: it confirms body shape stability across individuals but says nothing about whether the ordering is anatomically correct.
Pass 3 - Inferred AP Concordance. For each individual, project GT nodes onto the inferred AP axis (anterior_sign × PC1). Test all C(n,2) GT node pairs: a pair is concordant if its relative ordering matches GT. This tests the full pipeline per individual.
Reporting. Read back the H5 file to log the 3-step filter cascade progression (nodes/pairs surviving each step), GT coverage (how many GT nodes survive the lateral filter), and suggested pair analysis (which pair the cascade auto-selects). Since the GT dict contains only a small manually curated subset of nodes, the pipeline often selects nodes outside it. This is expected, not a failure. Cascade stats feed Figure 2.
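The projection logic behind Pass 2 and Pass 3 can be sketched in plain Python. This is a minimal illustration, not the script's code: the function names raw_pc1_ordering and pairwise_concordance, the 2D coordinates, and the node names are hypothetical, and PC1 and anterior_sign are taken as given rather than computed by _validate_ap.

```python
from itertools import combinations


def project(point, axis):
    """Scalar projection of a 2D point onto an axis vector (dot product)."""
    return point[0] * axis[0] + point[1] * axis[1]


def raw_pc1_ordering(gt_coords, pc1):
    """Pass 2 style: rank GT nodes by descending projection onto raw PC1."""
    return sorted(
        gt_coords, key=lambda node: project(gt_coords[node], pc1), reverse=True
    )


def pairwise_concordance(gt_coords, pc1, anterior_sign, gt_order):
    """Pass 3 style: fraction of C(n,2) GT pairs ordered as in the ground truth.

    gt_order lists nodes from most anterior to most posterior.
    """
    axis = (anterior_sign * pc1[0], anterior_sign * pc1[1])
    pairs = list(combinations(gt_order, 2))  # (more anterior, more posterior)
    concordant = sum(
        project(gt_coords[a], axis) > project(gt_coords[b], axis) for a, b in pairs
    )
    return concordant / len(pairs)


# Hypothetical centered skeletons for two individuals, bodies lying along x.
coords_a = {"nose": (2.0, 0.1), "thorax": (0.0, 0.0), "tail": (-2.0, -0.1)}
coords_b = {"nose": (1.8, 0.0), "thorax": (0.1, 0.1), "tail": (-1.9, 0.0)}
gt_order = ["nose", "thorax", "tail"]
pc1 = (-1.0, 0.0)  # hypothetical raw PC1 that happens to point toward the tail

# Pass 2: raw orderings compared by strict list equality (no sign correction).
ordering_a = raw_pc1_ordering(coords_a, pc1)
ordering_b = raw_pc1_ordering(coords_b, pc1)
consistent = ordering_a == ordering_b

# Pass 3: concordance depends on anterior_sign being correct.
score_correct_sign = pairwise_concordance(coords_a, pc1, -1, gt_order)
score_wrong_sign = pairwise_concordance(coords_a, pc1, +1, gt_order)
```

In this toy case the raw orderings agree even though both run tail-first, which is why Pass 2 is a consistency check only; the correct sign gives a concordance of 1.0 and the wrong sign gives 0.0.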
All results are saved to a timestamped HDF5 file with per-individual nested groups for skeletons, PC1 vectors, velocity projections, concordance, and R×M values.
Figures
Cross-Dataset Comparison
Top row: Average skeletons (best individual per file) with PC1 axis, AP midline, and suggested pair arrow. Arrow color indicates GT membership. Shows whether the pipeline's auto-selected pair makes anatomical sense for each skeleton topology.
Bottom-left: GT node rankings per dataset - the reference standard against which Pass 3 concordance is measured.
Bottom-right: 3-step filter cascade - candidate pair counts at each step, monotonically decreasing. Reveals where each skeleton topology loses candidates (laterally spread skeletons lose at Step 1, one-sided skeletons at Step 2).
Per-File Detail (Best R×M Individual)
A 2×2 detail view for each dataset's best individual. Each tile shows the same average centered skeleton; what differs is the geometric annotation overlaid.
anterior_sign - the Factor 2 diagnostic.
4Gerbils - pup unshaved (R×M=0.24, 100%). Most informative: 4 individuals with mixed accuracy (100%, 73.3%, 73.3%, 100%).
2Bees - track_1 (R×M=0.21, 100%)
track_0 scores 0% while track_1 scores 100%.
5Mice - track_0 (R×M=0.84, 100%)
Strongest velocity signal. All 5 individuals achieve 100%.
2Mice - track_0 (R×M=0.04, 100%)
2Flies - track_0 (R×M=0.02, 100%)
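The "best R×M individual" shown in each detail view comes from the Pass 1 selection, which amounts to an argmax over per-individual R×M values. A minimal sketch, assuming the values are held in a plain dict; select_best_individual is a hypothetical helper name, and the 0.0 for 2Bees track_0 stands in for the "R×M ≈ 0" reported in this description rather than a measured number.

```python
def select_best_individual(rxm_by_individual):
    """Return the individual whose R×M value is largest."""
    return max(rxm_by_individual, key=rxm_by_individual.get)


# 2Bees: track_1 is reported at R×M=0.21; track_0 only as "≈ 0",
# so 0.0 is a placeholder, not the script's output.
rxm_2bees = {"track_0": 0.0, "track_1": 0.21}
best = select_best_individual(rxm_2bees)
```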
Usage
Full validation log
Key configuration:
How has this PR been tested?
Tested on 5 datasets (4 species, 5–21 keypoints, 2–5 individuals per file), ~1–2 minute recordings at 25 fps.
Pass 2: Cross-individual agreement ranges from 50% (4Gerbils: 2/4) to 100% (2Flies, 2Bees, 2Mice: 2/2). Minor run-to-run variation for borderline individuals (e.g. 5Mice: 3–4/5, likely k-medoids non-determinism).
Pass 3: Best individual (max R×M) achieves 100% in every file. Sub-100% individuals reveal two independent failure factors:
Factor 1 - Skeleton geometry. Some GT node pairs don't separate cleanly along PC1 for a given individual's body shape. In 4Gerbils, the male and pup shaved individuals each score 73.3% (11/15 pairs correct: the sign is right, but 4 pairs are geometrically scrambled). The female scores 100% despite R×M ≈ 0 because her geometry happens to be clean. This is a fundamental limitation: better velocity data cannot fix pairs that don't separate along the principal axis.
Factor 2 - Velocity vote correctness. A wrong anterior_sign flips all pairs (100% → 0%). The 2Bees track_0 (0%, R×M ≈ 0) is this failure mode.
R×M predicts Factor 2 only. It selects the individual most likely to have a correct vote, but is blind to Factor 1. In every file, the max-R×M individual achieves 100%.
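The arithmetic behind the two factors can be checked with a short sketch. A denominator of 15 pairs corresponds to 6 GT nodes, since C(6,2) = 15. The projection values below are hypothetical and chosen only to reproduce the reported counts; concordance_from_projections is an illustrative helper, not a function from the script.

```python
from itertools import combinations


def concordance_from_projections(projections):
    """Count pairs (i < j) where the GT-more-anterior node i projects higher.

    projections: values along the inferred AP axis, listed in GT order
    (most anterior first). Returns (concordant_pairs, total_pairs).
    """
    pairs = list(combinations(range(len(projections)), 2))
    concordant = sum(projections[i] > projections[j] for i, j in pairs)
    return concordant, len(pairs)


clean = [6, 5, 4, 3, 2, 1]      # hypothetical: every pair separates along PC1
scrambled = [3, 6, 5, 4, 1, 2]  # hypothetical: 4 of the 15 pairs out of order

clean_hits, n_pairs = concordance_from_projections(clean)
scrambled_hits, _ = concordance_from_projections(scrambled)

# Factor 2: a wrong anterior_sign negates every projection.
flipped_hits, _ = concordance_from_projections([-p for p in clean])

scrambled_pct = round(100 * scrambled_hits / n_pairs, 1)
```

With these inputs the clean case gives 15/15, the scrambled case 11/15 (73.3%), and the sign-flipped clean case 0/15. Absent ties, a sign flip turns c concordant pairs into 15 - c, so the two factors act independently of each other.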
Next steps:
Summary. On these 5 short datasets, R×M selection finds an individual with 100% GT concordance in every file. Sub-100% cases follow two patterns - geometry-limited (73.3%, sign correct but some node pairs don't separate along PC1) and sign-flip (0%, insufficient velocity signal). The two-factor interpretation is consistent with the data but has not been verified programmatically. The validation script makes both the successes and the failure modes empirically visible.
References
Is this a breaking change?
No - standalone script, not a library API change.
Does this PR require an update to the documentation?
No - self-contained with inline documentation and console logs.
Checklist