tom21100227
Contributor

When running inference with a bottom-up model, the Hungarian step in candidate matching was pulling scalars off-device inside the loop. This PR copies the whole tensor to the host once per sample instead.

Changes

Move the tensor host transfer for PAF candidate matching to a single .detach().cpu() per sample inside match_candidates_sample so SciPy’s Hungarian solver stops triggering per-element device syncs. This eliminates the millions of tiny kernel launches / memcpys that were gating throughput, while keeping outputs identical.

The offending code, in sleap_nn/inference/paf_grouping.py:L570:

                if mask.any():
                    cost_matrix[i, j] = -line_scores_k[
                        mask
                    ].item() # <- very expensive: copies a single scalar from a GPU tensor to host memory
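The difference between the two access patterns can be sketched as follows. This is a simplified illustration, not the actual `match_candidates_sample` code: the function names, the flat `src`/`dst` index layout, and the zero default cost are assumptions made for the example. It assumes each (source, destination) pair appears at most once, and it falls back to the CPU when no GPU is present.

```python
import numpy as np
import torch
from scipy.optimize import linear_sum_assignment


def cost_matrix_per_element(scores, src, dst, n_src, n_dst):
    """Slow pattern: every cell reads from the (possibly GPU) tensor separately."""
    cost = np.zeros((n_src, n_dst), dtype=np.float32)  # zero default is an assumption
    for i in range(n_src):
        for j in range(n_dst):
            mask = (src == i) & (dst == j)  # launches small kernels on the device
            if mask.any():  # bool() on a device tensor forces a sync
                cost[i, j] = -scores[mask].item()  # one scalar copied per cell
    return cost


def cost_matrix_bulk(scores, src, dst, n_src, n_dst):
    """Fast pattern: move each tensor to the host once, then index in NumPy."""
    scores_np = scores.detach().cpu().numpy()
    src_np = src.detach().cpu().numpy()
    dst_np = dst.detach().cpu().numpy()
    cost = np.zeros((n_src, n_dst), dtype=np.float32)
    cost[src_np, dst_np] = -scores_np  # vectorized fill, no device round-trips
    return cost


device = "cuda" if torch.cuda.is_available() else "cpu"
src = torch.tensor([0, 0, 1, 2], device=device)
dst = torch.tensor([0, 1, 1, 0], device=device)
scores = torch.tensor([0.9, 0.2, 0.8, 0.4], device=device)

slow = cost_matrix_per_element(scores, src, dst, n_src=3, n_dst=2)
fast = cost_matrix_bulk(scores, src, dst, n_src=3, n_dst=2)

# SciPy's Hungarian solver runs on the host either way.
rows, cols = linear_sum_assignment(fast)
print(np.array_equal(slow, fast), rows.tolist(), cols.tolist())
```

Both functions build the same cost matrix. Only the number of device-to-host transfers differs, which is why the predictions can stay identical while the call counts drop.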

Benchmark Results

  • Dataset: 30,002-frame 1024x1080 video at 50fps with bottom-up model (batch_size=8, queue_maxsize=16, tracking=True).
  • Wall-clock (progress bar): 22m 40s → 15m 30s (~32% less wall-clock time, ~1.46× throughput).
  • CUDA API summary (Nsight Systems):
    • cudaLaunchKernel: 35,153,678 → 7,500,304 calls.
    • cudaMemcpyAsync: 11,544,619 → 1,132,888 calls.
  • Predictions checked against the baseline .slp – identical.
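As a quick cross-check, the relative changes can be derived from the figures listed above. The snippet uses only the numbers quoted in this PR; the helper name `reduction_pct` is made up for the example. A quoted percentage depends on whether it measures time saved or throughput gained, so both are computed.

```python
def reduction_pct(before, after):
    """Percentage by which a quantity dropped relative to its 'before' value."""
    return 100.0 * (before - after) / before


frames = 30_002
before_s = 22 * 60 + 40  # 22m 40s wall-clock before the patch
after_s = 15 * 60 + 30   # 15m 30s wall-clock after the patch

time_saved = reduction_pct(before_s, after_s)  # share of wall-clock time removed
speedup = before_s / after_s                   # throughput ratio (after vs. before)
fps_before = frames / before_s
fps_after = frames / after_s

# CUDA API call counts from the Nsight Systems summaries.
launch_cut = reduction_pct(35_153_678, 7_500_304)  # cudaLaunchKernel
memcpy_cut = reduction_pct(11_544_619, 1_132_888)  # cudaMemcpyAsync
sync_cut = reduction_pct(11_435_632, 1_024_127)    # cudaStreamSynchronize

print(f"wall-clock reduction: {time_saved:.1f}%  (speed-up {speedup:.2f}x)")
print(f"throughput: {fps_before:.1f} -> {fps_after:.1f} frames/s")
print(f"cudaLaunchKernel calls: -{launch_cut:.1f}%")
print(f"cudaMemcpyAsync calls: -{memcpy_cut:.1f}%")
print(f"cudaStreamSynchronize calls: -{sync_cut:.1f}%")
```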

Before the patch:

Profiling summary
-----------------
frame_buffer.get: count=30002 avg=0.016 ms p95=0.025 ms
BatchPrepare: count=3751 avg=19.182 ms p95=22.216 ms
BatchAssemble: count=3751 avg=6.347 ms p95=6.949 ms
Postprocess: count=3751 avg=80.151 ms p95=93.482 ms
BottomUpInferenceModel.forward: count=3751 avg=238.494 ms p95=262.925 ms
PAFScorer.predict: count=3751 avg=148.015 ms p95=172.575 ms
Tracker.track: count=30002 avg=7.434 ms p95=10.082 ms
Predicted 30002 labeled frames

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls                 Name               
 --------  ---------------  ----------  ---------------------------------
     55.6  335,929,996,984  11,435,632  cudaStreamSynchronize            
     31.3  189,218,585,069  35,153,678  cudaLaunchKernel                 
     12.8   77,275,524,738  11,544,619  cudaMemcpyAsync   

After the patch:

Profiling summary
-----------------
frame_buffer.get: count=30002 avg=0.016 ms p95=0.025 ms
BatchPrepare: count=3751 avg=18.768 ms p95=21.310 ms
BatchAssemble: count=3751 avg=6.178 ms p95=6.718 ms
Postprocess: count=3751 avg=81.858 ms p95=95.345 ms
BottomUpInferenceModel.forward: count=3751 avg=159.312 ms p95=166.948 ms <--
PAFScorer.predict: count=3751 avg=68.625 ms p95=76.426 ms <--
Tracker.track: count=30002 avg=7.728 ms p95=10.495 ms
Predicted 30002 labeled frames

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls                Name               
 --------  ---------------  ---------  ---------------------------------
     80.5  286,479,277,968  1,024,127  cudaStreamSynchronize            
     11.9   42,309,350,611  7,500,304  cudaLaunchKernel                 
      7.3   25,841,925,180  1,132,888  cudaMemcpyAsync                  

Attached is the test script used to generate all of the results above; it tags functions for the NVIDIA Nsight profiler: test.py

To reproduce the benchmark:

apt update
apt install nsight-systems-2025.3.2
nsys profile -o sleap_inference python nsight_inference_example.py --model <path> --video data/subset.mp4 --batch-size 8 --queue-maxsize 16 --tracking

To view the profiling results:

nsys stats sleap_inference.nsys-rep

codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 80.95238% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.38%. Comparing base (ff91433) to head (968aa52).
⚠️ Report is 46 commits behind head on main.

Files with missing lines          Patch %  Lines
--------------------------------  -------  -------------
sleap_nn/inference/predictors.py  80.00%   16 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #342      +/-   ##
==========================================
- Coverage   95.28%   93.38%   -1.90%     
==========================================
  Files          49       49              
  Lines        6765     7079     +314     
==========================================
+ Hits         6446     6611     +165     
- Misses        319      468     +149     
