Significant performance increase in bottom up inference by reducing GPU to CPU data transfer. #342

tom21100227 · 2025-10-02T21:44:06Z

When running inference with a bottom-up model, the Hungarian step in candidate matching was pulling scalars off-device one at a time inside the loop. We now copy the whole tensor to the host once per sample.

Changes

Move the host transfer for PAF candidate matching to a single .detach().cpu() per sample inside match_candidates_sample, so that building the cost matrix for SciPy’s Hungarian solver stops triggering per-element device syncs. This removes the millions of tiny kernel launches and memcpys that were gating throughput, while keeping outputs identical.
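As a rough illustration of the pattern (not the actual sleap_nn code), the sketch below builds the same cost matrix two ways: once with a per-cell .item() call, and once with a single .detach().cpu() copy followed by NumPy indexing. The function names, the index tensors, and the np.inf fill value are all hypothetical, and the sketch assumes at most one candidate line per (i, j) pair.

```python
import numpy as np
import torch


def cost_matrix_per_element(line_scores, src_idx, dst_idx, n_src, n_dst):
    """Slow pattern: every .item() is a separate device-to-host sync."""
    cost = np.full((n_src, n_dst), np.inf)  # fill value is an assumption
    for i in range(n_src):
        for j in range(n_dst):
            mask = (src_idx == i) & (dst_idx == j)
            if mask.any():  # evaluating this on a GPU tensor also syncs
                cost[i, j] = -line_scores[mask].item()
    return cost


def cost_matrix_single_transfer(line_scores, src_idx, dst_idx, n_src, n_dst):
    """Fast pattern: copy each tensor to the host once, then index in NumPy."""
    scores = line_scores.detach().cpu().numpy()
    src = src_idx.detach().cpu().numpy()
    dst = dst_idx.detach().cpu().numpy()
    cost = np.full((n_src, n_dst), np.inf)
    cost[src, dst] = -scores  # assumes unique (src, dst) pairs
    return cost


device = "cuda" if torch.cuda.is_available() else "cpu"
line_scores = torch.tensor([0.9, 0.2, 0.7], device=device)
src_idx = torch.tensor([0, 0, 1], device=device)
dst_idx = torch.tensor([0, 1, 1], device=device)

slow = cost_matrix_per_element(line_scores, src_idx, dst_idx, 2, 2)
fast = cost_matrix_single_transfer(line_scores, src_idx, dst_idx, 2, 2)
print(np.array_equal(slow, fast))
```

Both functions return the same NumPy array, so whatever solver consumes the matrix sees identical input. Only the number of device round trips changes.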

Previous code in sleap_nn.inference.paf_grouping.py:L570:

                if mask.any():
                    cost_matrix[i, j] = -line_scores_k[
                        mask
                    ].item()  # <- very expensive: pulls a single scalar from a GPU tensor into host memory

Benchmark Results

Dataset: 30,002-frame 1024x1080 video at 50fps with bottom-up model (batch_size=8, queue_maxsize=16, tracking=True).
Wall-clock (progress bar): 22m 40s → 15m 30s (about 32% less wall-clock time, roughly 1.46x throughput).
CUDA API summary (Nsight Systems):
- cudaLaunchKernel: 35,153,678 → 7,500,304 calls.
- cudaMemcpyAsync: 11,544,619 → 1,132,888 calls.
Predictions checked against the baseline .slp – identical.
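The headline figures can be recomputed from the raw numbers listed here. The sketch below is only arithmetic on the reported values; the variable names are illustrative. Note that the percentage depends on whether it is expressed as time saved or as throughput gained.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds wall-clock reading to seconds."""
    return minutes * 60 + seconds


frames = 30_002
before_s = to_seconds(22, 40)
after_s = to_seconds(15, 30)

# Two common ways to express the same improvement.
time_saved_pct = 100 * (before_s - after_s) / before_s
throughput_gain_pct = 100 * (before_s / after_s - 1)

# Frames processed per second of wall-clock time.
fps_before = frames / before_s
fps_after = frames / after_s

# Reduction factors for the CUDA API call counts.
launch_ratio = 35_153_678 / 7_500_304
memcpy_ratio = 11_544_619 / 1_132_888

print(f"time saved: {time_saved_pct:.1f}%")
print(f"throughput gain: {throughput_gain_pct:.1f}%")
print(f"fps: {fps_before:.1f} -> {fps_after:.1f}")
print(f"cudaLaunchKernel calls reduced {launch_ratio:.1f}x")
print(f"cudaMemcpyAsync calls reduced {memcpy_ratio:.1f}x")
```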

Before the patch:

Profiling summary
-----------------
frame_buffer.get: count=30002 avg=0.016 ms p95=0.025 ms
BatchPrepare: count=3751 avg=19.182 ms p95=22.216 ms
BatchAssemble: count=3751 avg=6.347 ms p95=6.949 ms
Postprocess: count=3751 avg=80.151 ms p95=93.482 ms
BottomUpInferenceModel.forward: count=3751 avg=238.494 ms p95=262.925 ms
PAFScorer.predict: count=3751 avg=148.015 ms p95=172.575 ms
Tracker.track: count=30002 avg=7.434 ms p95=10.082 ms
Predicted 30002 labeled frames

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls                 Name               
 --------  ---------------  ----------  ---------------------------------
     55.6  335,929,996,984  11,435,632  cudaStreamSynchronize            
     31.3  189,218,585,069  35,153,678  cudaLaunchKernel                 
     12.8   77,275,524,738  11,544,619  cudaMemcpyAsync

After the patch:

Profiling summary
-----------------
frame_buffer.get: count=30002 avg=0.016 ms p95=0.025 ms
BatchPrepare: count=3751 avg=18.768 ms p95=21.310 ms
BatchAssemble: count=3751 avg=6.178 ms p95=6.718 ms
Postprocess: count=3751 avg=81.858 ms p95=95.345 ms
BottomUpInferenceModel.forward: count=3751 avg=159.312 ms p95=166.948 ms <--
PAFScorer.predict: count=3751 avg=68.625 ms p95=76.426 ms <--
Tracker.track: count=30002 avg=7.728 ms p95=10.495 ms
Predicted 30002 labeled frames

 ** CUDA API Summary (cuda_api_sum):

 Time (%)  Total Time (ns)  Num Calls                Name               
 --------  ---------------  ---------  ---------------------------------
     80.5  286,479,277,968  1,024,127  cudaStreamSynchronize            
     11.9   42,309,350,611  7,500,304  cudaLaunchKernel                 
      7.3   25,841,925,180  1,132,888  cudaMemcpyAsync
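A quick way to read the two profiling summaries is to compare the per-batch averages of the stages that changed. The sketch below uses only the logged values. It suggests the saving in BottomUpInferenceModel.forward is almost entirely accounted for by PAFScorer.predict, under the assumption (not stated in the logs) that the PAF scoring time is included in the forward time.

```python
batches = 3751  # count reported for both stages

before_ms = {
    "BottomUpInferenceModel.forward": 238.494,
    "PAFScorer.predict": 148.015,
}
after_ms = {
    "BottomUpInferenceModel.forward": 159.312,
    "PAFScorer.predict": 68.625,
}

# Average time saved per batch, in milliseconds.
saved_ms = {name: before_ms[name] - after_ms[name] for name in before_ms}

# Total time saved over the whole run, in seconds.
total_saved_s = {name: batches * ms / 1000 for name, ms in saved_ms.items()}

for name in before_ms:
    print(f"{name}: -{saved_ms[name]:.3f} ms/batch, ~{total_saved_s[name]:.0f} s total")
```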

Attached is the test script used to generate all of the test results; it tags functions for the NVIDIA Nsight profiler: test.py

To reproduce the benchmark:

apt update
apt install nsight-systems-2025.3.2
nsys profile -o sleap_inference python nsight_inference_example.py --model <path> --video data/subset.mp4 --batch-size 8 --queue-maxsize 16 --tracking

To view the profiling results:

nsys stats sleap_inference.nsys-rep

codecov · 2025-10-02T21:48:09Z

Codecov Report

❌ Patch coverage is 80.95238% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.38%. Comparing base (ff91433) to head (968aa52).
⚠️ Report is 46 commits behind head on main.

Files with missing lines	Patch %	Lines
sleap_nn/inference/predictors.py	80.00%	16 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #342      +/-   ##
==========================================
- Coverage   95.28%   93.38%   -1.90%     
==========================================
  Files          49       49              
  Lines        6765     7079     +314     
==========================================
+ Hits         6446     6611     +165     
- Misses        319      468     +149

decouple paf

332346b

add more staging

968aa52
