[2/3] Reduce SILK decode hot-path copies #115

Merged: zshang-oai merged 1 commit into pion:main from zshang-oai:codex/reduce-silk-decode-hot-path on May 21, 2026

@zshang-oai zshang-oai commented May 18, 2026

Summary

Reduce the remaining low-risk scalar overhead in the SILK decode hot path. This is the follow-up to #114, now rebased onto current main.

Major changes

Packet and output staging

  • For Opus code-0 packets, reuse a decoder-owned [1][]byte instead of building a fresh [][]byte through parsePacketFrames.
  • For mono SILK resampling, skip the deinterleave/reinterleave scratch path and resample the already-contiguous samples directly.
  • In decodeToFloat32, write decoded/resampled SILK output directly into the caller buffer when the decoded channel layout already matches and there are no SILK/CELT redundancy fades to apply. The old path always staged through resampleBuffer and copied out afterward.
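The code-0 reuse in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `packetDecoder` type, its `singleFrame` field, and the `splitFrames` method are hypothetical names, and only the code-0 case (one frame per packet, selected by the low two bits of the TOC byte per RFC 6716) is handled.

```go
package main

import (
	"errors"
	"fmt"
)

// packetDecoder is a hypothetical stand-in for the real decoder type.
type packetDecoder struct {
	// singleFrame is reused for every code-0 packet, so the hot path does
	// not build a fresh [][]byte just to hold one frame.
	singleFrame [1][]byte
}

// splitFrames returns the frames of an Opus packet. Only code 0 is handled
// in this sketch; other codes would go through a general parser.
func (d *packetDecoder) splitFrames(packet []byte) ([][]byte, error) {
	if len(packet) == 0 {
		return nil, errors.New("empty packet")
	}
	// The low two bits of the TOC byte select the frame-count code.
	if packet[0]&0x03 != 0 {
		return nil, errors.New("only code 0 is handled in this sketch")
	}
	// Everything after the TOC byte is the single frame.
	d.singleFrame[0] = packet[1:]
	// Slicing the decoder-owned array does not allocate. The result is only
	// valid until the next call on the same decoder.
	return d.singleFrame[:], nil
}

func main() {
	d := &packetDecoder{}
	frames, err := d.splitFrames([]byte{0x08, 0xAA, 0xBB})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(frames), len(frames[0]))
}
```

The trade-off is lifetime: the returned slice aliases decoder state, so a caller that needs the frames after the next decode call would have to copy them.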

Float32 to s16 packing

  • Replace math.Min/math.Max clamping in Float32ToSigned16 with simple branches.
  • Add a resampleCount == 1 fast path in ConvertFloat32LittleEndianToSigned16LittleEndian so the common case skips the general nested-loop path.
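A rough sketch of both ideas, branch-based clamping and a single-pass packing loop. The function names `float32ToS16` and `packS16LE` are hypothetical, and the 32767 scale factor is an assumption of this sketch, not necessarily what `Float32ToSigned16` uses.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// float32ToS16 converts one sample to int16, clamping with plain branches
// instead of math.Min/math.Max (which operate on float64 and force
// conversions in both directions). The 32767 scale is an assumption.
func float32ToS16(sample float32) int16 {
	scaled := sample * 32767
	if scaled > 32767 {
		return 32767
	}
	if scaled < -32768 {
		return -32768
	}
	return int16(scaled) // truncates toward zero
}

// packS16LE writes each input sample exactly once as little-endian int16.
// It models the "one output sample per input sample" case, where no inner
// repeat loop is needed. It returns the number of samples written.
func packS16LE(in []float32, out []byte) int {
	n := len(in)
	if len(out)/2 < n {
		n = len(out) / 2
	}
	for i := 0; i < n; i++ {
		binary.LittleEndian.PutUint16(out[2*i:], uint16(float32ToS16(in[i])))
	}
	return n
}

func main() {
	out := make([]byte, 6)
	n := packS16LE([]float32{0, 2, -2}, out)
	fmt.Println(n, out)
}
```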

LPC synthesis

  • Normalize LPC coefficients once per synthesis call instead of dividing inside the sample/coefficient loop.
  • Split first-subframe handling from steady-state subframes.
  • In the steady-state path, use contiguous LPC history slices to avoid repeated branchy “current subframe vs previous frame vs zero” lookups for every coefficient.
  • Move previousFrameLPCValues handoff out of the per-sample hot loop.
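The first and third bullets can be illustrated with a small all-pole filter. This is a sketch under stated assumptions, not the code in internal/silk/decoder.go: `lpcSynthesize` is a hypothetical name, coefficients are assumed to be in Q12 (so the normalization is a division by 4096, as in the RFC 6716 LPC synthesis formula), and the buffers are allocated per call for clarity where a real hot path would reuse scratch space.

```go
package main

import "fmt"

// lpcSynthesize computes out[i] = residual[i] + sum_k a[k] * out[i-k-1],
// where a[k] = aQ12[k] / 4096. history holds the samples that precede the
// subframe, oldest first; missing history is treated as zero.
func lpcSynthesize(residual []float32, aQ12 []int16, history []float32) []float32 {
	order := len(aQ12)

	// Normalize the coefficients once, instead of dividing inside the
	// per-sample, per-coefficient loop.
	coeffs := make([]float32, order)
	for k, a := range aQ12 {
		coeffs[k] = float32(a) / 4096
	}

	// Lay out history and output contiguously so the inner loop reads a
	// plain slice and needs no "which buffer holds this sample" branches.
	buf := make([]float32, order+len(residual))
	h := history
	if len(h) > order {
		h = h[len(h)-order:]
	}
	copy(buf[order-len(h):order], h)

	for i, r := range residual {
		past := buf[i : i+order] // past[order-1] is the most recent sample
		sum := r
		for k, c := range coeffs {
			sum += c * past[order-1-k]
		}
		buf[order+i] = sum
	}
	return buf[order:]
}

func main() {
	// One coefficient of 0.5 in Q12 and a unit impulse as the residual.
	out := lpcSynthesize([]float32{1, 0, 0}, []int16{2048}, []float32{0})
	fmt.Println(out)
}
```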

Why

After #114 removed most repeated SILK scratch allocations, the remaining cost in local stress runs was mostly scalar copying and output staging around the decode path.

This PR keeps the decode algorithm unchanged and removes avoidable hot-path movement between temporary buffers.

Validation

Ran:

GOCACHE=/private/tmp/opus-go-build GOLANGCI_LINT_CACHE=/private/tmp/opus-golangci-lint golangci-lint run
GOCACHE=/private/tmp/opus-go-build go test ./...

End-to-end stress benchmark only, no focused microbenchmarks:

go test -run '^$' -bench 'BenchmarkPionDecodeSerial$' -benchmem -benchtime=5s -count=1

Observed locally on Apple M4 Max, darwin/arm64:

| Branch | Throughput | Allocations |
| --- | --- | --- |
| main after #114 | ~54.8k packets/s | 4096 allocs/op |
| This PR | ~95.4k packets/s | 0 allocs/op |
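An allocs/op figure of this kind can also be spot-checked outside a benchmark run with `testing.AllocsPerRun` from the standard library. The sketch below is illustrative only: `stageFresh` and `stageReused` are hypothetical stand-ins for per-packet frame staging with and without a reused holder, not the decoder's real code, and the counts it reports say nothing about the numbers measured for this PR.

```go
package main

import (
	"fmt"
	"testing"
)

var (
	// frameSink is package-level so the staged slice escapes to the heap.
	frameSink [][]byte
	// holder plays the role of a long-lived, reused single-frame array.
	holder [1][]byte
)

// stageFresh builds a new [][]byte on every call.
func stageFresh(frame []byte) {
	frameSink = [][]byte{frame}
}

// stageReused stores the frame in the long-lived holder and slices it.
func stageReused(frame []byte) {
	holder[0] = frame
	frameSink = holder[:]
}

// measureAllocs reports the average allocations per call for both variants.
func measureAllocs() (fresh, reused float64) {
	frame := []byte{1, 2, 3}
	fresh = testing.AllocsPerRun(1000, func() { stageFresh(frame) })
	reused = testing.AllocsPerRun(1000, func() { stageReused(frame) })
	return fresh, reused
}

func main() {
	fresh, reused := measureAllocs()
	fmt.Printf("fresh: %.0f allocs/op, reused: %.0f allocs/op\n", fresh, reused)
}
```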

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 86.33094% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.41%. Comparing base (f682a5f) to head (7a0a95c).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| decoder.go | 78.37% | 12 Missing and 4 partials ⚠️ |
| internal/bitdepth/bitdepth.go | 76.92% | 1 Missing and 2 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #115      +/-   ##
==========================================
- Coverage   82.56%   82.41%   -0.15%     
==========================================
  Files          22       22              
  Lines        4742     4828      +86     
==========================================
+ Hits         3915     3979      +64     
- Misses        635      654      +19     
- Partials      192      195       +3     

| Flag | Coverage Δ |
| --- | --- |
| go | 82.41% <86.33%> (-0.15%) ⬇️ |

@zshang-oai zshang-oai force-pushed the codex/reduce-silk-decode-hot-path branch 2 times, most recently from 20933b3 to f861bef Compare May 20, 2026 23:12
@zshang-oai zshang-oai marked this pull request as ready for review May 20, 2026 23:39
@zshang-oai zshang-oai requested review from JoTurk, Sean-Der and Copilot May 20, 2026 23:41
Copilot AI left a comment

Pull request overview

This PR reduces scalar overhead and avoids unnecessary buffer copies and allocations in the Opus SILK decode hot path, aiming to improve throughput while keeping the decode algorithm's behavior unchanged.

Changes:

  • Avoid per-packet [][]byte allocation for Code 0 packets by reusing a decoder-owned single-frame holder.
  • Reduce SILK resampling/copy staging by writing directly into caller output when safe, and add a mono resample fast path.
  • Optimize hot loops in LPC synthesis and float32→s16 packing by reducing per-sample work and adding common-case fast paths.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| internal/silk/decoder.go | Refactors LPC synthesis to normalize coefficients once per call and uses a steady-state history slice to reduce per-sample branching. |
| internal/bitdepth/bitdepth.go | Adds a resampleCount==1 fast path and replaces min/max clamp calls with branches for float→int16 conversion. |
| decoder.go | Reduces staging/copies in SILK decode output handling, adds mono resample shortcut, and reuses a single-frame slice for Code 0 packets. |

Comment thread decoder.go
Comment thread decoder.go
@zshang-oai zshang-oai force-pushed the codex/reduce-silk-decode-hot-path branch from f861bef to 7a0a95c Compare May 21, 2026 03:25
@zshang-oai zshang-oai merged commit ee1d1de into pion:main May 21, 2026
21 checks passed
3 participants