[2/3] Reduce SILK decode hot-path copies #115
Codecov Report
❌ Patch coverage is
Coverage diff, main vs. #115:
| Metric | main | #115 | +/- |
|---|---|---|---|
| Coverage | 82.56% | 82.41% | -0.15% |
| Files | 22 | 22 | |
| Lines | 4742 | 4828 | +86 |
| Hits | 3915 | 3979 | +64 |
| Misses | 635 | 654 | +19 |
| Partials | 192 | 195 | +3 |
Pull request overview
This PR reduces scalar overhead and avoids unnecessary buffer copies/allocation in the Opus SILK decode hot path, aiming to improve throughput while keeping the decode algorithm behavior unchanged.
Changes:
- Avoid per-packet `[][]byte` allocation for Code 0 packets by reusing a decoder-owned single-frame holder.
- Reduce SILK resampling/copy staging by writing directly into caller output when safe, and add a mono resample fast path.
- Optimize hot loops in LPC synthesis and float32→s16 packing by reducing per-sample work and adding common-case fast paths.
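The single-frame holder idea in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's code: `frameDecoder`, `singleFrame`, and `framesForCode0` are hypothetical names, and the real decoder's fields and parsing logic are assumed to differ.

```go
package main

import "fmt"

// frameDecoder is a hypothetical stand-in for the real decoder type.
type frameDecoder struct {
	// singleFrame is decoder-owned storage reused for every Code 0 packet,
	// so the hot path does not build a fresh [][]byte per packet.
	singleFrame [1][]byte
}

// framesForCode0 returns a one-element view over the decoder-owned array.
// The returned slice aliases d.singleFrame and is only valid until the
// next call on the same decoder.
func (d *frameDecoder) framesForCode0(payload []byte) [][]byte {
	d.singleFrame[0] = payload
	return d.singleFrame[:]
}

func main() {
	d := &frameDecoder{}
	frames := d.framesForCode0([]byte{0x01, 0x02, 0x03})
	fmt.Println(len(frames), len(frames[0]))
}
```

The trade-off in this sketch is aliasing: callers must finish with the returned frames before the next packet is parsed, which is usually acceptable inside a single decode call.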
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/silk/decoder.go | Refactors LPC synthesis to normalize coefficients once per call and uses a steady-state history slice to reduce per-sample branching. |
| internal/bitdepth/bitdepth.go | Adds a `resampleCount == 1` fast path and replaces min/max clamp calls with branches for float→int16 conversion. |
| decoder.go | Reduces staging/copies in SILK decode output handling, adds mono resample shortcut, and reuses a single-frame slice for Code 0 packets. |
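As a rough illustration of the branch-based clamp mentioned for `internal/bitdepth/bitdepth.go`, the sketch below converts one float32 sample to int16 without calling `math.Min`/`math.Max`. The helper name `clampToInt16`, the 32768 scale factor, and truncation toward zero are assumptions for the example; the repository's `Float32ToSigned16` may scale or round differently.

```go
package main

import "fmt"

// clampToInt16 is a hypothetical helper, not the repository's function.
// It scales a sample in roughly [-1, 1] and saturates with plain branches,
// which avoids the float64 conversions that math.Min/math.Max require.
func clampToInt16(sample float32) int16 {
	scaled := sample * 32768 // assumed scale factor
	if scaled > 32767 {
		return 32767
	}
	if scaled < -32768 {
		return -32768
	}
	// In range: the conversion truncates toward zero. NaN input is not
	// handled specially in this sketch.
	return int16(scaled)
}

func main() {
	for _, s := range []float32{0.5, 2.0, -2.0} {
		fmt.Println(s, clampToInt16(s))
	}
}
```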
Summary
Reduce the remaining low-risk scalar overhead in the SILK decode hot path. This is the follow-up to #114, now rebased onto current `main`.
Major changes
Packet and output staging
- For Code 0 packets, reuse a decoder-owned `[1][]byte` instead of building a fresh `[][]byte` through `parsePacketFrames`.
- In `decodeToFloat32`, write decoded/resampled SILK output directly into the caller buffer when the decoded channel layout already matches and there are no SILK/CELT redundancy fades to apply. The old path always staged through `resampleBuffer` and copied out afterward.
Float32 to s16 packing
- Replace `math.Min`/`math.Max` clamping in `Float32ToSigned16` with simple branches.
- Add a `resampleCount == 1` fast path in `ConvertFloat32LittleEndianToSigned16LittleEndian`, avoiding the general nested-loop path for the common case here.
LPC synthesis
- Move the `previousFrameLPCValues` handoff out of the per-sample hot loop.
Why
After #114 removed most repeated SILK scratch allocations, the remaining local stress cost was mostly scalar copy and output staging work around the decode path.
This PR keeps the decode algorithm unchanged and removes avoidable hot-path movement between temporary buffers.
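The "avoidable movement between temporary buffers" can be pictured with a small sketch of the direct-write decision. Everything here is hypothetical: `stagingDecoder`, `produce`, and the two boolean conditions stand in for the real decode and resample step, the channel-layout check, and the redundancy-fade check, none of which are reproduced from the PR.

```go
package main

import "fmt"

// stagingDecoder is a hypothetical decoder with a reusable scratch buffer,
// playing the role the description assigns to resampleBuffer.
type stagingDecoder struct {
	scratch []float32
}

// produce stands in for the decode+resample step; it fills dst in place.
func produce(dst []float32) {
	for i := range dst {
		dst[i] = float32(i) * 0.25
	}
}

// decodeInto writes straight into out when no post-processing is needed,
// and otherwise stages through scratch and copies. It reports which path ran.
func (d *stagingDecoder) decodeInto(out []float32, layoutMatches, needsFade bool) (direct bool) {
	if layoutMatches && !needsFade {
		produce(out) // fast path: no intermediate buffer, no copy
		return true
	}
	if cap(d.scratch) < len(out) {
		d.scratch = make([]float32, len(out))
	}
	staged := d.scratch[:len(out)]
	produce(staged)
	// Channel conversion or fades would be applied to staged here.
	copy(out, staged)
	return false
}

func main() {
	d := &stagingDecoder{}
	out := make([]float32, 4)
	fmt.Println(d.decodeInto(out, true, false), out)
	fmt.Println(d.decodeInto(out, true, true), out)
}
```

Both paths yield the same samples in this sketch; the fast path simply skips one full-buffer copy per call.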
Validation
Ran:
GOCACHE=/private/tmp/opus-go-build GOLANGCI_LINT_CACHE=/private/tmp/opus-golangci-lint golangci-lint run
GOCACHE=/private/tmp/opus-go-build go test ./...
End-to-end stress benchmark only, no focused microbenchmarks:
Observed locally on Apple M4 Max, `darwin/arm64`:
`main` after #114