
Faster BoolReader #124

Open · wants to merge 2 commits into main

Conversation

@SLiV9 commented Dec 28, 2024

  • Moved BoolReader to its own file.
  • It now reads from the buffer in chunks of 4 bytes at a time, except for the final 0-3 bytes.
  • Optimized successive calls to read_bool and read_with_tree by assuming none of them reaches the end of the buffer, returning a transparent BitResult, and validating only at the end.
  • Optimized each individual call to read_bool and read_with_tree by assuming each bit can be read from the 4-byte chunks (in FastReader), and retrying with the slow approach if this fails.
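The deferred-validation idea in the last two bullets can be sketched roughly as follows. The names (`BoolReader`, `BitResult`) match the PR description, but the fields and methods here are illustrative assumptions, not the PR's actual code:

```rust
// Sketch of the deferred-validation pattern: instead of branching on
// end-of-buffer in every read, keep reading (yielding zeros past the
// end), record the overrun, and validate once after a batch of reads.

struct BoolReader {
    chunks: Vec<[u8; 4]>, // buffer split into 4-byte chunks
    pos: usize,
    overrun: bool, // set if we ever read past the real data
}

// Transparent wrapper carrying a value without a per-read error branch.
struct BitResult<T>(T);

impl BoolReader {
    fn read_u32_chunk(&mut self) -> BitResult<u32> {
        let chunk = match self.chunks.get(self.pos) {
            Some(c) => *c,
            None => {
                self.overrun = true; // remember the overrun, keep going
                [0u8; 4]
            }
        };
        self.pos += 1;
        BitResult(u32::from_be_bytes(chunk))
    }

    // Validate once, at the end of a run of reads.
    fn check<T>(&self, result: BitResult<T>) -> Result<T, &'static str> {
        if self.overrun {
            Err("read past end of buffer")
        } else {
            Ok(result.0)
        }
    }
}

fn main() {
    let mut reader = BoolReader {
        chunks: vec![[0x12, 0x34, 0x56, 0x78]],
        pos: 0,
        overrun: false,
    };
    let first = reader.read_u32_chunk();
    assert_eq!(reader.check(first).unwrap(), 0x12345678);
    let second = reader.read_u32_chunk(); // past the end: flagged, not a panic
    assert!(reader.check(second).is_err());
}
```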

Final performance results are a 1.3x speedup compared to image-rs 0.2.0 (--use-reference), although it is still 1.3x slower than libwebp:

Summary
'dwebp -noasm -nofancy Puente.webp' ran
1.00 ± 0.01 times faster than 'dwebp -noasm -nofancy Puente.webp'
1.31 ± 0.01 times faster than 'target/release/image-webp-runner Puente.webp'
1.69 ± 0.01 times faster than 'target/release/image-webp-runner Puente.webp --use-reference'

(I ran dwebp as the first and the last candidate to negate any effects from my poor laptop's CPU overheating.)

This uses as_flattened_mut() which was stabilized in 1.80.0, so merging this probably requires raising the MSRV. I don't know your policy on that, but the alternative was adding unsafe or adding another dependency (that itself uses unsafe), so I left it as is.
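For reference, `as_flattened_mut()` (stable since Rust 1.80.0) views a slice of fixed-size arrays as one contiguous mutable byte slice without `unsafe`, which is presumably what the 4-byte-chunked buffer needs:

```rust
fn main() {
    // A buffer stored as 4-byte chunks, as described above (toy data).
    let mut chunks: [[u8; 4]; 2] = [[1, 2, 3, 4], [5, 6, 7, 8]];

    // as_flattened_mut gives one contiguous &mut [u8] over all chunks,
    // avoiding unsafe code or an extra dependency.
    let flat: &mut [u8] = chunks.as_flattened_mut();
    flat[0] = 42;
    assert_eq!(flat.len(), 8);

    // The write through the flat view is visible in the chunked view.
    assert_eq!(chunks[0], [42, 2, 3, 4]);
}
```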

PS:

  • I thought about extending the buffer with a few zero bytes so that everything can be read with the FastReader, but I don't think it would help much, and it might worst-case require reallocating the buffer.
  • read_literal has some obvious optimizations but it doesn't seem part of the latency critical path.
  • There might be an optimization in read_flag's 1 + (((range - 1) * 128) >> 8), but it seems hard to measure.
  • I think further optimizations to other parts of the decoding might push us to be faster than libwebp's performance.
  • I tried my hand at coaxing the compiler to apply SIMD to src/transform.rs, but it was very dependent on preventing function inlining, and ultimately I haven't gotten any noticeable performance gains yet. I might try again later and create a separate PR.

@Shnatsel (Contributor)

transform.rs is used only for lossless images, so changing anything there won't affect lossy ones. You can create a lossless image with convert -quality 100 input.png output.webp and verify with webpinfo that the file is indeed lossless.

That said, lossless WebP is already plenty fast specifically due to optimizations to transforms. We actually beat dwebp -noasm in my tests, although dwebp when allowed to use handwritten assembly still beats us by 7% to 15% on lossless images.

@Shnatsel (Contributor)

Regarding bit reading: libwebp has a dedicated codepath for reading with probability 128 that is distinct from the general-purpose one. Is that something that you've explored?

If you haven't attempted it, it doesn't have to be a part of this PR. I just wanted to know if this has been attempted or not.

I would expect this not to matter if the hot variant of read_bool gets inlined anyway - the constant propagation should probably take care of it.

@SLiV9 (Author) commented Dec 28, 2024

> transform.rs is used only for lossless images, so changing anything there won't affect lossy ones.

Huh, are you sure? I only mentioned it because idct4x4 showed up as 8% of the runtime in callgrind when running against the Puente image. I did some optimizations that involved renaming that function, and the renamed function showed up at around 7.5%; either way, not enough to be measurable.

I'm not denying that it's already plenty fast, just that I'm certain it showed up in my call graphs inside read_coefficients().

@SLiV9 (Author) commented Dec 28, 2024

> Regarding bit reading: libwebp has a dedicated codepath for reading with probability 128 that is distinct from the general-purpose one. Is that something that you've explored?
>
> If you haven't attempted it, it doesn't have to be a part of this PR. I just wanted to know if this has been attempted or not.
>
> I would expect this not to matter if the hot variant of read_bool gets inlined anyway - the constant propagation should probably take care of it.

Yes, that's the read_flag optimization I mentioned in the PR. I didn't end up doing it, and in fact the way I have the inlining set up actually prevents the compiler from doing any special optimizations for the 128 case. That's because too much inlining/specialization seemed to make everything 20% slower, which I theorize is due to instruction cache misses.

But indeed, that's something that can be revisited in a separate PR.
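For context on why probability 128 is special: with the split expression quoted earlier, 1 + (((range - 1) * 128) >> 8), the multiply collapses to a single shift, which is roughly what a dedicated codepath could exploit. A sketch of the arithmetic (not libwebp's or this PR's actual code; function names are made up):

```rust
// General VP8-style split for a probability in 0..=255.
fn split(range: u32, prob: u32) -> u32 {
    1 + (((range - 1) * prob) >> 8)
}

// Specialized for prob == 128: ((range - 1) * 128) >> 8 == (range - 1) >> 1,
// since multiplying by 2^7 and shifting right by 8 is a net shift right by 1.
fn split_128(range: u32) -> u32 {
    1 + ((range - 1) >> 1)
}

fn main() {
    // The two agree for every range value the decoder keeps (128..=255).
    for range in 128..=255 {
        assert_eq!(split(range, 128), split_128(range));
    }
}
```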

@fintelia (Contributor)

transform.rs is for lossy images while lossless_transform.rs is for lossless images.

It might be worth renaming "bool reader" to "arithmetic decoder" or something to that effect, because it is doing boolean arithmetic coding rather than simply reading bits.

@Shnatsel (Contributor)

FWIW, the FastReader::read_flag optimization shows no change on end-to-end benchmarks for the large image on my machine. It's possible that it helps other machines, just not mine.

@Shnatsel (Contributor)

I can confirm this didn't break anything 🎉

No behavioral changes before and after on my corpus of 7,500 images scraped from the web.
