
Faster BoolReader #124

Open · wants to merge 2 commits into main

Conversation

@SLiV9 commented Dec 28, 2024

  • Moved BoolReader to its own file.
  • It now reads from the buffer in chunks of 4 bytes at a time, except for the final 0-3 bytes.
  • Optimized successive calls to read_bool and read_with_tree by assuming none of them reaches the end of the buffer, returning a transparent BitResult, and validating only at the end.
  • Optimized each individual call to read_bool and read_with_tree by assuming each bit can be read from the 4-byte chunks (in FastReader), and retrying with the slow approach if this fails.
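The deferred-validation idea in the last two bullets can be sketched roughly as follows. The names (`BoolReader`, `BitResult`) match the PR description, but the fields and methods here are illustrative assumptions, not the PR's actual code:

```rust
// Sketch of the deferred-validation pattern: instead of branching on
// end-of-buffer in every read, keep reading (yielding zeros past the
// end), record the overrun, and validate once after a batch of reads.

struct BoolReader {
    chunks: Vec<[u8; 4]>, // buffer split into 4-byte chunks
    pos: usize,
    overrun: bool, // set if we ever read past the real data
}

// Transparent wrapper carrying a value without a per-read error branch.
struct BitResult<T>(T);

impl BoolReader {
    fn read_u32_chunk(&mut self) -> BitResult<u32> {
        let chunk = match self.chunks.get(self.pos) {
            Some(c) => *c,
            None => {
                self.overrun = true; // remember the overrun, keep going
                [0u8; 4]
            }
        };
        self.pos += 1;
        BitResult(u32::from_be_bytes(chunk))
    }

    // Validate once, at the end of a run of reads.
    fn check<T>(&self, result: BitResult<T>) -> Result<T, &'static str> {
        if self.overrun {
            Err("read past end of buffer")
        } else {
            Ok(result.0)
        }
    }
}

fn main() {
    let mut reader = BoolReader {
        chunks: vec![[0x12, 0x34, 0x56, 0x78]],
        pos: 0,
        overrun: false,
    };
    let first = reader.read_u32_chunk();
    assert_eq!(reader.check(first).unwrap(), 0x12345678);
    let second = reader.read_u32_chunk(); // past the end: flagged, not a panic
    assert!(reader.check(second).is_err());
}
```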

Final performance results are a 1.3x speedup compared to image-rs 0.2.0 (--use-reference), although it is still 1.3x slower than libwebp:

Summary
'dwebp -noasm -nofancy Puente.webp' ran
1.00 ± 0.01 times faster than 'dwebp -noasm -nofancy Puente.webp'
1.31 ± 0.01 times faster than 'target/release/image-webp-runner Puente.webp'
1.69 ± 0.01 times faster than 'target/release/image-webp-runner Puente.webp --use-reference'

(I ran dwebp as the first and the last candidate to negate any effects from my poor laptop's CPU overheating.)

This uses as_flattened_mut() which was stabilized in 1.80.0, so merging this probably requires raising the MSRV. I don't know your policy on that, but the alternative was adding unsafe or adding another dependency (that itself uses unsafe), so I left it as is.
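For reference, `as_flattened_mut()` (stable since Rust 1.80.0) views a slice of fixed-size arrays as one contiguous mutable byte slice without `unsafe`, which is presumably what the 4-byte-chunked buffer needs:

```rust
fn main() {
    // A buffer stored as 4-byte chunks, as described above (toy data).
    let mut chunks: [[u8; 4]; 2] = [[1, 2, 3, 4], [5, 6, 7, 8]];

    // as_flattened_mut gives one contiguous &mut [u8] over all chunks,
    // avoiding unsafe code or an extra dependency.
    let flat: &mut [u8] = chunks.as_flattened_mut();
    flat[0] = 42;
    assert_eq!(flat.len(), 8);

    // The write through the flat view is visible in the chunked view.
    assert_eq!(chunks[0], [42, 2, 3, 4]);
}
```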

PS:

  • I thought about extending the buffer with a few zero bytes so that everything can be read with the FastReader, but I don't think it would help much, and it might worst-case require reallocating the buffer.
  • read_literal has some obvious optimizations but it doesn't seem part of the latency critical path.
  • There might be an optimization in read_flag's 1 + (((range - 1) * 128) >> 8), but it seems hard to measure.
  • I think further optimizations to other parts of the decoding might push us to be faster than libwebp's performance.
  • I tried my hand at coaxing the compiler to apply SIMD to src/transform.rs, but it was very dependent on preventing function inlining, and ultimately I haven't gotten any noticeable performance gains yet. I might try again later and create a separate PR.

@Shnatsel (Contributor)

transform.rs is used only for lossless images, so changing anything there won't affect lossy ones. You can create a lossless image with convert -quality 100 input.png output.webp and verify with webpinfo that the file is indeed lossless.

That said, lossless WebP is already plenty fast specifically due to optimizations to transforms. We actually beat dwebp -noasm in my tests, although dwebp when allowed to use handwritten assembly still beats us by 7% to 15% on lossless images.

@Shnatsel (Contributor)

Regarding bit reading: libwebp has a dedicated codepath for reading with probability 128 that is distinct from the general-purpose one. Is that something that you've explored?

If you haven't attempted it, it doesn't have to be a part of this PR. I just wanted to know if this has been attempted or not.

I would expect this not to matter if the hot variant of read_bool gets inlined anyway - the constant propagation should probably take care of it.

@SLiV9 (Author) commented Dec 28, 2024

> transform.rs is used only for lossless images, so changing anything there won't affect lossy ones.

Huh, are you sure? I only mentioned it because idct4x4 showed up as 8% of the runtime in callgrind when running against the Puente image. I did some optimizations that involved renaming that function, and the renamed function showed up at around 7.5%; either way, not enough to be measurable.

I'm not denying that it's already plenty fast, just that I'm certain it showed up in my call graphs inside read_coefficients().

@SLiV9 (Author) commented Dec 28, 2024

> Regarding bit reading: libwebp has a dedicated codepath for reading with probability 128 that is distinct from the general-purpose one. Is that something that you've explored?
>
> If you haven't attempted it, it doesn't have to be a part of this PR. I just wanted to know if this has been attempted or not.
>
> I would expect this not to matter if the hot variant of read_bool gets inlined anyway - the constant propagation should probably take care of it.

Yes, that's the read_flag optimization I mentioned in the PR. I didn't end up doing it, and in fact the way I have the inlining set up actually prevents the compiler from doing any special optimizations for the 128 case. That's because too much inlining/specialization seemed to make everything 20% slower, which I theorize is due to instruction cache misses.

But indeed, that's something that can be revisited in a separate PR.
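For context on why probability 128 is special: with the split expression quoted earlier, 1 + (((range - 1) * 128) >> 8), the multiply collapses to a single shift, which is roughly what a dedicated codepath could exploit. A sketch of the arithmetic (not libwebp's or this PR's actual code; function names are made up):

```rust
// General VP8-style split for a probability in 0..=255.
fn split(range: u32, prob: u32) -> u32 {
    1 + (((range - 1) * prob) >> 8)
}

// Specialized for prob == 128: ((range - 1) * 128) >> 8 == (range - 1) >> 1,
// since multiplying by 2^7 and shifting right by 8 is a net shift right by 1.
fn split_128(range: u32) -> u32 {
    1 + ((range - 1) >> 1)
}

fn main() {
    // The two agree for every range value the decoder keeps (128..=255).
    for range in 128..=255 {
        assert_eq!(split(range, 128), split_128(range));
    }
}
```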

@fintelia (Contributor)

transform.rs is for lossy images while lossless_transform.rs is for lossless images.

It might be worth renaming "bool reader" to "arithmetic decoder" or something to that effect, because it is doing boolean arithmetic coding rather than simply reading bits.

@Shnatsel (Contributor)

FWIW, the FastReader::read_flag optimization shows no change on end-to-end benchmarks for the large image on my machine. It's possible that it helps other machines, just not mine.

@Shnatsel (Contributor)

I can confirm this didn't break anything 🎉

No behavioral changes before and after on my corpus of 7,500 images scraped from the web.
