[WIP, Please benchmark] Use homogeneous coordinates in pippenger #1767
base: master
Conversation
theStack left a comment

> Even though the formulas are complete, infinity is special cased for performance.

Ran the benchmarks on my arm64 notebook (a Lenovo ThinkPad T14s Gen 6 with a Qualcomm Snapdragon X Elite CPU, using gcc 14.2.0) with a hacked-together build-and-benchmark script and got the following results: https://gist.github.com/theStack/897d7b50b5b8a6f288ed2b817fcca9fc
siv2r left a comment
I’ve benchmarked this pull request on my MacBook Pro: M4 Pro chip (ARM64, 12 cores: 8P + 4E), macOS 15.6.1, plugged in and no background apps.
I used this Python script to benchmark the code. It’s basically an adapted version of @theStack’s bash script, but extends it to parse the .txt benchmark files and output an .xlsx comparison of performance improvements.
In my benchmarks, all three variants are slightly slower (~0.5% on average) than master, with v2 performing best, followed by v3 and v1.
I tried to keep the benchmarks as accurate as possible. From my brief reading, on macOS with Apple Silicon you can't actually pin processes to specific P-cores the way you would with taskset on Linux. I'll read more on this and see if there's a workaround; hopefully that would make the benchmarks more robust.
This adds a new representation `geh` of group elements, namely in homogeneous (also called projective) coordinates. This is supposed to be faster for unmixed addition (i.e., where the second summand is not a `ge`) in terms of field operations, namely 12M + 0S + 25A vs. 12M + 4S + 11A for Jacobian coordinates (M = multiplication, S = squaring, A = addition/subtraction).

The addition and doubling formulas are due to Renes, Costello, and Batina 2016, Algorithms 7 and 9. The formulas are complete, i.e., they have no special cases. However, this implementation still keeps track of infinity in a dedicated boolean flag for performance reasons. Since the buckets in Pippenger's algorithm are initialized with infinity (= zero), we'll have many additions involving infinity, and going through the entire formula for each of those hurts performance (and the entire point of this PR is performance).
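For illustration, here is a sketch of the complete addition law (Renes–Costello–Batina 2016, Algorithm 7, for curves y² = x³ + b with a = 0) over a toy field F_101 with b = 7. This is not libsecp256k1 code: the `fadd`/`fmul` helpers and the tiny field are stand-ins so the block is self-contained; real code would use the `fe` type and functions from `field.h`.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative sketch, NOT libsecp256k1 code: complete addition
 * (Renes-Costello-Batina 2016, Algorithm 7) on y^2 = x^3 + 7 over F_101,
 * so b3 = 3*b = 21. */

#define P 101u
#define B3 21u

typedef struct { uint32_t x, y, z; } geh; /* homogeneous (X : Y : Z) */

static uint32_t fadd(uint32_t a, uint32_t b) { return (a + b) % P; }
static uint32_t fsub(uint32_t a, uint32_t b) { return (a + P - b) % P; } /* assumes reduced inputs */
static uint32_t fmul(uint32_t a, uint32_t b) { return (a * b) % P; }

/* One code path for all inputs: generic addition, doubling, and the
 * identity (0 : 1 : 0) -- no special cases. */
static geh geh_add(geh p, geh q) {
    uint32_t t0, t1, t2, t3, t4, x3, y3, z3;
    t0 = fmul(p.x, q.x); t1 = fmul(p.y, q.y); t2 = fmul(p.z, q.z);
    t3 = fadd(p.x, p.y); t4 = fadd(q.x, q.y); t3 = fmul(t3, t4);
    t4 = fadd(t0, t1);   t3 = fsub(t3, t4);   t4 = fadd(p.y, p.z);
    x3 = fadd(q.y, q.z); t4 = fmul(t4, x3);   x3 = fadd(t1, t2);
    t4 = fsub(t4, x3);   x3 = fadd(p.x, p.z); y3 = fadd(q.x, q.z);
    x3 = fmul(x3, y3);   y3 = fadd(t0, t2);   y3 = fsub(x3, y3);
    x3 = fadd(t0, t0);   t0 = fadd(x3, t0);   t2 = fmul(B3, t2);
    z3 = fadd(t1, t2);   t1 = fsub(t1, t2);   y3 = fmul(B3, y3);
    x3 = fmul(t4, y3);   t2 = fmul(t3, t1);   x3 = fsub(t2, x3);
    y3 = fmul(y3, t0);   t1 = fmul(t1, z3);   y3 = fadd(t1, y3);
    t0 = fmul(t0, t3);   z3 = fmul(z3, t4);   z3 = fadd(z3, t0);
    return (geh){ x3, y3, z3 };
}
```

Since the formula is complete, adding the point at infinity (0 : 1 : 0), or a point to itself, goes through the very same multiplication chain; the dedicated infinity flag in this PR exists only to skip that work in the common case of freshly initialized buckets.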
The formulas were implemented by giving GPT-5 mini screenshots of the algorithms in the paper and of `field.h`. The result was not awesome, but I could clean it up manually.

The new representation is used in Pippenger's ecmult_multi for accumulating the buckets after every window iteration. Buckets are still constructed as `gej` (because it has faster mixed addition) and only converted to `geh` before accumulation. This is still supposed to be faster even when the conversion is accounted for: the conversion costs 2M + 1S, but we then do two `geh` additions in a row, saving 8S. This PR has three different variants of how `geh` could be used; they differ in how much of the accumulation is done in `geh`, with the last variant keeping `gej` for rows of doublings.

Unfortunately, none of these turns out to be really faster in `bench_ecmult pippenger_wnaf` on my x86_64 system with gcc 15.2.1 or clang 21.1.4. The best variant (2) beats master by just 0.21%; the other variants are slower than master. :/ If I compile in 32-bit mode, all three variants beat master consistently, but only by 1.2%. This latter result gives at least some hope that the PR could pay off on some platform. I'm not even sure how much we care about 32-bit platforms; maybe we care about hardware wallets in general, but probably not when it comes to ecmult_multi. (Plus this would need real benchmarks; I didn't even run this on a native 32-bit CPU.) But we'd certainly care about ARM64, which I couldn't test on. Anyone with an ARM Mac willing to benchmark this?
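As an aside, the `gej` → `geh` conversion whose 2M + 1S cost is quoted above can be sketched like this. Jacobian coordinates encode x = X/Z², y = Y/Z³, while homogeneous coordinates encode x = X/Z, y = Y/Z; picking the new denominator Z' = Z³ makes the change of representation cheap. Again a toy field stands in for libsecp256k1's actual `fe` type, and the struct layouts are hypothetical:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative sketch, NOT libsecp256k1 code: converting Jacobian
 * coordinates (x = X/Z^2, y = Y/Z^3) to homogeneous ones (x = X/Z,
 * y = Y/Z) over a toy field F_101.  With Z' = Z^3 we get X' = X*Z and
 * Y' = Y, i.e. exactly 2M + 1S. */

#define P 101u

typedef struct { uint32_t x, y, z; } gej; /* Jacobian */
typedef struct { uint32_t x, y, z; } geh; /* homogeneous */

static uint32_t fmul(uint32_t a, uint32_t b) { return (a * b) % P; }

static geh geh_from_gej(gej a) {
    uint32_t z2 = fmul(a.z, a.z);   /* 1S */
    geh r;
    r.x = fmul(a.x, a.z);           /* 1M */
    r.y = a.y;                      /* free */
    r.z = fmul(z2, a.z);            /* 1M */
    return r;
}
```

This is where the 8S saving above comes from: an unmixed `gej` addition costs 12M + 4S while a `geh` addition costs 12M + 0S, so replacing two consecutive bucket additions saves 2 × 4S = 8S, which more than covers the 2M + 1S conversion as long as squarings are not much cheaper than multiplications.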
The exact benchmark command was `SECP256K1_BENCH_ITERS=100000 bench_ecmult pippenger_wnaf` (or 20000 iters for 32-bit). Don't forget the `pippenger_wnaf` argument to make sure you don't benchmark Strauss' algorithm instead, at least below the threshold where we switch to Pippenger automatically. I did this on a 12th Gen Intel(R) Core(TM) i7-1260P, pinned to a P-core, and with TurboBoost disabled. See the attached spreadsheet benchmark-gcc.ods for details.

If you want to benchmark this, I think it makes sense to do four runs per setup: one for the baseline (d0f3123, which just disables low point counts in `bench_ecmult` for quicker benchmarking) and one for each of the three "step" commits mentioned above. You could just extend the spreadsheet with your results.

Also, if you have any ideas on how to improve this further, I'd be happy to hear them. I tried various micro-optimizations, but none of them turned out to be significant on my machine; in fact, most of them made the code slower in practice. In theory, this PR should make it possible to increase the window size a bit, but playing around with the window size didn't make a difference in practice either.
edit: Don't care about CI. It fails on some platforms because I forgot to mark functions `static`; this should compile locally without issues.