[WIP, Please benchmark] Use homogeneous coordinates in pippenger #1767
base: master
Conversation
theStack left a comment

> Even though the formulas are complete, infinity is special cased for performance.

Ran the benchmarks on my arm64 notebook (a Lenovo ThinkPad T14s Gen 6 with a Qualcomm Snapdragon X Elite CPU, using gcc 14.2.0) with a hacked-together build-and-benchmark script and got the following results: https://gist.github.com/theStack/897d7b50b5b8a6f288ed2b817fcca9fc
siv2r left a comment
I’ve benchmarked this pull request on my MacBook Pro: M4 Pro chip (ARM64, 12 cores: 8P + 4E), macOS 15.6.1, plugged in and no background apps.
I used this Python script to benchmark the code. It’s basically an adapted version of @theStack’s bash script, but extends it to parse the .txt benchmark files and output an .xlsx comparison of performance improvements.
In my benchmarks, all three variants are slightly slower (~0.5% on average) than master, with v2 performing best, followed by v3 and v1.
I tried to keep the benchmarks as accurate as possible. From my brief reading, on macOS with Apple Silicon you can't actually pin processes to specific P-cores the way you would with taskset on Linux. I'll read more on this and see if there's a workaround; hopefully that would make the benchmarks more robust.
This adds a new representation `geh` of group elements, namely in homogeneous (also called projective) coordinates. This is supposed to be faster for unmixed addition (i.e., where the second summand is not a `ge`) in terms of field operations, namely 12M + 0S + 25A vs. 12M + 4S + 11A for Jacobian coordinates (M = multiplication, S = squaring, A = addition/subtraction).

The addition and doubling formulas are due to Renes, Costello, and Batina 2016, Algorithms 7 and 9. The formulas are complete, i.e., they have no special cases. However, this implementation still keeps track of infinity in a dedicated boolean flag for performance reasons. Since the buckets in Pippenger's algorithm are initialized with infinity (= zero), we'll have many additions involving infinity, and going through the entire formula for each of those hurts performance (and the entire point of this PR is performance).
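For illustration, here is a sketch of the complete addition law (Renes–Costello–Batina 2016, Algorithm 7, for curves y² = x³ + b with a = 0) over a toy field F_101 with b = 7. This is not libsecp256k1 code: the `fadd`/`fmul` helpers and the tiny field are stand-ins so the block is self-contained; real code would use the `fe` type and functions from `field.h`.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative sketch, NOT libsecp256k1 code: complete addition
 * (Renes-Costello-Batina 2016, Algorithm 7) on y^2 = x^3 + 7 over F_101,
 * so b3 = 3*b = 21. */

#define P 101u
#define B3 21u

typedef struct { uint32_t x, y, z; } geh; /* homogeneous (X : Y : Z) */

static uint32_t fadd(uint32_t a, uint32_t b) { return (a + b) % P; }
static uint32_t fsub(uint32_t a, uint32_t b) { return (a + P - b) % P; } /* assumes reduced inputs */
static uint32_t fmul(uint32_t a, uint32_t b) { return (a * b) % P; }

/* One code path for all inputs: generic addition, doubling, and the
 * identity (0 : 1 : 0) -- no special cases. */
static geh geh_add(geh p, geh q) {
    uint32_t t0, t1, t2, t3, t4, x3, y3, z3;
    t0 = fmul(p.x, q.x); t1 = fmul(p.y, q.y); t2 = fmul(p.z, q.z);
    t3 = fadd(p.x, p.y); t4 = fadd(q.x, q.y); t3 = fmul(t3, t4);
    t4 = fadd(t0, t1);   t3 = fsub(t3, t4);   t4 = fadd(p.y, p.z);
    x3 = fadd(q.y, q.z); t4 = fmul(t4, x3);   x3 = fadd(t1, t2);
    t4 = fsub(t4, x3);   x3 = fadd(p.x, p.z); y3 = fadd(q.x, q.z);
    x3 = fmul(x3, y3);   y3 = fadd(t0, t2);   y3 = fsub(x3, y3);
    x3 = fadd(t0, t0);   t0 = fadd(x3, t0);   t2 = fmul(B3, t2);
    z3 = fadd(t1, t2);   t1 = fsub(t1, t2);   y3 = fmul(B3, y3);
    x3 = fmul(t4, y3);   t2 = fmul(t3, t1);   x3 = fsub(t2, x3);
    y3 = fmul(y3, t0);   t1 = fmul(t1, z3);   y3 = fadd(t1, y3);
    t0 = fmul(t0, t3);   z3 = fmul(z3, t4);   z3 = fadd(z3, t0);
    return (geh){ x3, y3, z3 };
}
```

Since the formula is complete, adding the point at infinity (0 : 1 : 0), or a point to itself, goes through the very same multiplication chain; the dedicated infinity flag in this PR exists only to skip that work in the common case of freshly initialized buckets.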
The formulas were implemented by giving GPT-5 mini screenshots of the algorithms in the paper and of `field.h`. The result was not awesome, but I could clean it up manually.

The new representation is used in Pippenger's ecmult_multi for accumulating the buckets after every window iteration. Buckets are still constructed as `gej` (because it has faster mixed addition) and only converted to `geh` before accumulation. This is still supposed to be faster even when the conversion is accounted for: the conversion costs 2M + 1S, but we then do two `geh` additions in a row, saving 8S. This PR has three different variants of how `geh` could be used; they differ in how much of the accumulation is done in `geh`, with the last variant keeping `gej` for rows of doublings.

Unfortunately, none of these turns out to be really faster in `bench_ecmult pippenger_wnaf` on my x86_64 system with gcc 15.2.1 or clang 21.1.4. The best variant (2) beats master by just 0.21%; the other variants are slower than master. :/ If I compile in 32-bit mode, all three variants beat master consistently, but only by 1.2%. This latter result gives at least some hope that the PR could pay off on some platform. I'm not even sure how much we care about 32-bit platforms; maybe we care about hardware wallets in general, but probably not when it comes to ecmult_multi. (Plus this would need real benchmarks; I didn't even run this on a native 32-bit CPU.) But we'd certainly care about ARM64, which I couldn't test on. Anyone with an ARM Mac willing to benchmark this?
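As an aside, the `gej` → `geh` conversion whose 2M + 1S cost is quoted above can be sketched like this. Jacobian coordinates encode x = X/Z², y = Y/Z³, while homogeneous coordinates encode x = X/Z, y = Y/Z; picking the new denominator Z' = Z³ makes the change of representation cheap. Again a toy field stands in for libsecp256k1's actual `fe` type, and the struct layouts are hypothetical:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative sketch, NOT libsecp256k1 code: converting Jacobian
 * coordinates (x = X/Z^2, y = Y/Z^3) to homogeneous ones (x = X/Z,
 * y = Y/Z) over a toy field F_101.  With Z' = Z^3 we get X' = X*Z and
 * Y' = Y, i.e. exactly 2M + 1S. */

#define P 101u

typedef struct { uint32_t x, y, z; } gej; /* Jacobian */
typedef struct { uint32_t x, y, z; } geh; /* homogeneous */

static uint32_t fmul(uint32_t a, uint32_t b) { return (a * b) % P; }

static geh geh_from_gej(gej a) {
    uint32_t z2 = fmul(a.z, a.z);   /* 1S */
    geh r;
    r.x = fmul(a.x, a.z);           /* 1M */
    r.y = a.y;                      /* free */
    r.z = fmul(z2, a.z);            /* 1M */
    return r;
}
```

This is where the 8S saving above comes from: an unmixed `gej` addition costs 12M + 4S while a `geh` addition costs 12M + 0S, so replacing two consecutive bucket additions saves 2 × 4S = 8S, which more than covers the 2M + 1S conversion as long as squarings are not much cheaper than multiplications.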
The exact benchmark command was `SECP256K1_BENCH_ITERS=100000 bench_ecmult pippenger_wnaf` (or 20000 iters for 32-bit). Don't forget the `pippenger_wnaf` argument to make sure you don't benchmark Strauss' algorithm instead, at least below the threshold where we switch to Pippenger automatically. I did this on a 12th Gen Intel(R) Core(TM) i7-1260P, pinned to a P-core, and with TurboBoost disabled. See the attached spreadsheet benchmark-gcc.ods for details.

If you want to benchmark this, I think it makes sense to do four runs per setup: one for the baseline (d0f3123, which just disables low point counts in `bench_ecmult` for quicker benchmarking) and one for each of the three "step" commits mentioned above. You could just extend the spreadsheet with your results.

Also, if you have any ideas on how to improve this further, I'd be happy to hear them. I tried various micro-optimizations, but none of them turned out to be significant on my machine; in fact, most of them made the code slower in practice. In theory, this PR should make it possible to increase the window size a bit, but playing around with the window size didn't make a difference in practice either.
edit: Don't care about CI. It fails on some platforms because I forgot to mark functions `static`; this should compile locally without issues.