Commit 6ba0770

Add explanatory comment and reference for quarter round intrinsic
1 parent 41817c7 commit 6ba0770

1 file changed: +25 -0 lines changed

src/hotspot/cpu/aarch64/stubGenerator_aarch64.cpp

@@ -4348,6 +4348,31 @@ class StubGenerator: public StubCodeGenerator {
  // state (int[16]) = c_rarg0
  // keystream (byte[256]) = c_rarg1
  // return - number of bytes of keystream (always 256)
+ //
+ // In this approach, we load the 512-bit start state sequentially into
+ // 4 128-bit vectors. We then make 4 4-vector copies of that starting
+ // state, with each successive set of 4 vectors having a +1 added into
+ // the first 32-bit lane of the 4th vector in that group (the counter).
+ // By doing this, we can perform the block function on 4 512-bit blocks
+ // within one run of this intrinsic.
+ // The alignment of the data across the 4-vector group is such that at
+ // the start it is already aligned for the first round of each two-round
+ // loop iteration. In other words, the corresponding lanes of each vector
+ // will contain the values needed for that quarter round operation (e.g.
+ // elements 0/4/8/12, 1/5/9/13, 2/6/10/14, etc.).
+ // In between each full round, a lane shift must occur. Within a loop
+ // iteration, between the first and second rounds, the 2nd, 3rd, and 4th
+ // vectors are rotated left 32, 64 and 96 bits, respectively. The result
+ // is effectively a diagonal orientation in columnar form. After the
+ // second full round, those registers are left-rotated again, this time
+ // 96, 64, and 32 bits - returning the vectors to their columnar organization.
+ // After all 10 iterations, the original state is added to each 4-vector
+ // working state along with the add mask, and the 4 vector groups are
+ // sequentially written to the memory dedicated for the output key stream.
+ //
+ // For a more detailed explanation, see Goll and Gueron, "Vectorization of
+ // ChaCha Stream Cipher", 2014 11th Int. Conf. on Information Technology:
+ // New Generations, Las Vegas, NV, USA, April 2014, DOI: 10.1109/ITNG.2014.33
  address generate_chacha20Block_qrpar() {
  Label L_Q_twoRounds, L_Q_cc20_const;
  // The constant data is broken into two 128-bit segments to be loaded