@@ -4348,6 +4348,31 @@ class StubGenerator: public StubCodeGenerator {
// state (int[16]) = c_rarg0
// keystream (byte[256]) = c_rarg1
// return - number of bytes of keystream (always 256)
+ //
+ // In this approach, we load the 512-bit start state sequentially into
+ // four 128-bit vectors. We then make four 4-vector copies of that starting
+ // state, with each successive set of 4 vectors having a further +1 added
+ // into the first 32-bit lane of the 4th vector in that group (the counter).
+ // By doing this, we can perform the block function on four 512-bit blocks
+ // within one run of this intrinsic.
+ // The data in each 4-vector group starts out already aligned for the
+ // first round of each two-round loop iteration. In other words, the
+ // corresponding lanes of each vector contain the values needed for that
+ // quarter round operation (elements 0/4/8/12, 1/5/9/13, 2/6/10/14, and
+ // 3/7/11/15).
+ // Between full rounds, a lane shift must occur. Within a loop
+ // iteration, between the first and second rounds, the 2nd, 3rd, and 4th
+ // vectors are rotated left 32, 64, and 96 bits, respectively. The result
+ // is effectively a diagonal orientation in columnar form. After the
+ // second full round, those registers are left-rotated again, this time
+ // by 96, 64, and 32 bits, returning the vectors to their columnar
+ // organization.
+ // After all 10 iterations, the original state is added to each 4-vector
+ // working state along with the add mask, and the 4 vector groups are
+ // sequentially written to the memory dedicated for the output key stream.
+ //
+ // For a more detailed explanation, see Goll and Gueron, "Vectorization of
+ // ChaCha Stream Cipher", 2014 11th Int. Conf. on Information Technology:
+ // New Generations, Las Vegas, NV, USA, April 2014, DOI: 10.1109/ITNG.2014.33
address generate_chacha20Block_qrpar() {
Label L_Q_twoRounds, L_Q_cc20_const;
// The constant data is broken into two 128-bit segments to be loaded
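The lane rotations described in the new comment can be modeled in plain scalar C++. The sketch below is not the stub's AArch64 code: `Row`, `State`, `rotateLanes`, `columnRound`, `doubleRound`, `chacha20Block`, and `fourBlocks` are hypothetical names chosen for illustration. It also assumes that "rotate left by 32 bits" means lane i receives the value from lane i+1 (mod 4), and that the counter is simply incremented by 0 to 3 across the four groups.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical scalar model; these names do not appear in the stub.
using Row   = std::array<uint32_t, 4>;  // one 128-bit vector: 4 x 32-bit lanes
using State = std::array<Row, 4>;       // 4 vectors: one 512-bit block state

static uint32_t rotl32(uint32_t x, int n) {
  return (x << n) | (x >> (32 - n));
}

// One quarter round applied lane-wise, so all four columns
// (elements 0/4/8/12, 1/5/9/13, 2/6/10/14, 3/7/11/15) advance together.
static void columnRound(State& s) {
  for (int i = 0; i < 4; i++) {
    uint32_t &a = s[0][i], &b = s[1][i], &c = s[2][i], &d = s[3][i];
    a += b; d ^= a; d = rotl32(d, 16);
    c += d; b ^= c; b = rotl32(b, 12);
    a += b; d ^= a; d = rotl32(d, 8);
    c += d; b ^= c; b = rotl32(b, 7);
  }
}

// Assumed meaning of "rotate left by n * 32 bits":
// lane i receives the value previously held in lane (i + n) % 4.
static Row rotateLanes(const Row& r, int n) {
  Row out{};
  for (int i = 0; i < 4; i++) out[i] = r[(i + n) % 4];
  return out;
}

// One loop iteration: a column round, then a diagonal round expressed
// as a second column round on lane-rotated vectors.
static void doubleRound(State& s) {
  columnRound(s);
  s[1] = rotateLanes(s[1], 1);  // 32 bits
  s[2] = rotateLanes(s[2], 2);  // 64 bits
  s[3] = rotateLanes(s[3], 3);  // 96 bits
  columnRound(s);               // columns now hold the diagonals
  s[1] = rotateLanes(s[1], 3);  // 96 bits: back to columnar order
  s[2] = rotateLanes(s[2], 2);  // 64 bits
  s[3] = rotateLanes(s[3], 1);  // 32 bits
}

// 10 iterations, then add the input state back into the working state.
static State chacha20Block(const State& in) {
  State w = in;
  for (int i = 0; i < 10; i++) doubleRound(w);
  for (int r = 0; r < 4; r++)
    for (int c = 0; c < 4; c++) w[r][c] += in[r][c];
  return w;
}

// Four consecutive blocks from one start state: group g uses counter + g,
// where the counter is lane 0 of the 4th vector.
static std::array<State, 4> fourBlocks(const State& start) {
  std::array<State, 4> out{};
  for (uint32_t g = 0; g < 4; g++) {
    State in = start;
    in[3][0] += g;
    out[g] = chacha20Block(in);
  }
  return out;
}
```

Because the three rotation amounts per vector sum to a full 128 bits across the two shifts, every vector is back in columnar order at the end of each iteration, which is why the loop body needs no fix-up before the final state addition.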