\documentclass[nolof,digital]{fithesis3}
\thesissetup{faculty=fi}
\thesissetup{author=Bc. Vojtěch Polášek,id=410266, departmentEn = Department of Computer Systems and Communications, programmeEn = Informatics, fieldEn = Information Technology Security, assignment = {}}
\thesissetup{type= mgr}
\thesissetup{title=Argon2 security margin for disk encryption passwords}
\thesissetup{keywords = {security, password hashing, key derivation, disk encryption, password cracking, LUKS, LUKS2}}
\usepackage[english]{babel}
\usepackage{alltt}
\usepackage{listings}
\usepackage{csquotes}
\usepackage[hidelinks]{hyperref}
\usepackage[style=ieee,sorting=nty,block=ragged]{biblatex}
\renewbibmacro*{bbx:savehash}{} %disabling dashing
\addbibresource{bibliography.bib}
\usepackage{tabularx}
\usepackage{placeins}
\usepackage{tabu}
%\usepackage[chars=60, lines=30]{stdpage}
\usepackage{algorithm2e}
\SetAlCapSkip{3mm}
\thesislong{abstract}{
Passwords are a popular authentication method in the field of information technology. They were designed to be remembered by humans, which sometimes makes them unsuitable for direct use in encryption software. Therefore, key derivation functions exist which transform a password into a more suitable cryptographic key.
This thesis deals with such functions, in particular with their usage in disk encryption. The most popular function, PBKDF2, is described together with its vulnerabilities and attacks. Memory-hard functions have started being used as a mitigation of time-memory trade-off attacks; one such function is Argon2, the winner of the Password Hashing Competition. The thesis describes Argon2 in detail.
The practical part of the thesis deals with simulating an attack on a disk encrypted with the LUKS2 encryption scheme using Argon2 as its PBKDF. It includes collecting Argon2 parameters benchmarked by the Cryptsetup software. The attack is carried out on CPUs and GPUs using high-performance hardware provided by MetaCentrum VO.
The last part of the thesis introduces a price model for an attacker using either physical hardware or on-demand allocation of computing resources in the cloud. This model is then applied to real-world prices and to the data obtained during the attack simulation.
The thesis shows that it can take thousands of machines and hundreds of millions of dollars to crack an eight-character LUKS2 password within ten years.
}
\thesissetup{advisor={Ing. Milan Brož, Ph.D.}}
\thesislong{thanks}{
My big thanks go to my advisor Milan Brož for very inspirational and positive supervision and for help with gaining access to hardware needed for benchmarking.
Furthermore, I would like to thank Ondrej Mosnáček for his consultations concerning Argon2 and for the software used in this thesis.
My thanks also go to Lukáš Másilko for his help with polishing the graphical appearance of the thesis.
Last but not least, I would like to thank my family and my girlfriend Justyna for supporting me during the whole process.
Computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures". Computational resources were supplied by the Ministry of Education, Youth and Sports of the Czech Republic under the Projects CESNET (Project No. LM2015042) and CERIT-Scientific Cloud (Project No. LM2015085) provided within the program Projects of Large Research, Development and Innovations Infrastructures.
}
\thesisload
\setcounter{tocdepth}{2}
\begin{document}
\chapter{Introduction}
This diploma thesis deals with password-based key derivation functions and their application in disk encryption software. At the very beginning of the computer era there were no passwords. They appeared for the first time in the 1960s, introduced by Fernando Corbató \parencite{ctss}. They have been evolving ever since, trying to keep pace with developing technology and the new ways in which it is used.
Passwords are becoming more and more important. They protect our secrets, our money, our communication. All these things are valuable not only to us but also to other people, and some of them would like to gain access to them for malicious purposes. There are many ways to gain access to a secret passphrase, including guessing, social engineering, exploiting application vulnerabilities, etc. This thesis deals with the first of these methods and tries to look at it from the attacker's perspective; more precisely, from the perspective of the attacker's financial resources. It tries to answer the question: how much would it cost to guess a password opening an encrypted disk volume? The thesis shows that guessing a password for a LUKS encrypted volume, even one created on low-performance hardware, currently costs more than a hundred million dollars and involves thousands of purchased or rented machines.
The second chapter introduces basic terminology and the rationale for the usage of PBKDFs in the scope of disk encryption. It also summarizes the history of and possible attacks against PBKDFs. The third chapter describes in detail the PBKDF2 and Argon2 functions. Their analysis shows the rationale for using memory-hard functions. The Argon2 function is described in greater detail because it is fundamental for the rest of the thesis. At the end of the chapter, several other password hashing functions and their features are briefly mentioned.
The fourth chapter analyses the LUKS2 disk encryption scheme and its front-end called Cryptsetup. First, the inner workings of LUKS2 are introduced with a special focus on the usage of PBKDFs in various processes, and the rationale for benchmarking PBKDF parameters is explained. This benchmarking function is then used to collect real-world data for subsequent analysis. The last part is dedicated to an experimental simulation of an attacker trying to break LUKS2 disk encryption by guessing the passphrase, a process which inevitably requires computing PBKDFs. The methodology, tools, and highlights from the resulting data are shown.
The fifth chapter summarizes the options of a potential attacker and introduces a price model suited for estimating the final price of an attack against a LUKS2 encrypted volume. The model is applied to real-world data based on the results from the previous chapter. The options of using CPUs, GPUs, and cloud-based on-demand allocated resources are taken into account.
The sixth chapter provides conclusions and several paths for further research in this area. The thesis closes with a description of the attached archive containing source code, scripts, and results; a glossary is also provided.
\chapter[Password hashing and key derivation functions]{Password hashing and key derivation\\functions}
\section{Definitions}
\label{definitions}
This thesis deals with various cryptographic terms including password hashing and key derivation. This section briefly explains some of them.
A \emph{hashing function} is a function which receives an input of arbitrary length and produces an output of a specified, shorter length, effectively compressing the input. These functions are used in many areas such as efficient data retrieval \parencite{itmc14}. \emph{Cryptographic hashing functions} are a subset of hashing functions and have to meet certain additional properties, namely preimage resistance, second preimage resistance, and collision resistance. Only cryptographic hash functions are considered in this thesis.
\emph{Password hashing} is a process in which a password is supplied to a hash function. This is the de facto standard method of storing passwords in operating systems and applications. If an attacker gets hold of such password hashes, it should be practically infeasible to derive the original passwords from them. Hashing functions are therefore also used in the process of password verification, during which the entered password is hashed and compared with the stored hash.
This thesis is not going to deal with \emph{password hashing} itself. However, many of the \emph{password hashing functions} described below meet the desired properties for being used in the key derivation process.
\emph{Key derivation functions} are based on \emph{hash functions}. Their basic purpose is to take an input and produce an output which can be used as a cryptographic key. The input is usually a password or other material, such as a biometric sample converted into binary form. These materials could, of course, be used as cryptographic keys on their own, but they often lack the properties of a good cryptographic key, such as sufficient entropy or length.
A \emph{cryptographic salt} is often used during the process of key derivation. The purpose of the salt is to prevent attacks which use precomputed tables, such as rainbow tables \parencite{rainbowtables}. The salt introduces another factor which influences the derived key, so the key is no longer dependent only on the passphrase. For example, suppose that a 32-bit integer is used as a salt; then there are \(2^{32}\) possible keys derived from the same passphrase, which makes precomputation attacks effectively infeasible. The salt is usually stored unobfuscated together with the hashed material.
A \emph{memory-hard algorithm} is an algorithm which asymptotically uses almost as many memory locations as the number of operations it performs. A \emph{sequential memory-hard function} is a function which can be computed by a sequential memory-hard algorithm and for which even the fastest parallel algorithm cannot asymptotically achieve a significantly lower cost.
\section{Why do we need PBKDFs?}
\label{whypbkdfs}
Today, as more and more private information is stored on various kinds of media and transferred over the Internet, it is becoming crucial to protect it from being accessed or changed by unauthorized actors. Although there are several interesting authentication options such as biometrics, passwords or passphrases are still the most common method.
Considering passwords, we face a problematic situation. Organizations and services provide guidelines or requirements which should help a user choose a strong password \parencites{nistpasswords}{sanspasswordguidelines}. Important parameters are password length (in characters), complexity, uniqueness, and others. I define complexity as the amount and diversity of characters used (letters, numbers, symbols, emojis\dots), and by uniqueness I mean that the password does not contain easily guessable or predictable sequences. See the mentioned policies for examples.
In some situations passwords do not provide good material to be used directly as a cryptographic key, for surprisingly many reasons. They are usually not sufficiently long. Because they are composed of printable characters, they do not meet the requirement of being uniformly distributed. If they are meant to be remembered, they will probably contain dictionary words, which lowers their entropy even more. See section~5.6.4 in \parencite{itmc14} for a short but interesting analysis.
PBKDF stands for password-based key derivation function. The goal of these functions is to derive one or more cryptographic keys from a password or a passphrase. The resulting key should be pseudorandom and sufficiently long to make brute-force guessing as time-consuming as possible. As stated above, such functions are based on cryptographic hashing functions.
Lately, PBKDFs have taken on another specific task. Due to the availability of GPUs, FPGAs, and ASICs, there are new possibilities for running functions in parallel computing environments; see chapter~4 in \parencite{mosnacek}. This increases the effectiveness of brute-force attacks. PBKDFs try to defend against such attacks by using memory-hard algorithms to slow down potential attackers and to make running the function in parallel extremely expensive or inefficient. See section~\ref{sec:attacks} for more details.
Examples of PBKDFs include Argon2, PBKDF2, Scrypt, Yescrypt, and more. See chapter~\ref{chap:pbkdfs} for a comparison of several functions.
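To make the interface of a PBKDF concrete, the following minimal Python sketch derives a 256-bit key from a password using PBKDF2 from the standard library; the parameter values are arbitrary illustrative choices, not recommendations.
\begin{lstlisting}[language=Python]
import hashlib
import os

password = b"correct horse battery staple"
salt = os.urandom(32)  # randomly generated cryptographic salt
# Derive a 32-byte (256-bit) key using HMAC-SHA-256 as the PRF
# and 100,000 iterations of the underlying function.
key = hashlib.pbkdf2_hmac("sha256", password, salt, 100000, 32)
\end{lstlisting}
Regardless of the password itself, the output is a fixed-length, pseudorandom-looking byte string suitable for use as an encryption key.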
\section{PBKDFs and disk encryption}
Disk encryption is a very good use case for PBKDFs. The encryption algorithms used require cryptographic keys of a certain length \parencite{veracrypt}. It is also important to consider that it is usually not desirable to change the encryption key often, because re-encryption of the whole disk takes a considerable amount of time. Moreover, if an attacker gains permanent access to such an encrypted disk, the key cannot be changed at all and there is an extensive period of time available for cracking the passphrase.
By looking at \parencite{pbkdf2usage} we can see that PBKDFs are used in many types of disk encryption software. Note that this list mentions only PBKDF2, as it has been the most widely used PBKDF until recently. PBKDF2 is, for example, used in LUKS version 1 \parencite{luks1}, the FileVault software used by macOS \parencite{filevault}, the CipherShed disk encryption software \parencite{ciphershed}, the VeraCrypt disk encryption software \parencite{veracrypt}, and more.
In 2013, a new open competition called the Password Hashing Competition was initiated. Its goal was to find a new password hashing function which would resist the new attacks devised against such functions \parencite{phc}. The winner was a function named Argon2, which is already used, for example, in LUKS version 2 \parencite{luks2}.
\section{Attacks on PBKDFs}
\label{sec:attacks}
The history of password cracking is as old as computer passwords themselves. People crack passwords for two main reasons: they either want to recover a forgotten password, or they want to recover someone else's password in order to use it later for authentication. There are two main techniques for password cracking: brute-force attacks and dictionary attacks. The idea behind the first method is to guess the password by trying all passwords of a given length composed of all combinations of given characters. Dictionary attacks exploit the fact that people often use passwords containing words found in language dictionaries; if an attacker tries passwords containing dictionary words or their permutations, they might have a quite good chance of success.
At the beginning, passwords were stored on computer systems in plain text, protected only by the fact that users should not be able to read passwords of other users. It was soon discovered that plain text passwords could be revealed, for example, by badly designed software permissions. This was shown in the case of Allan Scherr, who misused the capabilities of a printing program to print out the whole password file; see page~37 in \parencite{ctss}. Since then, passwords have been hashed.
This marks the beginning of the never-ending fight between authors of hash functions and people trying to crack passwords hashed by them. The first hash functions were very simple and definitely not cryptographically secure, such as the hashing mechanism used in Multics, which squared the numerical form of each password and applied a bit mask with an AND operation \parencite{multicssecurity}. This increased the number of required guesses, but only negligibly compared to modern functions.
The first cryptographic hash came with Robert Morris and his crypt function. The crypt used up to the 6th edition of Unix mimicked the M-209 cipher machine from World War II. It proved to be not very secure, because the algorithm could be recoded in a way which allowed testing a password in a very short period of time (1.25 milliseconds per password) \parencite{pshistory}. A later version, used since the 7th edition of Unix, employed the DES block cipher, which was at that time very slow when implemented in software. The password entry program also introduced two new concepts: automated proactive password strength checking and the \emph{cryptographic salt} (a 12-bit random number at that time). For comparison, the current LUKS2 specification uses a 32-byte salt.
During the 1980s, password cracking contests were organized and hash functions were further improved. Some of them were made deliberately slow to slow down a potential attacker; if this measure was not effective enough, more iterations of the hashing function could be used. During the 1990s, many password cracking programs appeared, including John the Ripper, Crack, L0phtCrack, and others \parencite{crackinghistory}. Most of these programs tried to improve the cracking speed by optimizing the underlying algorithms and later by using CPU parallelism.
The concept of \emph{KDFs} started being studied in the late 1990s. RFC~2898, defining PBKDF2, was released in 2000, and the function started being used primarily for key derivation in many applications such as WinZip, OpenDocument, TrueCrypt, or Android. However, it was also used for actual password hashing, for example in Mac OS X 10.8. Note that PBKDF1 exists, but it is not recommended because of its limited key length (20 bytes at best) and is provided solely for backward compatibility. PBKDF2 introduced a configurable pseudorandom function, number of iterations, and derived key length. These parameters allow flexibility when choosing a trade-off between security and user experience.
However, in 2007 the first password cracker using the parallel computing capabilities of GPUs appeared \parencite{elcomsoftgpu}. Other password crackers followed: Whitepixel in 2010, oclHashcat in 2012, and John the Ripper in 2012. Until 2012, such software could recover primarily MD5 and NTLM hashes, but oclHashcat introduced recovery of many other hashes.
As far as PBKDF2 is concerned, its resistance to password cracking using GPUs or even ASICs/FPGAs is currently not ideal, as discussed in section~7 of \parencite{mosnacek}. PBKDF2 does not offer support for parallelism when used as suggested; at the same time, its low memory requirements and GPU-friendly algorithm (see algorithm~\ref{pbkdf2alg}) give an advantage to the attacker. As shown in \parencite{mosnacek}, it was possible to improve the cracking speed of LUKS passwords forty times by rewriting the function for GPUs. Moreover, proper optimization of the underlying algorithms can greatly increase PBKDF2 performance even without GPUs, as was shown by an analysis of the closed-source oclHashcat in \parencite{pbkdf2accel}.
As shown in \parencite{pbkdf2weakness}, there exist other attacks not connected with GPUs. The paper shows that an attacker can save 50\,\% of PBKDF2 CPU operations compared to an implementation which does not follow the suggested performance improvements described in RFC~2104 \parencite{rfc2104} and NIST FIPS PUB 198 \parencite{fipspub198}. In this case it is possible to precompute the first message block of the underlying keyed hash function (used as the PRF) and replace it with the resulting constant in subsequent operations; see lines 11--13 in algorithm~\ref{pbkdf2alg}. The HMAC function itself is not described in this thesis; see section~4.1 of \parencite{pbkdf2weakness}.
Section~4.2 of \parencite{pbkdf2weakness} shows that an attacker can omit a considerable amount of XOR operations when SHA-1 is used as the pseudorandom function within PBKDF2, because this operation is sometimes performed on two blocks containing only zeroes. Additionally, more XOR operations can be omitted because the padding characters are constant and some XOR operations then simply zero themselves out. Finally, section~4.3 shows that it is possible to precompute the word expansion part of the second message block of the keyed hash function; the block is password independent and can thus be precomputed. However, this saves only a negligible amount of time compared to the attacks described above.
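To illustrate the first of these optimizations, the following Python sketch computes HMAC-SHA-1 while precomputing the hash states after the key-derived ipad and opad blocks, so that the first message block of the keyed hash is processed only once per password. This is merely a schematic illustration of the idea, not an attack implementation.
\begin{lstlisting}[language=Python]
import hashlib

def hmac_sha1_many(key: bytes, messages):
    # Pad (or hash) the key to the 64-byte SHA-1 block size.
    if len(key) > 64:
        key = hashlib.sha1(key).digest()
    key = key.ljust(64, b"\x00")
    # Precompute the states after absorbing the first, key-derived block.
    inner = hashlib.sha1(bytes(k ^ 0x36 for k in key))  # ipad block
    outer = hashlib.sha1(bytes(k ^ 0x5C for k in key))  # opad block
    for msg in messages:
        h = inner.copy()            # reuse the precomputed inner state
        h.update(msg)
        o = outer.copy()            # reuse the precomputed outer state
        o.update(h.digest())
        yield o.digest()
\end{lstlisting}
Because PBKDF2 keeps the password (and therefore the HMAC key) fixed across all iterations and all blocks, the two precomputed states can be reused for every PRF invocation, which is where the saving comes from.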
In 2009, Colin Percival suggested that, to defend against the usage of parallel computing, PBKDFs should fulfil the requirements of memory-hard functions \parencite{memoryhard}. As a reaction to the previously mentioned problems of PBKDF2, the Password Hashing Competition was held from 2013 to 2015 to select a new function for password hashing. The winner, Argon2, described in section~\ref{argon2}, is indeed a memory-hard function. Of course, that does not mean that memory-hard functions are immune to attacks.
In a paper from 2016, the authors show that there still exists an attack which can decrease the computational complexity of the Argon2i version 1.0 function. It was shown that, at that time, at least ten passes of Argon2i version 1.0 were needed to mitigate this attack, while the IRTF proposal suggested only six passes for \enquote{paranoid} situations. Fortunately, the authors of Argon2 reacted to these attacks and improved the function, so the attack is no longer effective against Argon2i, except when running with only a single pass. Note that the current version of Argon2 is 1.3. Details and rationale can be found in section~5.2 of \parencite{argon2}.
\chapter{State of the art PBKDFs}
\label{chap:pbkdfs}
\section{PBKDF2}
\label{sec:pbkdf2}
PBKDF2 is a password-based key derivation function defined in RFC~8018 \parencite{rfc8018}. This RFC thoroughly describes two use cases of PBKDF2: a password-based encryption scheme and a password-based message authentication scheme. Other mentioned use cases include password checking and derivation of multiple keys from one password. As shown in subsection~\ref{luks1}, LUKS version 1 uses PBKDF2 for password checking and for derivation of the key used to encrypt or decrypt the master key.
The function requires four input parameters: a passphrase, a cryptographic salt, an iteration count, and the length of the key to be derived. Moreover, the function requires a pseudorandom function (PRF) which is used in the process of key derivation.
The term \emph{passphrase} in this context means any data which serve as the source for the subsequent key derivation process; usually it is a password entered by the user. The \emph{cryptographic salt} is represented by a randomly generated number.
The purpose of the \emph{iteration count} parameter is to defend against brute-force and dictionary attacks performed on PBKDF2. The iteration count prolongs the time needed to derive a single key. Technically, the iteration count signifies the number of successive runs of the chosen PRF for every block of the derived key. In general, the \emph{iteration count} should be chosen as large as possible, taking into account that the processing time should still be acceptable for the end user \parencite{nistpbkdf2}. According to the cited document, the minimum iteration count should be 1,000 and for critical security systems a count of 10,000,000 iterations may be appropriate.
The function is described by the following algorithm; a verbal description is also provided. The following abbreviations and conventions are used in the algorithm and the description:
\begin{description}
\item[P] -- an octet string representing a passphrase
\item[S] -- an octet string representing a cryptographic salt
\item[C] -- a positive integer representing iteration count
\item[dkLen] -- a positive integer representing length of the derived key counted in octets
\item[DK] -- an octet string representing the derived key
\item[PRF] -- a pseudorandom function
\item[hLen] -- length of output of chosen pseudorandom function counted in octets
\item[CEIL(x)] -- the ceiling function returning the smallest integer which is greater than or equal to x
\item[F] -- a helper function for better description
\item[||] -- concatenation of strings
\item[INT(x)] -- a four-octet big-endian encoding of the integer x
\end{description}
\begin{algorithm}
\DontPrintSemicolon
\LinesNumbered
\SetKwInOut{Input}{input}
\SetKwInOut{Output}{output}
\SetKwFunction{KwCeil}{CEIL}
\SetKwFunction{KwF}{F}
\SetKwFunction{KwPrf}{PRF}
\SetKwProg{Fn}{Function}{}{}
\Input{P, S, C, dkLen}
\Output{DK}
\If{\(dkLen > (2^{32} - 1) \times hLen\)}{\Return{Derived key too long}}
\(l \leftarrow \) \KwCeil{\(dkLen/hLen\)} \;
\tcc{l is the number of hLen-octet blocks in the derived key}
\(r \leftarrow dkLen - (l - 1) \times hLen \) \;
\tcc{r is the number of octets in the last block}
\For{\(i \leftarrow 1\) \KwTo \(l\) }{
\(T_i \leftarrow \) \KwF{P, S, C, i} \;
}
\Return{\( T_1 || T_2 || \dots || T_{l-1} || T_l [0 \dots (r-1)]\)} \;
\Fn{F (P, S, C, i)}{
\( U_1 \leftarrow \) \KwPrf{P, S || INT(i)} \;
\For{\( j \leftarrow 2 \) \KwTo \( C \)}{
\( U_j \leftarrow \) \KwPrf{P, \( U_{j-1} \)} \;
}
\Return{\(U_1 \oplus U_2 \oplus \dots \oplus U_C \)} \;
}
\caption{PBKDF2 function algorithm}
\label{pbkdf2alg}
\end{algorithm}
At the beginning, the algorithm checks whether the desired length of the key exceeds \((2^{32} - 1) \times hLen\) octets; if it does, it exits immediately. Then it processes the input and creates the output key in blocks. Every block has a length of hLen octets, except for the last one, which can be shorter.
The pseudocode defines a helper function \emph{F} which is applied to every block. The function performs \emph{C} iterations of the underlying pseudorandom function \emph{PRF}. The \emph{PRF} takes the passphrase as its first parameter and the result of the previous iteration as its second parameter; the only exception is the first iteration, where the second parameter is the concatenation of the salt and the binary representation of the block index \emph{i}. The results of all iterations are XORed together and returned as one block of the derived key. Finally, all blocks are concatenated and returned as the derived key.
Notice that the computation of the function F can be rewritten to be efficiently evaluated in parallel computing environments such as GPUs; see section~4.1 of \parencite{mosnacek} for more details.
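The following Python sketch is a direct transcription of the helper function F from algorithm~\ref{pbkdf2alg}, assuming HMAC-SHA-256 as the PRF; it is meant as an illustration of the structure, not as an optimized implementation.
\begin{lstlisting}[language=Python]
import hashlib
import hmac

def F(P: bytes, S: bytes, C: int, i: int) -> bytes:
    """Compute one hLen-octet block T_i of the PBKDF2 output."""
    prf = lambda key, msg: hmac.new(key, msg, hashlib.sha256).digest()
    u = prf(P, S + i.to_bytes(4, "big"))        # U_1 = PRF(P, S || INT(i))
    t = u
    for _ in range(C - 1):                      # U_2, ..., U_C
        u = prf(P, u)
        t = bytes(a ^ b for a, b in zip(t, u))  # running XOR of all U_j
    return t
\end{lstlisting}
Every password guess requires the full chain of C sequential PRF calls, but different guesses are completely independent of each other, which is why the workload maps so well onto GPUs.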
\section{Argon2}
\label{argon2}
As mentioned in the brief history of attacks on password hashes in section~\ref{sec:attacks}, the Argon2 function is the winner of the Password Hashing Competition. Argon2 is a hash function belonging to the class of memory-hard functions as defined in \parencite{memoryhard}. The current version of the function is 1.3 and the latest IETF draft is \parencite{argon2draft}.
The function quickly fills up a given amount of memory and performs a sequence of computations over the values stored in this memory. Argon2 comes in three versions, which differ in the way the data in the memory matrix (described further below) are processed. See subsection~\ref{argon2versions} for a detailed description.
Argon2 is used as the default PBKDF in LUKS version 2. Note that, according to \parencite{cryptsetupmanual}, the default PBKDF can be configured at compilation time.
The function is optimized for the x86 architecture, exploiting improvements in the handling of cache and memory access in recent Intel and AMD processors. To be exact, Argon2 can currently utilize the SSE2, SSSE3, XOP, AVX2, and AVX-512F CPU instruction set extensions; XOP is specific to AMD processors, while the remaining extensions originate from Intel.
The function can be implemented in specialized hardware, but the previously mentioned design makes such an implementation potentially very slow and expensive; even specialized ASICs should not gain a significant benefit, even if they employ large areas of memory. However, no implementations of any Argon2 mode in specialized hardware are known so far.
\subsection{Operation}
The function expects the following mandatory input parameters.
\begin{description}
\item[P] -- the message of any length from 0 to \(2^{32} - 1\) bytes. In case of PBKDF this is the passphrase.
\item[S] -- nonce with length from 8 to \(2^{32} - 1\) bytes. In case of PBKDF this is the \emph{cryptographic salt}.
\end{description}
The function also accepts secondary inputs, which do not need to be supplied; a brief usage example follows the list.
\begin{description}
\item[p] -- degree of parallelism as an integer with a value from 1 to \(2^{24} - 1\). The value determines the number of parallel computational chains to be run. The chains are not fully independent; synchronization occurs.
\item[\(\tau\)] -- tag length in bytes in the range from 4 to \(2^{32} - 1\). This determines the output length of the function; in the case of a PBKDF, the key length.
\item[m] -- memory size as an integer in the range from 8\emph{p} to \(2^{32} - 1\). The integer determines the amount of memory in kibibytes which should be used for the computation of the function.
\item[t] -- number of iterations as an integer in the range from 1 to \(2^{32} - 1\). This parameter is used to tune the running time of the function by specifying the number of iterations.
\item[v] -- one byte version number, currently hard-coded to 0x13.
\item[K] -- a secret value (key) with length from 0 to \(2^{32} - 1\) bytes. By default, no key is assumed.
\item[X] -- associated data with length from 0 to \(2^{32} - 1\) bytes.
\item[y] -- type (mode) of Argon2 to be used. 0 for Argon2d, 1 for Argon2i, 2 for Argon2id.
\end{description}
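To illustrate how these parameters are supplied in practice, the following sketch uses the argon2-cffi Python bindings (an assumption made for the example; the reference implementation exposes an equivalent C API) to derive a 256-bit key. The parameter values are arbitrary illustrative choices.
\begin{lstlisting}[language=Python]
from argon2.low_level import hash_secret_raw, Type

key = hash_secret_raw(
    secret=b"correct horse battery staple",  # P, the passphrase
    salt=b"0123456789abcdef",                # S, the nonce (salt)
    time_cost=4,              # t, number of iterations (passes)
    memory_cost=1024 * 1024,  # m, memory in KiB (here 1 GiB)
    parallelism=4,            # p, number of parallel lanes
    hash_len=32,              # tau, output (key) length in bytes
    type=Type.ID,             # y, Argon2 mode (here Argon2id)
)
\end{lstlisting}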
Argon2 makes use of a permutation function \emph{P} which is based on the Blake2b hash function \parencite{blake2}. The function copies the Blake2b design but additionally uses 64-bit multiplications. This particular modification makes the function more complicated to implement and optimize in ASICs, while the running speed on x86 processors should degrade only negligibly. See section~3.6 of \parencite{argon2draft} for a detailed explanation.
Another internal function of Argon2 is the compression function \emph{G}, which internally uses the previously mentioned function \emph{P}. \emph{G} accepts two 1,024-byte inputs and produces one 1,024-byte output. Let \emph{X} and \emph{Y} be the inputs. First, the function XORs them:
\(R = X \oplus Y \)
R is treated as an \(8 \times 8\) matrix of 16-byte registers. The function \emph{P} is first applied to every row of the matrix and then to every column; the intermediate result is denoted \emph{Z}. Finally, the output is computed as \(Z \oplus R\). For a more detailed description see section~3.5 of \parencite{argon2draft}.
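The data flow of \emph{G} can be sketched in Python as follows. This is only a structural illustration: the permutation \emph{P} is left as an abstract placeholder (the real \emph{P} is the Blake2b-based permutation mentioned above), and a row-major layout of the sixty-four 16-byte registers is assumed.
\begin{lstlisting}[language=Python]
def G(X: bytes, Y: bytes, P=lambda block: block) -> bytes:
    """Structural sketch of the Argon2 compression function."""
    assert len(X) == len(Y) == 1024
    R = bytes(a ^ b for a, b in zip(X, Y))             # R = X xor Y
    # View R as an 8x8 matrix of 16-byte registers (row-major layout).
    regs = [[R[128 * i + 16 * j:128 * i + 16 * (j + 1)]
             for j in range(8)] for i in range(8)]
    for i in range(8):                                 # apply P to every row
        row = P(b"".join(regs[i]))
        regs[i] = [row[16 * j:16 * (j + 1)] for j in range(8)]
    for j in range(8):                                 # then to every column
        col = P(b"".join(regs[i][j] for i in range(8)))
        for i in range(8):
            regs[i][j] = col[16 * i:16 * (i + 1)]
    Z = b"".join(regs[i][j] for i in range(8) for j in range(8))
    return bytes(z ^ r for z, r in zip(Z, R))          # output = Z xor R
\end{lstlisting}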
Argon2 makes use of two more hash functions: the function \(H^X\), where \(X\) denotes the output length in bytes, which is again based on Blake2b; and the variable-length hash function \(H'^X\), which is based on \(H^X\) and defined in section~3.3 of \parencite{argon2draft}.
The Argon2 operation can be described as follows. The emphasized variables are Argon2 input parameters; primary and secondary parameters are not distinguished. The steps correspond to algorithm~\ref{argon2alg}.
\begin{enumerate}
\item Initialize the block \(H_0\).
\item Allocate the memory according to the \emph{m} and \emph{p} parameters. The actual amount of allocated memory, counted in 1,024-byte blocks, is denoted \(m'\). Note that the memory is treated as a matrix \(B[i][j]\) with \emph{p} rows and \(q = m' / p\) columns.
\item Compute \(B[i][0]\) for \(0 \leq i < p\).
\item Compute \(B[i][1]\) for \(0 \leq i < p\).
\item Compute \(B[i][j]\) for \(0 \leq i < p\) and \(2 \leq j < q\). This step is different for every Argon2 mode.
\item If \(t > 1\), repeat step 5 with a slight change for every additional pass.
\item The final block C is computed.
\item The output tag is computed.
\end{enumerate}
\subsection{Algorithm}
This subsection describes Argon2 through pseudocode. In addition to the input parameters mentioned earlier, the following notation is used:
\begin{description}
\item[\(||\)] -- concatenation of two strings
\item[floor(x)] -- function returns the largest integer which is not larger than x
\item[ceil(x)] -- function returns the smallest integer which is not smaller than x
\item[LE32(x)] -- converts 32 bit long integer x to byte string in little endian
\item[length(s)] -- returns the length of the string s in bytes
\item[allocate(x)] -- allocates x bytes of memory
\end{description}
\begin{algorithm}
\DontPrintSemicolon
\LinesNumbered
\SetKwInOut{Input}{input}
\SetKwInOut{Output}{output}
\SetKwFunction{KwCeil}{ceil}
\SetKwFunction{KwFloor}{floor}
\SetKwFunction{KwHapostrophe}{\(H'\)}
\SetKwFunction{KwHapostropheonetwoeight}{\(H'^{1024}\)}
\SetKwFunction{KwG}{G}
\SetKwFunction{KwHsixfour}{\(H^{64}\)}
\SetKwFunction{KwLethreetwo}{LE32}
\SetKwFunction{KwLength}{length}
\SetKwFunction{KwAllocate}{allocate}
\SetKwFunction{KwHapostrophetau}{\(H'^\tau\)}
\Input{P, S, p, \(\tau\), m, t, v, K, X, y}
\Output{TAG}
\(H_0 \leftarrow \) \KwHsixfour{\KwLethreetwo{p} || \KwLethreetwo{\(\tau\)} || \KwLethreetwo{m} || \KwLethreetwo{t} || \KwLethreetwo{v} || \KwLethreetwo{y} || \KwLethreetwo{\KwLength{P}} || P || \KwLethreetwo{\KwLength{S}} || S || \KwLethreetwo{\KwLength{K}} || K || \KwLethreetwo{\KwLength{X}} || X } \;
\(m' \leftarrow 4 \times p \times \) \KwFloor{\(m / 4p\)} \;
\(B \leftarrow \) \KwAllocate{\(m'\)} \;
\For{\( i \leftarrow 0 \) \KwTo \(p - 1 \)} {
\(B[i][0] \leftarrow \) \KwHapostropheonetwoeight{\(H_0 \) || \KwLethreetwo{0} || \KwLethreetwo{i}} \;
}
\For{\(i \leftarrow 0 \) \KwTo \(p - 1\)} {
\(B[i][1] \leftarrow \) \KwHapostropheonetwoeight{\(H_0 \) || \KwLethreetwo{1} || \KwLethreetwo{i}} \;
}
\For{\(i \leftarrow 0 \) \KwTo \(p - 1\)} {
\For{\(j \leftarrow 2\) \KwTo \(q - 1\)} {
\(B[i][j] \leftarrow \) \KwG{\(B[i][j-1], B[l][z]\)} \;
\tcc{indexes l and z are computed differently for every Argon2 version, see subsection~\ref{argon2versions}}
}
}
\If{\(t > 1\)} {
\For{\(i \leftarrow 0 \) \KwTo \(p - 1\)} {
\(B[i][0] \leftarrow \) \KwG{\(B[i][q-1], B[l][z]\)} \;
\For{\(j \leftarrow 1\) \KwTo \(q - 1\)} {
\(B[i][j] \leftarrow \) \KwG{\(B[i][j-1], B[l][z]\)} \;
}
}
}
\(C \leftarrow B[0][q-1] \oplus B[1][q-1] \oplus \dots \oplus B[p-1][q-1]\) \;
\Return{\KwHapostrophetau{C}}
\caption{Argon2 function algorithm}
\label{argon2alg}
\end{algorithm}
\subsection{Differences in Argon2 versions}
\label{argon2versions}
Argon2 comes in three versions:
\begin{description}
\item[Argon2d] -- the data-dependent variant of Argon2. This version uses previously computed data while performing computations and memory accesses. It is recommended for cryptocurrencies and other proof-of-work applications, as well as for hashing on back-end servers. In general, this version is suited for environments where no side-channel timing attacks are expected. It provides better protection against brute-force attacks using specialized hardware and against trade-off attacks.
\item[Argon2i] -- the data-independent variant. This version does not rely on previously computed data while performing calculations in memory. It is recommended for key derivation and password hashing, where side-channel timing attacks are more probable because an adversary can have physical access to the machine. The mode is slower because it performs multiple passes over memory to defend against trade-off attacks.
\item[Argon2id] -- a combination of the previous versions. In this mode the function behaves as Argon2i during the computation of the first half of the first iteration over the memory; during the remaining operation, the Argon2d mode is used. The mode combines the benefits of both: mitigation of side-channel timing attacks and resistance to brute-force attacks.
\end{description}
The difference lies in the computation of the reference block indexes in algorithm~\ref{argon2alg}: the indexes \emph{l} and \emph{z} are computed differently in each Argon2 version. The allocated memory is represented as a matrix \(B[i][j]\) with \emph{p} rows and \(q = m' / p\) columns. Before the indexes \emph{l} and \emph{z} can be computed, the values \(J_1\) and \(J_2\) have to be calculated, and this is where the Argon2 versions differ. After these values are calculated, further processing is needed, but it is the same for all versions.
Rows of the memory matrix are called lanes; there are \emph{p} lanes, according to the parameter \emph{p}, which signifies the degree of parallelism. Moreover, the memory is divided into 4 vertical slices; the number of slices is denoted \emph{S}. Every intersection of a lane and a slice is called a segment. Segments belonging to the same slice are computed in parallel and must not reference each other's blocks.
Argon2d computes \(J_1\) and \(J_2\) based on data stored in memory. Therefore, it is data-dependent.
\(J_1 \leftarrow int32(extract(B[i][j-1], 1))\)
\(J_2 \leftarrow int32(extract(B[i][j-1], 2))\)
Here, the function int32(s) converts the 32-bit little-endian string s into a non-negative integer, and the function extract(s, i) extracts the i-th group of 32 bits from the bit string s (so extract(s, 1) yields the first 32 bits).
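In Python-like terms, and assuming that \(J_1\) and \(J_2\) are taken from the first two 32-bit little-endian words of the previous block (as in the Argon2 specification), the extraction can be sketched as follows:
\begin{lstlisting}[language=Python]
def argon2d_j1_j2(prev_block: bytes):
    """prev_block corresponds to B[i][j-1], a 1,024-byte memory block."""
    J1 = int.from_bytes(prev_block[0:4], "little")  # first 32 bits
    J2 = int.from_bytes(prev_block[4:8], "little")  # next 32 bits
    return J1, J2
\end{lstlisting}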
Argon2i computes indexes by running two rounds of the compression function \emph{G} in counter mode in the following way
\begin{eqnarray*}
x &\leftarrow& G(ZERO, LE64(r) || LE64(l) || LE64(s) || LE64(m') || LE64(t) ||\\
&& LE64(y) || LE64(i) || ZERO)
\end{eqnarray*}
The result \(x\) is 64 bits long and can therefore be viewed as two 32-bit strings.
\(x_1 || x_2 \leftarrow x\)
\(j_1 \leftarrow int32(x_1)\)
\(j_2 \leftarrow int32(x_2)\)
Inputs for the function \emph{G} are described below. Note that no data from memory blocks are used. Therefore, Argon2i is data-independent.
\begin{description}
\item[r] -- the pass number
\item[l] -- the lane number
\item[s] -- the slice number
\item[\(m'\)] -- total number of memory blocks
\item[t] -- total number of passes
\item[y] -- Argon2 mode
\item[i] -- counter starting from 1 in every segment
\item[ZERO] -- string of zeroes, in the first case 1024 bytes long, in the second case 968 bytes long
\end{description}
Argon2id behaves as Argon2i if the pass number is 0 and, at the same time, the slice number is 0 or 1. In all further computations it behaves like Argon2d.
Further computations common to all Argon2 modes are described in subsection~3.4.2 of \parencite{argon2draft}.
\section{Scrypt}
\label{sec:scrypt}
Scrypt is a PBKDF introduced by Colin Percival in \parencite{memoryhard}. At the time the paper was released (2009), the problems of PBKDF2 and its parallel computation were already well known. Percival came up with two new concepts: the memory-hard algorithm and the sequential memory-hard function. Both concepts came as a reaction to the increasing computing power of parallel hardware. He also defined the ROMix and SMix functions, which are sequential memory-hard functions performing mixing of data blocks in memory.
Scrypt accepts the memory/CPU cost parameter \emph{n}, the parallelization parameter \emph{p}, and the block size parameter \emph{r}. The idea behind the Scrypt function can be described as follows (a short usage example follows the list):
\begin{enumerate}
\item Use PBKDF2 with HMAC-SHA-256 as the PRF to process the passphrase and salt provided as Scrypt input and generate \emph{p} blocks. Their size is affected by the parameter \emph{r}.
\item The blocks generated in the previous step are independently mixed by a mixing function which is composed of the ROMix function, the SMix function, and an 8-round version of the Salsa20 core function. This step makes the whole computation expensive.
\item The result is used as the salt for another PBKDF2 computation, which produces the final output.
\end{enumerate}
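Scrypt is also available in the Python standard library (backed by OpenSSL), so a minimal usage sketch looks as follows; the cost parameters are arbitrary illustrative values.
\begin{lstlisting}[language=Python]
import hashlib
import os

salt = os.urandom(32)
key = hashlib.scrypt(b"correct horse battery staple", salt=salt,
                     n=2**15,   # CPU/memory cost (must be a power of two)
                     r=8,       # block size parameter
                     p=1,       # parallelization parameter
                     maxmem=64 * 1024 * 1024,  # these parameters need ~32 MiB
                     dklen=32)  # derive a 256-bit key
\end{lstlisting}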
There exists theoretical work describing a cache-timing side-channel attack against Scrypt, but it has not been put into practice yet \parencite{scryptattack}.
\section{Other PBKDFs}
The Password Hashing Competition received 24 submissions. The panel chose 9 finalists and finally Argon2 as the winner. Four finalists which received special recognition by PHC panel are described in this section.
Catena is presented as a password-scrambling framework \parencite{catena}. It can be used for many purposes, including password-based key derivation or proof of work. A Catena instance is characterized by a cryptographic primitive, a reduced primitive, an optional randomization layer, and a memory-hard function. These parts are independent of each other, and the whole framework is therefore modular. Three tunable parameters are provided, affecting time and memory requirements as well as resistance against precomputation attacks.
An interesting feature is the server relief protocol, which allows shifting the actual computation from a server to a client while keeping the memory and time requirements. For example, authentication servers could benefit from this feature by not being burdened by too many authentication requests. Another interesting feature is the option to increase the security parameters (recompute the hash) without knowledge of the actual password.
The Lyra2 function was chosen for its usage of cryptographic sponges \parencite{lyra2}. The function accepts parameters for tuning the memory and time cost. It supports parallelism on multi-core CPU platforms but tries to minimize the benefits of parallelism on GPUs. Different underlying sponge functions can be used.
The main highlight of the Makwa function is its ability to delegate its computation to untrustworthy clients \parencite{maqa}. This feature is similar to the one provided by Catena. A feature which makes this function unique is support for password escrow: it allows recovering a password with knowledge of a specific private key. This can be used, for example, for password recovery, although this method is definitely not ideal and should be considered carefully.
The Yescrypt function is based on the Scrypt function briefly described in section~\ref{sec:scrypt} \parencite{yescrypt}. It is therefore backward compatible with Scrypt and can compute Scrypt hashes. It also offers delegation of computation to external clients and hash upgrades without knowledge of the password. The running time can be increased separately from the memory usage and parallelism.
\chapter{Analysis of LUKS2}
\label{chap:analysis}
This chapter deals with the actual practical research performed as a part of the thesis. The main goal of this thesis is to create a price model for a potential attacker trying to gain unauthorized access to a LUKS2 encrypted partition. Another goal is tightly related: as explained in section~\ref{luks2pbkdf}, LUKS2 can determine three parameters for Argon2 through its own benchmarking function, and it is highly probable that most Cryptsetup users do not provide their own Argon2 parameters and therefore rely on this benchmark.
Therefore, before the actual model is created, it is necessary to collect sufficient data about the parameters produced by this benchmark under various real-world conditions. The process and results are described in section~\ref{sec:benchmark}. These parameters are further used while simulating attacks on LUKS2 encrypted volumes on real machines, which is described in section~\ref{sec:attack}. The results are used to create the actual price model in chapter~\ref{chap:model}.
\section{LUKS}
LUKS stands for Linux Unified Key Setup. The project started being developed by Clemens Fruhwirth as a reaction to several incompatible disk encryption schemes which coexisted at the beginning of the 21st century. At a certain point there were three incompatible disk encryption schemes which varied from Linux distribution to Linux distribution. If a user had created an encrypted disk, they could not be sure whether they would be able to decrypt it with a different distribution or even with a new version of the same distribution.
LUKS began as a metadata format for storing information about the cryptographic key setup. However, Fruhwirth discovered that to design a proper metadata format, he needed to know enough about the key setup process \parencite{newmethods}. Therefore, he created TKS1 and TKS2, templates for the key setup process. Together with LUKS, they ensure safe and standardized key management for disk encryption. After some user feedback, the LUKS on-disk specification version 1.0 was created in 2005 \parencite{luks1}. Currently the latest version is the LUKS on-disk specification version 2 \parencite{luks2}.
The reference implementation of both versions of LUKS is called Libcryptsetup; its user-space interface is called Cryptsetup. In the following two subsections, default parameters and information concerning command line switches are specific to this implementation.
\subsection{Usage of PBKDFs in LUKS version 1}
\label{luks1}
The PBKDF2 function is used as the key derivation function in LUKS version 1. It is used during master key initialization, adding of a new password, master key recovery, and also during password changing, because this operation is composed of the previously mentioned operations. In all operations it internally uses a hash algorithm specified by the user during initialization of the LUKS header; by default, the SHA-256 algorithm is used.
During initialization, the PBKDF2 function is used to create a checksum of the master key. This key is subsequently used for symmetric encryption of the actual data stored on the encrypted disk. The function receives the following parameters:
\begin{description}
\item[masterKey] -- a new randomly generated master key of user specified length
\item[phdr.mk-digest-salt] -- a cryptographic salt 32 bytes long which is used to prevent attacks against password using precomputed tables, see subsection~5.6.3 at \parencite{itmc14}
\item[phdr.mk-digest-iteration-count] -- number of iterations for PBKDF2, see section~\ref{sec:pbkdf2}
\item[LUKS\_DIGEST\_SIZE] -- length of the computed digest in bytes, the default is 20
\end{description}
The generated 20-byte checksum is stored in the LUKS header together with the iteration count and the salt. Please note that the \emph{phdr.mk-digest-iteration-count} parameter is obtained by performing a benchmark, with a minimum of 1,000 iterations.
When a new password is added, PBKDF2 is used to process the passphrase supplied by the user. It receives the following parameters:
\begin{description}
\item[password] -- a passphrase supplied by user
\item[ks.salt] -- randomly generated salt used for this particular keyslot with length of 32 bytes
\item[ks.iteration-count] -- number of PBKDF2 iterations
\item[MasterKeyLength] -- length of the derived key
\end{description}
The key derived from the passphrase is used to encrypt the master key. The master key has to be present in memory, either because initialization has happened recently or because it has been successfully recovered through a different key stored in another keyslot. Together with the encrypted master key, the salt and the iteration count are also written into the keyslot.
Note that the ks.iteration-count parameter can be influenced by the user in several ways \parencite{cryptsetupmanual}. One possibility is to specify the number of iterations directly with the command line option \verb+--pbkdf-force-iterations+. Another option is to specify the iteration time through the \verb+-i+ or \verb+--iter-time+ command line options; this option expects the number of milliseconds which should be spent unlocking the keyslot, and a benchmark is used to calculate the number of iterations corresponding to this time. If no option is specified, the default iteration time of 2,000 milliseconds is used.
The function is also used during master key recovery. This process is performed while unlocking the encrypted partition. During key recovery, the PBKDF is actually used twice for every keyslot until the master key is decrypted or there are no more keyslots to try. First, it is used to derive a decryption key from the passphrase supplied by the user. Then this decryption key is used to decrypt the encrypted master key stored in the current keyslot. The result is called the candidate master key, because we are still not sure whether the passphrase was correct. The candidate is again hashed with the PBKDF and finally compared with the digest of the master key stored in the LUKS header. If the digests match, the passphrase was entered correctly and the master key can be used. A simplified sketch of this flow is given after the two parameter lists below.
In the first case, the function receives the following parameters:
\begin{description}
\item[pwd] -- user supplied passphrase
\item[ks.salt] -- the salt value read from currently tried keyslot
\item[ks.iteration-count] -- the iteration count read from currently tried keyslot
\item[masterKeyLength] -- length of the derived key
\end{description}
In the second case the function receives following parameters:
\begin{description}
\item[masterKeyCandidate] -- candidate master key, see above
\item[ph.mk-digest-salt] -- the salt value which was used during the initialization phase and stored in the header
\item[ph.mk-digest-iter] -- number of PBKDF2 iterations used during the initialization phase and also stored in the header
\item[LUKS\_DIGEST\_SIZE] -- 20 bytes
\end{description}
The process of changing a password is composed of the previously mentioned operations, and hence I do not describe it here in greater detail. To sum it up, first the master key is recovered, then a new password is added to a new keyslot, and the previous keyslot is revoked.
Both password-based encryption and password checking require additional cryptographic primitives which process the derived key. For encryption, the reference implementation of LUKS uses AES-XTS-plain64, and for hashing it uses SHA-256. Usage of PBKDF2 requires an underlying pseudorandom function; in LUKS version 1, the default PRF is SHA-256. Alternatively, it is possible to choose SHA-512, RIPEMD-160, or Whirlpool.
\subsection{Usage of PBKDFs in LUKS version 2}
\label{luks2pbkdf}
LUKS version 2 extends LUKS version 1 and uses similar principles. Therefore, I will focus on the differences between versions 1 and 2 which are related to the usage of PBKDFs. For a detailed list of changes, see section~1.1 of \parencite{luks2}.
The new version supports configurable algorithms for encryption, hashing, and also key derivation. That means the set of algorithms can be extended as new ones are developed and others become obsolete. Note that there are still some requirements on the algorithms provided by the cryptographic back-end \parencite{luks2}: it has to support the SHA-1 and SHA-256 hashing functions, the PBKDF2, Argon2i, and Argon2id key derivation functions, and AES-XTS symmetric encryption.
LUKS2 introduces the memory-hard PBKDFs Argon2i and Argon2id, which are described in section~\ref{argon2}. The Argon2 functions should offer increased resistance to brute-force attacks.
The volume key digest is no longer limited to a length of 20 bytes, because it no longer relies on the SHA-1 hashing function. The processes described in subsection~\ref{luks1} remain the same in LUKS version 2, as do the situations in which PBKDFs are used.
As stated in subsection~\ref{luks1}, in the case of PBKDF2 the user can influence the number of iterations directly or specify the approximate time required for processing the passphrase; this stays the same for LUKS2. Functions from the Argon2 family introduce two additional parameters: memory cost and parallel cost. They can be specified through the \verb+--pbkdf-memory+ and \verb+--pbkdf-parallel+ command line options, respectively. The number of iterations is either benchmarked or can be specified manually through \verb+--pbkdf-force-iterations+.
The benchmarking function takes as its input a desired unlocking time specified in milliseconds and tries to find the best parameters while not exceeding this time. Note that the number of parallel threads is always at most 4, and it decreases if not enough CPUs are online at the time of benchmarking. In Cryptsetup version 2.1.0, the default unlocking time is set to 2,000 milliseconds and the default memory cost to 1,048,576 KiB; both default values can be changed at compilation time.
\section{Benchmarking of LUKS2 and Argon2 parameters}
\label{sec:benchmark}
In subsection~\ref{luks2pbkdf}, three parameters for Argon2 were mentioned. They have a crucial impact on the resistance of the resulting hash against dictionary or brute-force attacks; their effects are described in detail in section~\ref{argon2}. From Cryptsetup's point of view, these parameters impact not only security but user experience as well. If they are set too low, the security of the encryption keys can be endangered; if they are set too high, the unlocking process can take an undesirably long time or consume an excessive amount of computing power and memory. Cryptsetup does not assume that all users know the exact implications of choosing particular parameters, and therefore it provides a benchmarking function. Note that advanced users can still choose the parameters manually.
The benchmarking function takes as its input the desired unlocking time specified in milliseconds. It starts with default parameters and performs several PBKDF computations. By increasing and decreasing the parameters during these computations, it tries to find the most secure (the highest) parameters while not exceeding the unlocking time. Additionally, the function performs some estimations to reduce the number of required computations, because the benchmarking process should not last too long. Due to this approach, the resulting parameters depend on several conditions, in particular the available hardware and the current workload of the system.
Currently, some limitations are hard-coded into Cryptsetup. The minimal number of iterations for Argon2 is 4; the maximum is bounded by the unsigned 32-bit integer limit of the given platform. The minimum memory parameter is 32 KiB, the maximum is 4 GiB. The minimal degree of parallelism for Argon2 is 1, the maximum is 4. All the limits are hard-coded in \parencite{cryptsetupgitpbkdfcheck}. Note that the number of parallel threads is always at most 4 and decreases if not enough CPUs are online at the time of the benchmark.
In Cryptsetup version 2.1.0, the default unlocking time is set to 2,000 milliseconds and the default Argon2 memory cost to 1,048,576 KiB. Both default values can be changed at compilation time.
\subsection{Benchmarking tool}
As part of my thesis, I collected the Argon2 parameters estimated by the function mentioned above. I created a small tool which can perform a series of benchmarks with given parameters and output the results in a human-readable or machine-readable format. Additionally, the tool measures the time spent performing the benchmark. Although this information is not directly related to the topic of the thesis, it might be useful in the future. The source code of the tool is part of the thesis \parencite{thesisrepo}.
The tool accepts several parameters: the desired unlocking time, the maximum degree of parallelism, the maximum memory cost and the number of benchmarks to perform.
The tool can output results in a plain text format or in a CSV format with headers for easier machine processing. The tool does not perform the actual benchmarking itself; it utilizes the API of the Libcryptsetup library, which is therefore required for it to work. The tool does not create any Cryptsetup volumes and does not require root privileges.
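The core of such a tool is essentially a single call to \verb+crypt_benchmark_pbkdf()+ from the Libcryptsetup API. The following sketch only illustrates this call and is not the exact code of my tool; the test passphrase, the salt and the volume key size of 64 bytes are arbitrary values chosen for the example, and a \verb+NULL+ device context is assumed to be sufficient because the context is used only for logging.
\begin{verbatim}
#include <stdint.h>
#include <stdio.h>
#include <libcryptsetup.h>

/* The callback may abort the benchmark by returning non-zero. */
static int bench_progress(uint32_t time_ms, void *usrptr)
{
    (void) time_ms; (void) usrptr;
    return 0;
}

int main(void)
{
    struct crypt_pbkdf_type pbkdf = {
        .type             = CRYPT_KDF_ARGON2ID,
        .time_ms          = 2000,    /* requested unlocking time */
        .max_memory_kb    = 1048576, /* upper limit for memory cost */
        .parallel_threads = 4,
    };
    const char *salt = "0123456789abcdef";
    int r;

    r = crypt_benchmark_pbkdf(NULL, &pbkdf, "passphrase", 10,
                              salt, 16, 64, bench_progress, NULL);
    if (r < 0) {
        fprintf(stderr, "benchmark failed: %d\n", r);
        return 1;
    }

    printf("iterations: %u, memory: %u KiB, threads: %u\n",
           pbkdf.iterations, pbkdf.max_memory_kb,
           pbkdf.parallel_threads);
    return 0;
}
\end{verbatim}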
\subsection{Methodology of collecting real-world parameters}
\label{subsec:laptop}
I used the previously mentioned tool to collect real data on physical hardware, trying to simulate various environments in which Cryptsetup can be used to create and unlock encrypted volumes. I decided to use real hardware rather than virtual machines because of possible measurement inaccuracies caused by the virtualization layer.
The first hardware configuration was a Lenovo ThinkPad P50 laptop with an Intel Core i7-6820HQ CPU at 2.70 GHz (4 cores, 8 threads). The laptop was equipped with 32 GiB of DDR4 SODIMM RAM running at 2,133 MHz (0.5 ns) \parencite{laptopspecs}.
The second configuration was a Raspberry Pi 3 Model B+. This device is equipped with a quad-core Broadcom BCM2837B0 Cortex-A53 (ARMv8) 64-bit SoC at 1.4 GHz and 1 GiB of LPDDR2 SDRAM \parencite{raspberryspecs}. Note that the parameters of this device are considerably lower, but there still might be scenarios in which a user wishes to perform drive encryption on such a device (network-attached storage, a home media server). The device was running Raspbian GNU/Linux 9 (stretch).
I performed the following steps on both devices:
\begin{enumerate}
\item install the required libraries and headers
\item download and unpack the source code of Cryptsetup version 2.1.0
\item configure and compile Cryptsetup without support for udev and blkid to minimize the required dependencies (see below for remarks concerning the configuration options)
\item compile the benchmarking tool, linking it against the Libcryptsetup library built in the previous step
\item if desired, simulate hardware limitations (number of available CPUs, amount of available memory)
\item run a series of benchmarks for unlocking times of 1,000, 2,000, 3,000, 4,000, 5,000, 10,000 and 20,000 milliseconds with 1, 2, 3 and 4 parallel threads, with every benchmark repeated 100 times
\end{enumerate}
During compilation, I ran the configure script with the \verb+--disable-udev+ and \verb+--disable-blkid+ options. They disable support for udev and for device signature detection through blkid, respectively. These features were not needed during benchmarking, and they complicated the installation of the necessary development libraries.
Simulating various hardware conditions by limiting CPU performance and available memory turned out to be an interesting challenge. My goal was to simulate different numbers of available CPU cores and different amounts of available RAM and to collect results for various combinations of those parameters.
At first, I tried to use the Control Groups 2 feature of the Linux kernel \parencite{cgroups2}. Unfortunately, I was not able to get the desired results; in particular, I could not guarantee that the benchmarking process would really be confined to a limited set of CPU cores. Then I tried the Cpulimit project \parencite{cpulimit}, but I still could not guarantee that the benchmarking function would use only a certain number of CPU cores.
In the end, I found a solution for limiting the number of available CPU cores: individual cores can be disabled by writing a zero to the file
\begin{tt}
/sys/devices/system/cpu/cpu\emph{N}/online
\end{tt}
where \emph{N} is the number of the core to be turned off. Note that this method allowed me to change the state not only of individual physical cores but also of the individual virtual cores created by hyper-threading.
This method could not be applied to the Raspberry Pi, as the required file does not exist there, probably because of the different underlying hardware platform. Therefore, I limited the number of available CPUs with the
\begin{tt}
maxcpus=\emph{n}
\end{tt}
kernel parameter, where \emph{n} is the number of CPUs the kernel brings online. This device does not offer hyper-threading.
Due to the way in which Cryptsetup detects the available amount of memory, I decided to modify Cryptsetup itself \parencite{cryptsetuputils}. I directly modified the function \emph{crypt\_getphysmemory\_kb}, which returns the available physical memory in KiB, so that it reads the value from an environment variable; see the patch in the thesis Git repository \parencite{thesisrepo}.
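The idea of the patch can be sketched as follows. This is not the exact patch from the repository; the environment variable name \verb+FAKE_PHYSMEM_KB+ is hypothetical and serves only to illustrate the approach.
\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch of the patched detection: if the (hypothetical) variable
 * FAKE_PHYSMEM_KB is set, report that amount of physical memory
 * instead of the value derived from sysconf(). */
static uint64_t getphysmemory_kb(void)
{
    const char *fake = getenv("FAKE_PHYSMEM_KB");
    if (fake)
        return strtoull(fake, NULL, 10);

    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);
    if (pages < 0 || page_size < 0)
        return 0;

    return ((uint64_t)pages * (uint64_t)page_size) / 1024;
}
\end{verbatim}
With such a change, exporting the variable before running the benchmarking tool makes the library behave as if only the specified amount of RAM were installed.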
After finishing the benchmarks, I used the Datamash tool to perform basic statistical computations on the resulting CSV data \parencite{datamash}. The raw data as well as the Datamash results can be found in the thesis Git repository \parencite{thesisrepo}.
\FloatBarrier
\subsection{Collected benchmarking results}
The following tables present selected results obtained through the methodology described in the previous subsection. The following conventions are used in the table headers:
\begin{description}
\item[time] -- requested unlocking time in milliseconds
\item[avg(t)] -- average benchmarked Argon2 time cost expressed as a number of iterations
\item[stdev(t)] -- standard deviation of the benchmarked Argon2 time cost expressed as a number of iterations
\item[avg(M)] -- average benchmarked Argon2 memory cost expressed in KiB
\item[stdev(M)] -- standard deviation of benchmarked Argon2 memory cost expressed in KiB
\end{description}
Table~\ref{tab:l8c32g} serves as an introduction and shows what parameters to expect. The most important row concerns the unlocking time of 2,000 milliseconds, as these values were used as the baseline for simulating attacks in the next chapter. Tables~\ref{tab:l4c4g} and \ref{tab:l4c2g} demonstrate how hardware limitations affect the Argon2 parameters. See the Git repository for more combinations of CPU and memory limitations \parencite{thesisrepo}.
\noindent
\begin{table}
\caption{Benchmark results for laptop with 8 threads and 32 GiB of memory}
\label{tab:l8c32g}
\begin{tabularx}{\textwidth}{| X | X | X | X | X |}
\hline
time in ms & avg(t) in iterations & stdev(t) in iterations & avg(M) in KiB & stdev(M) in KiB\\
\hline
1,000 & 4 & 0 & 804,032.75 & 3,886\\
\hline
2,000 & 6 & 0 & 1,048,576 & 0\\
\hline
3,000 & 9 & 0 & 1,048,576 & 0\\
\hline
4,000 & 12 & 0 & 1,048,576 & 0\\
\hline
5,000 & 15 & 0 & 1,048,576 & 0\\
\hline
10,000 & 31.01 & 0.099 & 1,048,576 & 0\\
\hline
20,000 & 64 & 0 & 1,048,576 & 0\\
\hline
\end{tabularx}
\end{table}
\noindent
\begin{table}
\caption{Benchmark results for laptop with 4 cores and 4 GiB of memory}
\label{tab:l4c4g}
\begin{tabularx}{\textwidth}{| X | X | X | X | X |}
\hline
time in ms & avg(t) in iterations & stdev(t) in iterations & avg(M) in KiB & stdev(M) in KiB\\
\hline
1,000 & 4 & 0 & 785,173.44 & 4,532\\
\hline
2,000 & 5 & 0 & 1,048,576 & 0\\
\hline
3,000 & 9 & 0 & 1,048,576 & 0\\
\hline
4,000 & 12 & 0 & 1,048,576 & 0\\
\hline
5,000 & 15 & 0 & 1,048,576 & 0\\
\hline
10,000 & 31 & 0 & 1,048,576 & 0\\
\hline
20,000 & 61.97 & 0.298 & 1,048,576 & 0\\
\hline
\end{tabularx}
\end{table}
\noindent
\begin{table}
\caption{Benchmark results for laptop with 4 cores and 2 GiB of memory}
\label{tab:l4c2g}
\begin{tabularx}{\textwidth}{| X | X | X | X | X |}
\hline
time in ms & avg(t) in iterations & stdev(t) in iterations & avg(M) in KiB & stdev(M) in KiB\\
\hline
1,000 & 4 & 0 & 781,944 & 1,715.758\\
\hline
2,000 & 6 & 0 & 1,048,576 & 0\\
\hline
3,000 & 9 & 0 & 1,048,576 & 0\\
\hline
4,000 & 12 & 0 & 1,048,576 & 0\\
\hline
5,000 & 15 & 0 & 1,048,576 & 0\\
\hline
10,000 & 31.26 & 0.438 & 1,048,576 & 0\\
\hline
20,000 & 63.57 & 1.041 & 1,048,576 & 0\\
\hline
\end{tabularx}
\end{table}
\FloatBarrier
Tables~\ref{tab:comp2000} and \ref{tab:comp1000} compare benchmarking results for unlocking times of 2,000 and 1,000 ms, respectively. Cryptsetup was compiled with different options, explained below:
\begin{description}
\item[none] -- no special options were added
\item[SSE] -- the \verb+--enable-internal-sse-argon2+ configuration option was added. If this option is specified, additional checks for special CPU features are performed (SSE2, SSE3, AVX2, AVX512). If they are detected, Argon2 is compiled with support for the most advanced one.
\item[native] -- the \verb+-march=native+ flag was passed to the compiler, optimizing the source code for the particular platform
\item[external] -- Cryptsetup was compiled against an external Argon2 library
\end{description}
\noindent
\begin{table}
\caption{Comparing results for various Cryptsetup compilation options with unlocking time of 2,000 ms}
\label{tab:comp2000}
\begin{tabularx}{\textwidth}{| l | X | X | X | X |}
\hline
flags & avg(t) in iterations & stdev(t) in iterations & avg(M) in KiB & stdev(M) in KiB\\
\hline
none & 6 & 0 & 1,048,576 & 0\\
\hline
SSE & 7 & 0 & 1,048,576 & 0\\
\hline
native & 6 & 0 & 1,048,576 & 0\\
\hline
native + SSE & 11 & 0 & 1,048,576 & 0\\
\hline
external & 6 & 0 & 1,048,576 & 0\\
\hline
\end{tabularx}
\end{table}
\noindent
\begin{table}
\caption{Comparing results for various Cryptsetup compilation options with unlocking time of 1,000 ms}
\label{tab:comp1000}
\begin{tabularx}{\textwidth}{| l | X | X | X | X |}
\hline
flags & avg(t) in iterations & stdev(t) in iterations & avg(M) in KiB & stdev(M) in KiB\\
\hline
none & 4 & 0 & 804,032.75 & 3,886.047\\
\hline
SSE & 4 & 0 & 949,740.83 & 2,305.669\\
\hline
native & 4 & 0 & 863,822.25 & 1,414.500\\
\hline
native + SSE & 5 & 0 & 1,048,576 & 0\\
\hline
external & 4 & 0 & 846,713.44 & 1,359.307\\
\hline
\end{tabularx}
\end{table}
Notice that combining the optimized version of Argon2 with compilation for the particular platform produces the best results: it almost doubles the number of iterations compared with the benchmark without these optimizations, and it yields only one iteration fewer than the 4,000-millisecond unlocking time benchmarked without optimizations. Unfortunately, such a combination of configuration options is not very common. A quick review of several Linux distributions (Debian, Ubuntu, Fedora) shows that they package the Argon2 library separately and link Cryptsetup against it. Also, in order to support the broadest possible range of devices, platform-specific optimizations are not used.
Table~\ref{tab:r4c1g} shows the benchmarking results for the Raspberry Pi. Notice that the memory cost is significantly lower. It is important to note that the benchmarking algorithm of Cryptsetup will never use more than half of the available physical memory, to prevent the out-of-memory killer from terminating the benchmarking process. Note also that the algorithm first tries to use the maximum available memory and only then increases the number of iterations.
\noindent
\begin{table}
\caption{Benchmark results for Raspberry Pi with 4 cores and 1 GiB of memory}
\label{tab:r4c1g}
\begin{tabularx}{\textwidth}{| X | X | X | X | X |}
\hline
time in ms & avg(t) in iterations & stdev(t) in iterations & avg(M) in KiB & stdev(M) in KiB\\
\hline
1,000 & 4 & 0 & 43,291.74 & 3,628\\
\hline
2,000 & 4 & 0 & 95,296.58 & 8,199\\
\hline
3,000 & 4 & 0 & 150,860.59 & 10,508\\
\hline
4,000 & 4 & 0 & 203,604.14 & 6,723\\
\hline
5,000 & 4 & 0 & 255,361.51 & 10,598\\
\hline
10,000 & 4 & 0 & 474,446.97 & 2,737\\
\hline
20,000 & 7.81 & 0.879 & 474,722 & 0\\
\hline
\end{tabularx}
\end{table}
Comparing tables \ref{tab:r4c1g} and \ref{tab:r4c1go} shows that there is no significant difference between the values. The values in table~\ref{tab:r4c1g} were benchmarked after compilation with the \verb+-march=native+ flag. The \verb+--enable-internal-sse-argon2+ option cannot be used on the Raspberry Pi, because the ARM architecture does not offer this kind of CPU instructions. This also confirms the premise that Argon2 is difficult to optimize for platforms other than x86.
\noindent
\begin{table}
\caption{Benchmark results for Raspberry Pi with optimizations}
\label{tab:r4c1go}
\begin{tabularx}{\textwidth}{| X | X | X | X | X |}
\hline
time in ms & avg(t) in iterations & stdev(t) in iterations & avg(M) in KiB & stdev(M) in KiB\\
\hline
1,000 & 4 & 0 & 25,639.15 & 4,745.422\\
\hline
2,000 & 4 & 0 & 53,143.46 & 1,493.866\\
\hline
3,000 & 4 & 0 & 82,432.72 & 2,296.222\\
\hline
4,000 & 4 & 0 & 111,220.07 & 1,818.082\\
\hline
5,000 & 4 & 0 & 140,627.5 & 1,984.865\\
\hline
10,000 & 4 & 0 & 292,344.17 & 1,726.421\\
\hline
20,000 & 4.17 & 0.375 & 474,722 & 0\\
\hline
\end{tabularx}
\end{table}
\FloatBarrier
\section{Attacking LUKS2}
\label{sec:attack}
This section describes the rest of the practical research. The next chapter then defines and presents the price model, which is based on the results collected here. The following three subsections describe the utilized hardware and software; the last subsection presents the results.
The motivation behind the experimental simulations described in this section was to make the price model more precise and realistic. I found some published benchmarks of Argon2 on CPUs and GPUs \parencite{argon2gpuold}; they are, however, almost four years old. Since that time, new technologies have appeared in the GPU and CPU segments and Argon2 has gone through several upgrades. Another reason was that using real hardware allowed me to fine-tune parameters such as the number of CPU cores, the amount of available memory and the available specialized CPU instructions.
\subsection{Utilized hardware}
For the purpose of this thesis I chose to utilize machines offered by the MetaCentrum VO organization, because they are easily accessible to registered students and offer a diverse spectrum of computing resources. This proved especially useful considering that the attacker defined in section~\ref{sec:attacker} is expected to utilize the most powerful machines possible. MetaCentrum VO is a virtual organization supporting all members of the Czech National Grid Organization. It manages and operates distributed computing and storage resources owned by Cesnet as well as by other cooperating academic organizations in the Czech Republic.
The following list describes the parameters of the most interesting machines used during the research. For a description of all the machines offered by MetaCentrum, see \parencite{metacentrumhw}.
\begin{description}
\item[Doom] -- $2 \times 8$ cores Intel Xeon E5-2650 v2 at 2.60 GHz, hyper-threading with 2 threads per core, supporting SSSE3 instructions, 64 GiB of RAM, Nvidia Tesla K20 with 4,743 MiB of memory
\item[Konos] -- $2\times10$ cores Intel Xeon E5-2630 v4 at 2.20 GHz, hyper-threading with 2 threads per core, supporting AVX2 instructions, 128 GB of RAM, Nvidia GeForce GTX 1080 Ti with 11,178 MiB of GPU memory
\item[Nympha] -- $2\times16$ cores Intel Xeon Gold 6130 at 2.10 GHz, hyper-threading with 2 threads per core, supporting AVX512F instructions, 192 GiB of RAM
\item[Manegrot] -- $4\times8$ cores Intel Xeon E5-4627 v2 at 3.30 GHz, no hyper-threading, supporting SSSE3 instructions, 512 GiB of RAM
\item[Uruk] -- $8\times18$ cores Intel Xeon Gold 6140 at 3.70 GHz, hyper-threading with 2 threads per core, supporting AVX512F instructions, 2.86 TiB of RAM
\item[White] -- $2\times12$ cores Intel Xeon Gold 6138 at 2.00 GHz, no hyper-threading, supporting SSSE3 instructions, 512 GiB of RAM, Nvidia Tesla P100 with 16,280 MiB of memory
\item[Zubat] -- $2\times8$ cores Intel Xeon E5-2630 v3 at 2.40 GHz, hyper-threading with 2 threads per core, supporting SSSE3 instructions, 128 GiB of RAM, Nvidia Tesla K20Xm with 5,700 MiB of memory
\end{description}
While performing computations on the CPU, I always allocated the whole computing host and requested the maximum number of available CPUs. I always requested at least $(T/L) \times M$ GiB of memory, where \emph{T} denotes the number of available CPU threads, \emph{L} the degree of parallelism of one Argon2 instance and \emph{M} the memory cost in GiB. For example, with 64 CPU threads, an Argon2 parallelism of 4 and a memory cost of 1 GiB, at least 16 GiB of memory was requested. While simulating attacks on the GPU, I always reserved one GPU instance, 1 CPU and 1 GiB of RAM; GPU reservations are always exclusive. All tests used SSDs as the underlying storage.
\subsection{Naive method}
\label{subsec:naive}
The first chosen attack method is naive but simple. It requires the Libcryptsetup library to be present because it directly utilizes its API. The main idea is to try to recover the master key through normal means, but to do so in multiple parallel threads.
I created a tool which performs this attack. The tool is based on a proof of concept called Dict\_search \parencite{cryptsetupdictsearch}. I modified it so that it uses threads instead of processes, measures every recovery attempt and provides statistics; I also improved its usability. The tool can be found in the thesis Git repository \parencite{thesisrepo}.
The tool expects the following inputs, where the letters denoting the parameters correspond to the command line options:
\begin{description}
\item[t] -- type of the encrypted volume; so far LUKS1, TrueCrypt and LUKS2 are recognized, and other types supported by Cryptsetup can easily be added with an appropriate handler in the source code
\item[i] -- base name of input device files
\item[p] -- input password file with one password per line
\item[T] -- number of parallel threads
\end{description}
The tool performs basic validation of the input parameters and launches the specified number of threads. Every thread receives a handle to the specified file containing the encrypted partition, opens the password file and starts attempting to unlock a keyslot with passwords read from it. The password file is internally divided into segments, and every thread reads a different segment. The tool runs until the passphrase is found or all the passphrases in the input file have been tried.
The memory footprint of this method can be expressed as $M \times T$, where \emph{M} denotes the memory cost parameter used during the creation of the encrypted partition and \emph{T} the number of cracking threads specified when launching the tool. The number of utilized CPU threads can be expressed as $P \times T$, where \emph{P} denotes the Argon2 degree of parallelism used during the initialization of the encrypted drive and \emph{T} is again the number of parallel cracking threads.
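The core of one cracking thread can be sketched as follows. This is a simplified illustration rather than the exact code of the tool: error handling, the division of the password file into segments and the collection of statistics are omitted. Passing \verb+NULL+ as the device name to \verb+crypt_activate_by_passphrase()+ only checks the passphrase against the keyslots without activating the device.
\begin{verbatim}
#include <stdio.h>
#include <string.h>
#include <libcryptsetup.h>

/* Simplified sketch: try every password from a wordlist against
 * the LUKS2 header of the given device image. */
static int try_passwords(const char *device, const char *wordlist)
{
    struct crypt_device *cd = NULL;
    FILE *f = NULL;
    char pwd[512];
    int r = -1;

    if (crypt_init(&cd, device) < 0)
        return -1;
    if (crypt_load(cd, CRYPT_LUKS2, NULL) < 0)
        goto out;

    f = fopen(wordlist, "r");
    if (!f)
        goto out;

    while (fgets(pwd, sizeof(pwd), f)) {
        pwd[strcspn(pwd, "\n")] = '\0';
        /* NULL name: only check the passphrase, do not activate. */
        r = crypt_activate_by_passphrase(cd, NULL, CRYPT_ANY_SLOT,
                                         pwd, strlen(pwd), 0);
        if (r >= 0) {
            printf("found passphrase: %s (keyslot %d)\n", pwd, r);
            break;
        }
    }
    fclose(f);
out:
    crypt_free(cd);
    return r;
}
\end{verbatim}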
Because an attacker would try to speed up the cracking process as much as possible, I compiled Cryptsetup with support for the optimized Argon2 using the \verb+--enable-internal-sse-argon2+ option, and I applied a simple patch which disables clearing of the internal Argon2 memory after every computation. The patch can be found in the Git repository of the thesis \parencite{thesisrepo}.
\subsection{Method based on Argon2-gpu-bench tool}
This method utilizes a tool called Argon2-gpu-bench written by Ondrej Mosnáček \parencite{argon2gpu}. The tool was originally created to benchmark the speed and energy consumption of computing Argon2 hashes. The background scenario was an offline brute-force attack against an Argon2 hash, which overlaps with the goals of my thesis; that is why I selected this tool.
Note that in its current implementation this is not a password guessing tool; it only computes Argon2 hashes of provided inputs. Originally the tool generated random 64-byte strings. I modified it so that the strings are loaded from a provided file containing one password per line. My modifications can be found at \parencite{argon2gpuvojta}.
The scenario expects that an attacker would use this tool to precompute Argon2 hashes and later, possibly in parallel, use the hashes as decryption keys to recover the master key. The right password could be verified by decrypting several sectors of the volume with the recovered master key and searching for well-known signatures. Note that the AES decryption of sectors can certainly be parallelized, and there exist special CPU instructions to improve its speed. Although the time required to try all the candidate hashes as decryption keys is not negligible, it is certainly smaller and less memory-intensive than the Argon2 computations.
The tool can compute Argon2 in three different modes. It can utilize the standard CPU implementation, automatically choosing the most optimized variant according to the supported CPU instructions; clearing of internal memory is disabled for performance reasons. It also contains an Argon2 implementation for the OpenCL platform and three different GPU kernels (implementations) to be used with the CUDA platform. Every GPU kernel handles the GPU memory differently, possibly producing different results under different parameters.
The \emph{Master} kernel uses only the shared memory of the GPU, the \emph{Warpshuffle} kernel uses only GPU registers, and the \emph{Warpshuffleshared} kernel uses a combination of both. Additionally, when running in CUDA or OpenCL mode, it is possible to select one of two modes of operation: the \emph{Oneshot} mode performs each Argon2 computation as one GPU / OpenCL task, whereas the \emph{By-segment} mode divides the computation of Argon2 slices among individual GPU / OpenCL tasks.
Argon2-gpu-bench can also precompute values for Argon2i and Argon2id, speeding up the process; this feature is available only in the CUDA and OpenCL modes.
The following list describes selected input parameters of the application; the letters match the command line options.
\begin{description}
\item[L] -- Argon2 degree of parallelism
\item[M] -- Argon2 memory cost parameter
\item[T] -- number of Argon2 iterations
\item[s] -- number of computations
\item[b] -- number of hashes computed in every computation (batches)
\end{description}
Please note the \emph{s} and \emph{b} parameters; they are important when interpreting the results collected with this tool. The tool performs \emph{s} consecutive discrete computations. Every such computation is measured separately, and in the OpenCL and CUDA modes the reported time includes the writing and reading of data. In every such phase, \emph{b} hash computations are performed, ideally in parallel (this is not always possible or desirable). The length of every such small computation is measured and averaged at the end of the whole benchmark. In total, the tool therefore performs $s \times b$ computations.
The tool tries to run as many parallel threads as possible. The number of such threads is $b \times L$ and the amount of memory needed is $b \times M$. However, there are practical upper limits for both values. When computing on the CPU, through the standard Argon2 CPU implementation or the OpenCL platform, the number of available CPU threads is effectively the limit. When computing on the GPU, through OpenCL or CUDA, the tool cannot run if the amount of memory required exceeds the memory available on the GPU.
\FloatBarrier
\subsection{Results}
Here I describe and compare the results collected by the two methods described in the previous subsections. These numbers serve as the direct basis for the price model described in the next chapter. The following list explains the column headings used in the tables, together with their units:
\begin{description}
\item[threads] -- number of threads passed as a parameter to the cracking tool
\item[batch size] -- number of concurrent Argon2 computations passed through \verb+-b+ parameter to the Argon2-gpu-bench tool
\item[avg(time per hash)] -- average time spent calculating one hash in milliseconds
\item[stdev(time per hash)] -- standard deviation of time spent computing one hash in milliseconds
\item[avg(time per batch)] -- average time spent computing one batch of hashes in milliseconds
\item[stdev(time per batch)] -- standard deviation of time spent computing one batch of hashes in milliseconds
\end{description}
Tables~\ref{tab:nymphacracker}, \ref{tab:nymphacpu} and \ref{tab:konosgpu} show a baseline of results. Nympha and Konos were the most powerful machines for computing Argon2 that I managed to find in MetaCentrum. We can see that cracking hashes with a high memory parameter (1,048,576 KiB) is highly ineffective with the naive approach. The best result was achieved while running two concurrent unlocking operations, therefore occupying 8 threads; this configuration enabled the computation of 1 hash per 1.107 seconds.
However, using a powerful enough CPU supporting AVX512F instructions, it is possible to get more interesting results. Table~\ref{tab:nymphacpu} is particularly interesting. As we can see, it is possible to reach a speed of 1 hash per roughly 215 milliseconds, that is, 4.651 hashes per second. From the theoretical point of view, the most effective computation should occur while running 16 concurrent Argon2 computations using $16 \times 4 = 64$ threads, as the Nympha machine offers 64 CPU threads. As we can observe, at this batch size one hash is computed in 296.3 milliseconds. However, increasing the batch size up to 512 decreases the time to roughly 215 milliseconds. This is probably caused by desynchronization of threads during the computation, because it is not possible to have 512 computations running at the same time when each of them requires 1 GiB of memory and the machine has only about 180 GiB of memory available. Apparently, the threads are managed by the operating system scheduler in such a way that performance is not degraded.
A single GPU can achieve comparable times. The GPU of the Konos machine can run at most 10 parallel Argon2 computations, since with a memory cost of 1 GiB per hash a batch of 10 already occupies about 10 GiB of its 11,178 MiB of memory; it calculates one hash in circa 307 milliseconds, that is, 3.25 hashes per second. Table~\ref{tab:konosgpuopt} shows that, thanks to precomputing values for Argon2i, it is possible to reach 294.5 milliseconds per hash. All values in table~\ref{tab:konosgpuopt} were computed with \emph{b} = 10.
\noindent
\begin{table}
\caption{Cracker running on Nympha against drive benchmarked on laptop with unlocking time of 2,000 ms}
\label{tab:nymphacracker}
\begin{tabularx}{\textwidth}{| X | X | X |}
\hline
threads & avg(time per hash) in ms & stdev(time per hash) in ms\\
\hline
1 & 1,584.767 & 50.545\\
\hline
2 & 1,634.917 & 82.044\\
\hline
4 & 1,906.227 & 106.585\\
\hline
8 & 2,502.031 & 178.339\\
\hline
16 & 4,302.858 & 385.439\\
\hline
32 & 8,482.634 & 816.596\\
\hline
64 & 17,180.945 & 1,810.248\\
\hline
128 & 34,929.874 & 3,893.617\\
\hline
\end{tabularx}
\end{table}
\noindent
\begin{table}
\caption{Argon2-gpu-bench running on Nympha in CPU mode against parameters benchmarked on laptop with unlocking time of 2,000 ms}
\label{tab:nymphacpu}
\begin{tabularx}{\textwidth}{| X | X | X | X | X |}
\hline
batch size & avg(time per hash) in ms & stdev(time per hash) in ms & avg(time per batch) in ms & stdev(time per batch) in ms\\
\hline
1 & 1,641.408 & 29.667 & 1,641.408 & 29.667\\
\hline
2 & 893.785 & 21.578 & 1,787.570 & 43.156\\
\hline
4 & 525.360 & 10.351 & 2,101.441 & 41.407\\
\hline
8 & 351.323 & 10.219 & 2,810.587 & 81.753\\
\hline
16 & 296.345 & 4.376 & 4,741.530 & 70.018\\
\hline
32 & 252.618 & 2.967 & 8,083.808 & 94.964\\
\hline
64 & 235.185 & 2.112 & 15,051.856 & 135.227\\
\hline
128 & 224.986 & 2.046 & 28,798.256 & 261.982\\
\hline
256 & 219.888 & 1.691 & 56,291.514 & 433.037\\
\hline
512 & 217.429 & 1.660 & 111,323.760 & 850.345\\
\hline
\end{tabularx}
\end{table}
\noindent
\begin{table}
\caption{Argon2-gpu-bench running on Konos in CUDA mode against parameters benchmarked on laptop with unlocking time of 2,000 ms}
\label{tab:konosgpu}
\begin{tabularx}{\textwidth}{| X | X | X | X | X |}
\hline
batch size & avg(time per hash) in ms & stdev(time per hash) in ms & avg(time per batch) in ms & stdev(time per batch) in ms\\
\hline
1 & 2,996.175 & 0.057 & 2,996.175 & 0.057\\
\hline
2 & 1,499.818 & 0.025 & 2,999.637 & 0.051\\
\hline
3 & 1,000.870 & 0.049 & 3,002.610 & 0.149\\
\hline
4 & 752.094 & 0.015 & 3,008.378 & 0.060\\
\hline
5 & 602.253 & 0.007 & 3,011.267 & 0.037\\
\hline
6 & 502.370 & 0.106 & 3,014.223 & 0.640\\
\hline
7 & 430.885 & 0.007 & 3,016.202 & 0.053\\
\hline
8 & 382.385 & 0.079 & 3,059.081 & 0.638\\
\hline
9 & 340.405 & 0.060 & 3,063.645 & 0.545\\
\hline
10 & 306.682 & 0.042 & 3,066.823 & 0.425\\
\hline
\end{tabularx}
\end{table}
\noindent
\begin{table}
\caption{Argon2-gpu-bench running on Konos in CUDA mode comparing various options against parameters benchmarked on laptop with unlocking time of 2,000 ms}
\label{tab:konosgpuopt}
\begin{tabularx}{\textwidth}{| X | X | X | X | X |}
\hline
option & avg(time per hash) in ms & stdev(time per hash) in ms & avg(time per batch) in ms & stdev(time per batch) in ms\\
\hline
master & 306.682 & 0.042 & 3,066.823 & 0.425 \\
\hline
master + precomputes & 294.535 & 0.045 & 2,945.355 & 0.456 \\
\hline
master + oneshot & 305.809 & 0.009 & 3,058.096 & 0.989 \\
\hline