fix: update to new reactant changes #1140
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
07811ad to 879a599
Lux Benchmarks
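For reading the table that follows: judging from the listed values, the Ratio column is the current timing divided by the previous one (for example, 4250 ns / 3625 ns ≈ 1.17), so a value above 1 means the current commit measured slower. A minimal Python sketch of that arithmetic is below. The function names and the 1.2 threshold are illustrative assumptions, not part of the benchmark tooling.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Current timing divided by previous timing; a value above 1 means slower."""
    if previous_ns <= 0:
        raise ValueError("previous timing must be positive")
    return current_ns / previous_ns


def classify(current_ns: float, previous_ns: float, threshold: float = 1.2) -> str:
    """Label a row using a hypothetical symmetric threshold (assumed, not the CI's)."""
    r = ratio(current_ns, previous_ns)
    if r >= threshold:
        return "slower"
    if r <= 1 / threshold:
        return "faster"
    return "within threshold"


if __name__ == "__main__":
    # (benchmark name, current ns, previous ns) copied from rows of the table
    rows = [
        ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 4250, 3625),
        ("mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s)", 14104479, 9214250),
        ("layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)", 219687.5, 296708.5),
    ]
    for name, cur, prev in rows:
        print(f"{name}: {ratio(cur, prev):.2f} ({classify(cur, prev)})")
```

Small CPU kernels in the nanosecond range swing by 10 to 20 percent between runs in this table, so a single ratio slightly above 1 is weak evidence of a regression on its own.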
Benchmark suite | Current: 919da19 | Previous: ac2879b | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
4250 ns |
3625 ns |
1.17 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
4417 ns |
4541 ns |
0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
5208 ns |
5125 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4042 ns |
3791 ns |
1.07 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
60545.5 ns |
61743 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
10542 ns |
10125 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
10333 ns |
10875 ns |
0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
11167 ns |
10334 ns |
1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
10917 ns |
10417 ns |
1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
416590 ns |
430910 ns |
0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
1000 ns |
1209 ns |
0.83 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
1250 ns |
1209 ns |
1.03 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
1417 ns |
1500 ns |
0.94 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
1167 ns |
1042 ns |
1.12 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA |
17774 ns |
18223.5 ns |
0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
3875 ns |
4000 ns |
0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
4000 ns |
4042 ns |
0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
4375 ns |
4334 ns |
1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
3959 ns |
3875 ns |
1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA |
108234 ns |
110886 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57708 ns |
56709 ns |
1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46833 ns |
38334 ns |
1.22 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
38458 ns |
46917 ns |
0.82 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82000 ns |
81750 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
37210 ns |
37932 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2035062.5 ns |
2043708.5 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2094209 ns |
2096520.5 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2101416 ns |
2096437.5 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1989645.5 ns |
1991167 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
194634 ns |
197294.5 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
142458 ns |
144625 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
144334 ns |
145667 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
180625 ns |
144916 ns |
1.25 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
143709 ns |
144854.5 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
166393 ns |
166157.5 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1119458 ns |
1116791 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1130458 ns |
1150459 ns |
0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1149000 ns |
1128083 ns |
1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1117666 ns |
1121458 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
533205 ns |
535998 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
4042 ns |
3417 ns |
1.18 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
3584 ns |
4042 ns |
0.89 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4729.5 ns |
4459 ns |
1.06 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3542 ns |
3187.5 ns |
1.11 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
71816 ns |
72464.5 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9916 ns |
9417 ns |
1.05 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9000 ns |
9458 ns |
0.95 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10750 ns |
9750 ns |
1.10 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9667 ns |
8708 ns |
1.11 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
477975 ns |
469472 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
15750 ns |
14375 ns |
1.10 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
15750 ns |
16208 ns |
0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
16750 ns |
18750 ns |
0.89 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
15625 ns |
16875 ns |
0.93 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
56473 ns |
54038 ns |
1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
214833 ns |
213375 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
213917 ns |
220000 ns |
0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
216458 ns |
217250 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
217625 ns |
213916 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
275838 ns |
270771 ns |
1.02 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
458 ns |
541 ns |
0.85 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
583 ns |
542 ns |
1.08 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
750 ns |
708 ns |
1.06 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
584 ns |
667 ns |
0.88 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA |
17668 ns |
17308 ns |
1.02 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
1417 ns |
1417 ns |
1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
1542 ns |
1375 ns |
1.12 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
1500 ns |
1541 ns |
0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
1542 ns |
1417 ns |
1.09 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA |
102524 ns |
101606.5 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
6958 ns |
7083 ns |
0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5834 ns |
5250 ns |
1.11 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5250 ns |
5958 ns |
0.88 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9959 ns |
10084 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
23943 ns |
23383 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221458 ns |
221709 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
228271 ns |
229750 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
230541.5 ns |
229125 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
226416 ns |
214125 ns |
1.06 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
171422 ns |
167775.5 ns |
1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
3959 ns |
4000 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4041 ns |
3917 ns |
1.03 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
3917 ns |
3917 ns |
1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA |
24027 ns |
23070 ns |
1.04 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
17041 ns |
17083 ns |
1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16875 ns |
16625 ns |
1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
18250 ns |
17083 ns |
1.07 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16625 ns |
16833 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA |
165583.5 ns |
162035 ns |
1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
758167 ns |
575083 ns |
1.32 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
571084 ns |
571792 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
604834 ns |
570750 ns |
1.06 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
577708 ns |
577208 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA |
113622 ns |
113295 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
1597000 ns |
1418250 ns |
1.13 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1419083 ns |
1422875 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1456833 ns |
1422500 ns |
1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
1431167 ns |
1425750 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA |
217049.5 ns |
211866.5 ns |
1.02 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) |
1089917 ns |
1081041.5 ns |
1.01 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) |
963875 ns |
946916.5 ns |
1.02 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) |
1340062 ns |
1353229.5 ns |
0.99 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) |
1297541 ns |
1292458 ns |
1.00 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA |
279085 ns |
269913.5 ns |
1.03 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) |
6040833 ns |
6001958 ns |
1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) |
4543541 ns |
4632042 ns |
0.98 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) |
4963292 ns |
4929041.5 ns |
1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) |
5626250 ns |
5549750.5 ns |
1.01 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA |
1103850 ns |
1070564 ns |
1.03 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
542 ns |
542 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
542 ns |
542 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
583 ns |
542 ns |
1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
500 ns |
542 ns |
0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA |
24109 ns |
23780 ns |
1.01 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2209 ns |
2209 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2125 ns |
2209 ns |
0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
2291 ns |
2208 ns |
1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2084 ns |
1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA |
174843 ns |
170642 ns |
1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
4792 ns |
3667 ns |
1.31 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
4000 ns |
4750 ns |
0.84 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5625 ns |
5208 ns |
1.08 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
4229.5 ns |
4041 ns |
1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
65901.5 ns |
65525 ns |
1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
11750 ns |
11084 ns |
1.06 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10958 ns |
12083 ns |
0.91 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12292 ns |
12208 ns |
1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10833 ns |
10834 ns |
1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
454002 ns |
445478.5 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6834 ns |
5917 ns |
1.15 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
5958.5 ns |
6666 ns |
0.89 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
8167 ns |
8167 ns |
1 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
7125 ns |
6166 ns |
1.16 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
53600.5 ns |
52877 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
17917 ns |
18250 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
16834 ns |
18458 ns |
0.91 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
19333 ns |
18542 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
17833 ns |
17520.5 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
303784.5 ns |
296963 ns |
1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
625 ns |
583 ns |
1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
625 ns |
1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
666 ns |
667 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
542 ns |
1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
33288 ns |
32928.5 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
8917 ns |
9271 ns |
0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
8500 ns |
9208 ns |
0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9625 ns |
9354.5 ns |
1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
8333 ns |
8375 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
160686 ns |
157633 ns |
1.02 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
64459 ns |
64458 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
64459 ns |
64917 ns |
0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
64625 ns |
64583 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
64583 ns |
64375 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA |
113043.5 ns |
111288 ns |
1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
284041 ns |
278375 ns |
1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
279709 ns |
292291 ns |
0.96 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
291417 ns |
278833 ns |
1.05 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
278625 ns |
279500 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA |
190977 ns |
186917 ns |
1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) |
3386062.5 ns |
3287958 ns |
1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) |
3008417 ns |
2909792 ns |
1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) |
2792479.5 ns |
3017771 ns |
0.93 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) |
4070250 ns |
3935292 ns |
1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA |
583066 ns |
579655 ns |
1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) |
7653479 ns |
7602875 ns |
1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) |
7452208 ns |
7372333 ns |
1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) |
7371125 ns |
7461313 ns |
0.99 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) |
8216458 ns |
8220167 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA |
1363708.5 ns |
1357048 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) |
17481125 ns |
17533125 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) |
17510645.5 ns |
17557125 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) |
17683667 ns |
17531667 ns |
1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) |
14104479 ns |
9214250 ns |
1.53 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23681000 ns |
23446917 ns |
1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
34596792 ns |
43586125 ns |
0.79 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
40952833 ns |
37247062.5 ns |
1.10 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
34800500 ns |
35028291.5 ns |
0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1851452 ns |
1855921.5 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
189244000 ns |
189114500 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
164689667 ns |
178190333 ns |
0.92 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
158601875 ns |
153393396 ns |
1.03 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
433416000 ns |
434855500 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
13922081 ns |
13947546 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
288413875 ns |
290046875 ns |
0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
252370333 ns |
271392771 ns |
0.93 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
301765042 ns |
284812041.5 ns |
1.06 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
473272917 ns |
473569708.5 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
21791 ns |
23021 ns |
0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
22333 ns |
22458 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
24062.5 ns |
23625 ns |
1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
21292 ns |
22708 ns |
0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
97925 ns |
96516 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
103541.5 ns |
115458.5 ns |
0.90 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
103292 ns |
103250 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
109291 ns |
104375 ns |
1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
113041 ns |
105042 ns |
1.08 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
508499 ns |
508001.5 ns |
1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6166 ns |
5750 ns |
1.07 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
6250 ns |
6500 ns |
0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7042 ns |
6708 ns |
1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6375 ns |
6125 ns |
1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
68955 ns |
68991.5 ns |
1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
15167 ns |
14042 ns |
1.08 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15125 ns |
15500 ns |
0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
16584 ns |
15687.5 ns |
1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
15084 ns |
14500 ns |
1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
480887 ns |
478721 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
2994333 ns |
2979083.5 ns |
1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
2069146 ns |
2084000 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
2295125 ns |
2281500 ns |
1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
4819500 ns |
4814250 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
585526 ns |
585630.5 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
23646833 ns |
23560375 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
18084583 ns |
18266583.5 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
17337062.5 ns |
16959209 ns |
1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
34862479.5 ns |
34863041.5 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
2762523 ns |
2766675 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
33429458 ns |
33305667 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
27591875 ns |
27994104 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
27784812.5 ns |
27448959 ns |
1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
41729875 ns |
40756916 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
72166 ns |
74000 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
72833 ns |
73333 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
75292 ns |
74917 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
74187.5 ns |
74500 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
103940 ns |
104050 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
313312.5 ns |
218083 ns |
1.44 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
218000 ns |
210625 ns |
1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
219687.5 ns |
296708.5 ns |
0.74 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
294250 ns |
217792 ns |
1.35 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
554067 ns |
558286.5 ns |
0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
12395.5 ns |
11750 ns |
1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
11833 ns |
12417 ns |
0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
13042 ns |
12458.5 ns |
1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
12125 ns |
11834 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
72713.5 ns |
72847.5 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
26395.5 ns |
26125 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
28083.5 ns |
27167 ns |
1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
28229.5 ns |
27375 ns |
1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
26375 ns |
26458 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
477878 ns |
484580 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
12750 ns |
11583 ns |
1.10 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
14625 ns |
12167 ns |
1.20 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
14166 ns |
14000 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
12520.5 ns |
11792 ns |
1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
54116 ns |
55176 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
26584 ns |
25542 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
25542 ns |
26417 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
26458 ns |
28709 ns |
0.92 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
26125 ns |
26042 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
308798 ns |
307604.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
180166 ns |
179208 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
181625 ns |
181042 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
182208 ns |
184333.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
179875 ns |
179416 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
58421.5 ns |
57654 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
584145.5 ns |
590646 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
583250 ns |
591479 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
593083 ns |
593500 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
584062.5 ns |
582749.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
291155 ns |
291261 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7167 ns |
6083.5 ns |
1.18 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
6458 ns |
6375 ns |
1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7375 ns |
6708 ns |
1.10 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6458 ns |
6292 ns |
1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
71417 ns |
71643 ns |
1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
13958 ns |
14250 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14042 ns |
15167 ns |
0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
15625 ns |
15292 ns |
1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14042 ns |
14042 ns |
1 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
465468.5 ns |
470922.5 ns |
0.99 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) |
1214770.5 ns |
1203770.5 ns |
1.01 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) |
1241021 ns |
1236645.5 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) |
1278208 ns |
1343083 ns |
0.95 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) |
1017875 ns |
1024395.5 ns |
0.99 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA |
300851 ns |
300123 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) |
4131709 ns |
4091000 ns |
1.01 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) |
4413167 ns |
4576917 ns |
0.96 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) |
4749833 ns |
4574875.5 ns |
1.04 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) |
3696958 ns |
3718250 ns |
0.99 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA |
1051856 ns |
1038641 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1833 ns |
1875 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1834 ns |
1875 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1834 ns |
1875 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1834 ns |
1875 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
24180 ns |
23874.5 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
5458 ns |
5083 ns |
1.07 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4875 ns |
5000 ns |
0.97 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4958 ns |
4959 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4917 ns |
4875 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
193923 ns |
193867 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6208 ns |
5500 ns |
1.13 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
5666 ns |
5709 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7166 ns |
6875 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5375 ns |
5416 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
56537.5 ns |
57200 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
11833 ns |
11042 ns |
1.07 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10708 ns |
11584 ns |
0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
11750 ns |
11500 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10625 ns |
10625 ns |
1 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
335261.5 ns |
332575 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
375 ns |
0.78 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
334 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
333 ns |
334 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
23408 ns |
22978 ns |
1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2750 ns |
2834 ns |
0.97 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2833 ns |
2792 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
3083 ns |
3000 ns |
1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2709 ns |
2833 ns |
0.96 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
163803 ns |
163496 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11750 ns | 11625 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11542 ns | 11292 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13625 ns | 12875 ns | 1.06 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11334 ns | 11209 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 57990 ns | 58225 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25375 ns | 24958 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24917 ns | 25208 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25083 ns | 25375 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 25083 ns | 25042 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 298812.5 ns | 299318 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4209 ns | 4250 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4250 ns | 4250 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4333 ns | 4250 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4250 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 25294 ns | 25190 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16458 ns | 16209 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16209 ns | 16083 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16375 ns | 16625 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16459 ns | 16500 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 200954 ns | 202972 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5833 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5834 ns | 5792 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5833 ns | 5959 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5792 ns | 5792 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34544 ns | 34611 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 21709 ns | 20625 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20666 ns | 21042 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21333 ns | 21083 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20583 ns | 20125 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 177621.5 ns | 178483.5 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 423313 ns | 414125 ns | 1.02 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 382813 ns | 367771 ns | 1.04 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 478750 ns | 480813 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 103687 ns | 104146 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67974 ns | 67750.5 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 940771 ns | 927125 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 972041 ns | 964354 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1199312 ns | 1186833 ns | 1.01 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 458417 ns | 376584 ns | 1.22 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 192965.5 ns | 192974.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 78666 ns | 77583 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80042 ns | 79125 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83250 ns | 83542 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81417 ns | 79958 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194531 ns | 193934 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1927417 ns | 1917959 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1916667 ns | 1933541 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1941229 ns | 1931521.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1911792 ns | 1860375 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 399292 ns | 392771 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22487 ns | 22416 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1875 ns | 1792 ns | 1.05 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1834 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1875 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1833 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 176001 ns | 174762 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6875 ns | 6562.5 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6666 ns | 6417 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8250 ns | 8166 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6291 ns | 6208 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 62036.5 ns | 59227 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9667 ns | 9292 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9291 ns | 9250 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9458 ns | 9375 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9292 ns | 9083 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 312381 ns | 304901.5 ns | 1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 119519083 ns | 120543687.5 ns | 0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173659083 ns | 181954416.5 ns | 0.95 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 155017583.5 ns | 148126750 ns | 1.05 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 109227833 ns | 106134709 ns | 1.03 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5478400.5 ns | 5492614.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 610863958.5 ns | 609833750 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 554853375 ns | 578593208 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 467181500 ns | 451045708.5 ns | 1.04 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 631585749.5 ns | 627478333.5 ns | 1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 35037360 ns | 35107131 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 659103917 ns | 652518625 ns | 1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 666596750.5 ns | 683671437.5 ns | 0.98 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 598853833 ns | 587115583.5 ns | 1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 859036125 ns | 852245209 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58958 ns | 58000 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47709 ns | 39209 ns | 1.22 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 39167 ns | 48208 ns | 0.81 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83917 ns | 85167 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38470 ns | 38635 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1931666.5 ns | 1920104 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1970291 ns | 1988000 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1989437.5 ns | 1980667 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1896979 ns | 1907896 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 176795 ns | 176329 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 271000 ns | 267041 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 269167 ns | 270500 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 268125 ns | 268750 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268729.5 ns | 265291 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 132025 ns | 123893.5 ns | 1.07 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 693354 ns | 596166 ns | 1.16 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 590292 ns | 698625 ns | 0.84 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 693583 ns | 702916.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 586792 ns | 589292 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 704990 ns | 677537.5 ns | 1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2220146 ns | 2180187.5 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2101041 ns | 2215229 ns | 0.95 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2194416 ns | 2212000 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2228875 ns | 2207792 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 134834 ns | 133207 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5534354 ns | 5497667 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5511458 ns | 5581500 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5557417 ns | 5516125 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5496958.5 ns | 5545124.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 743902.5 ns | 717120 ns | 1.04 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 650208 ns | 656041 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 638375 ns | 642917 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 646333 ns | 637375 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 643833 ns | 644167 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47191 ns | 46463 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2012645.5 ns | 1822875 ns | 1.10 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1722083 ns | 1668958.5 ns | 1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1693917 ns | 1723334 ns | 0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2087875 ns | 2101084 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 227988 ns | 222123 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58333 ns | 57667 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47292 ns | 38708 ns | 1.22 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38583 ns | 46916 ns | 0.82 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 80333 ns | 85084 ns | 0.94 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 29022 ns | 28664 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2030250 ns | 2028604.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2080125 ns | 2097916.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2106500 ns | 2087625 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1993958.5 ns | 2005812 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 192771 ns | 188609 ns | 1.02 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13319937.5 ns | 13343604 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12424667 ns | 12536250 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12622916.5 ns | 12547834 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15202125 ns | 15250271 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 514648 ns | 510611.5 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47233125 ns | 47204500 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41735771 ns | 41927292 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41178479 ns | 40799666 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58297229 ns | 58864104 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2892057 ns | 2889030 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 98259354 ns | 73523334 ns | 1.34 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 90797500.5 ns | 91557750 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 90704083 ns | 90571250.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 75491375 ns | 75976041 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59458 ns | 58083 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47334 ns | 38875 ns | 1.22 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38916 ns | 47709 ns | 0.82 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83979.5 ns | 82042 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 47087 ns | 48950 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1910084 ns | 1916542 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1963437.5 ns | 1982083 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1980875 ns | 1947333 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1888604.5 ns | 1876854 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 191058.5 ns | 195268 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 375 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 31728 ns | 32997 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 5834 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6125 ns | 6500 ns | 0.94 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6709 ns | 6458.5 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 5875 ns | 5958 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 169690 ns | 171034 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31889 ns | 32918 ns | 0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2917 ns | 2750 ns | 1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2667 ns | 2750 ns | 0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2917 ns | 2917 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2584 ns | 2625 ns | 0.98 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 160305 ns | 161268 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 284340292 ns | 286917729.5 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 340149791 ns | 347948583.5 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 322556750 ns | 314136145.5 ns | 1.03 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 270136688 ns | 267700542 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 7102249 ns | 7080984 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1011930542 ns | 1009676125 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 955985083 ns | 974877416 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 866325687.5 ns | 854637270.5 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1259848375 ns | 1260982959 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 33979712 ns | 34048271 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1690071000 ns | 1387098104 ns | 1.22 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1662883667 ns | 1694333625 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1620035750 ns | 1631003167 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1367046916.5 ns | 1358038896 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1412125 ns | 1411604.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1415208 ns | 1409250 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1449250 ns | 1407354.5 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1460625 ns | 1405916 ns | 1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 127339 ns | 128067 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5016458 ns | 5023999.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5018750 ns | 5051396 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5046354 ns | 5029104.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5022583 ns | 5040479 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 544866 ns | 514176 ns | 1.06 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 176280167 ns | 170919250 ns | 1.03 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 132331875 ns | 183735542 ns | 0.72 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 135736958 ns | 115460229.5 ns | 1.18 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 167836916 ns | 168486416 ns | 1.00 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 4881636 ns | 4853309 ns | 1.01 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 623422625 ns | 627387000 ns | 0.99 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 493161333 ns | 561666625 ns | 0.88 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 484744167 ns | 453969542 ns | 1.07 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 644784250 ns | 654142166 ns | 0.99 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16297774 ns | 17017885 ns | 0.96 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8928354 ns | 8912729 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8925125 ns | 9063708 ns | 0.98 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 8041770.5 ns | 7941979 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9741417 ns | 9820979.5 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1612242.5 ns | 1590505 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36167375 ns | 36015084 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37094208.5 ns | 38799959 ns | 0.96 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 34727479.5 ns | 33679959 ns | 1.03 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37824167 ns | 37936417 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6467622 ns | 6472671 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47541 ns | 47459 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47375 ns | 47708 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47792 ns | 47625 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 49416 ns | 47209 ns | 1.05 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 18159 ns | 17832 ns | 1.02 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50417 ns | 50416 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50542 ns | 50292 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50625 ns | 50458 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 52291 ns | 50291 ns | 1.04 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 191677.5 ns | 162828 ns | 1.18 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6916 ns | 6208 ns | 1.11 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6917 ns | 7083 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8167 ns | 7562.5 ns | 1.08 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6667 ns | 6292 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 96135.5 ns | 74130 ns | 1.30 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10250 ns | 9375 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10042 ns | 10250 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10458 ns | 10375 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9833 ns | 9917 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 546102.5 ns | 422862.5 ns | 1.29 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6125 ns | 5666 ns | 1.08 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6083 ns | 6500 ns | 0.94 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8562.5 ns | 6916 ns | 1.24 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5416 ns | 5375 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 107028 ns | 78877.5 ns | 1.36 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13375 ns | 12875 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13209 ns | 13583 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13625 ns | 13583 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12833 ns | 13208 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 472048.5 ns | 370972.5 ns | 1.27 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1042 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 31885 ns | 33127 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8042 ns | 7792 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7875 ns | 8167 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8458 ns | 8083 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7792 ns | 7792 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 202616 ns | 187081.5 ns | 1.08 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23250 ns | 23333 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23375 ns | 23417 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23542 ns | 23583 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23625 ns | 23084 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 18255 ns | 18527 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 60125 ns | 52042 ns | 1.16 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52708 ns | 52750 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 53000 ns | 52875 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52417 ns | 52542 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 275901.5 ns | 204233 ns | 1.35 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1447666 ns | 1398875 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1396396 ns | 1455625 ns | 0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1414541.5 ns | 1404042 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1407958 ns | 1406584 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 195966 ns | 196492.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5024417 ns | 4999875 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5002833.5 ns | 5037708 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5034625 ns | 5003083 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5015916 ns | 5024916 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 574215 ns | 495167 ns | 1.16 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3068312.5 ns | 3047396 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2054937 ns | 2106521 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2303167 ns | 2296895.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4878104 ns | 4962229.5 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 579217 ns | 583841 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24460958 ns | 24384458 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18917458 ns | 19075709 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 18139458 ns | 17765562.5 ns | 1.02 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35387896 ns | 35955916.5 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2842977 ns | 2836787 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34194542 ns | 33991937.5 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28322916 ns | 28748917 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28354250 ns | 28081042 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41556541.5 ns | 41668854.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 145015458 ns | 142678458 ns | 1.02 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 147469958 ns | 147270333 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 128511354 ns | 126985770.5 ns | 1.01 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 172953062.5 ns | 174826021 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22761553 ns | 22556485 ns | 1.01 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 915634292 ns | 1026522125 ns | 0.89 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1051554709 ns | 866022875.5 ns | 1.21 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 715776375 ns | 743843334 ns | 0.96 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 670007458 ns | 682878792 ns | 0.98 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118504283 ns | 116543149 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 73166 ns | 76083 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73750 ns | 76250 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 83417 ns | 77625 ns | 1.07 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74229.5 ns | 75833.5 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 235784.5 ns | 163749.5 ns | 1.44 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 293479.5 ns | 275437.5 ns | 1.07 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 281667 ns | 283542 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 288396 ns | 275959 ns | 1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 278875 ns | 282375 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1139500 ns | 882740 ns | 1.29 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
35639208 ns |
35483000 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
36273062.5 ns |
36565000 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
32703375 ns |
32543896 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
40346792 ns |
40679500 ns |
0.99 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5844270 ns |
5828412 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
149434959 ns |
147536708 ns |
1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
152415104 ns |
157209875 ns |
0.97 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
142330750.5 ns |
136063312.5 ns |
1.05 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
285092041 ns |
286255000 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
34884663.5 ns |
34875549.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
121668666.5 ns |
122158104.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
174019875 ns |
181447688 ns |
0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
155187042 ns |
147872917 ns |
1.05 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
105181771 ns |
104774833.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5457894.5 ns |
5433572 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
470857167 ns |
468969166 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
466842750 ns |
487732687.5 ns |
0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
453615958.5 ns |
437061208 ns |
1.04 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
741330708 ns |
745602708 ns |
0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
32286234.5 ns |
31632434 ns |
1.02 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
709843583.5 ns |
708533125.5 ns |
1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
639403187.5 ns |
662068729.5 ns |
0.97 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
638067958 ns |
625681375 ns |
1.02 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
851157083 ns |
856533500 ns |
0.99 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) |
1317083 ns |
1243917 ns |
1.06 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) |
959500 ns |
778625 ns |
1.23 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) |
786666 ns |
961709 ns |
0.82 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) |
2054875 ns |
2098041.5 ns |
0.98 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA |
571154.5 ns |
581626.5 ns |
0.98 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) |
2984479 ns |
2966062.5 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) |
2598125 ns |
2513979 ns |
1.03 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) |
2526417 ns |
2620167 ns |
0.96 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) |
3683833 ns |
3551916 ns |
1.04 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA |
1687708.5 ns |
1532656 ns |
1.10 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) |
5816500 ns |
5803146 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) |
5788625 ns |
5896375 ns |
0.98 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) |
5895750 ns |
5798708 ns |
1.02 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) |
2907313 ns |
2924083 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7500 ns |
7083 ns |
1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6042 ns |
5291 ns |
1.14 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5209 ns |
6208 ns |
0.84 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10000 ns |
10166 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
24719 ns |
25159 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
214000.5 ns |
212500 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221583 ns |
220625 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
249541 ns |
220709 ns |
1.13 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
206375.5 ns |
213625 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
241285.5 ns |
199491.5 ns |
1.21 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) |
298836125 ns |
297113041 ns |
1.01 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) |
217317167 ns |
291058458 ns |
0.75 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) |
222687250 ns |
193310291.5 ns |
1.15 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) |
305332729 ns |
304396812.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA |
7669579.5 ns |
7678125.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) |
1232586187.5 ns |
1231332166.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) |
895853854.5 ns |
973933875 ns |
0.92 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) |
856606125 ns |
836913500 ns |
1.02 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) |
1147583834 ns |
1148765416.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA |
26801390 ns |
26856489.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5667 ns |
4792 ns |
1.18 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5417 ns |
5875 ns |
0.92 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7041.5 ns |
6354 ns |
1.11 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4708 ns |
4667 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
139155 ns |
93183 ns |
1.49 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7750 ns |
7000 ns |
1.11 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7209 ns |
7625 ns |
0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7791 ns |
7458 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7166.5 ns |
7395.5 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
578343 ns |
440751 ns |
1.31 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
584 ns |
500 ns |
1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
667 ns |
0.94 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
584 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
541 ns |
500 ns |
1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
23543 ns |
24653 ns |
0.95 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9375 ns |
8625 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9167 ns |
9500 ns |
0.96 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
9583 ns |
9917 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
8750 ns |
8792 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
205792 ns |
176547.5 ns |
1.17 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
351500 ns |
353584 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
352208 ns |
353833 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
356521 ns |
352208 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
366209 ns |
351500 ns |
1.04 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
21137 ns |
21275 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
956604.5 ns |
807916.5 ns |
1.18 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
812375 ns |
789854 ns |
1.03 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
824124.5 ns |
776042 ns |
1.06 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
827500 ns |
778833 ns |
1.06 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
249327.5 ns |
215262.5 ns |
1.16 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
341833 ns |
339229 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
340937.5 ns |
321000 ns |
1.06 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
442583 ns |
454187 ns |
0.97 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
11417 ns |
10916 ns |
1.05 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
18313 ns |
18631 ns |
0.98 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
724916 ns |
714125 ns |
1.02 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
726875 ns |
731625 ns |
0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
1024666 ns |
1006333 ns |
1.02 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
26917 ns |
26667 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
230289 ns |
196596.5 ns |
1.17 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
377521 ns |
381833.5 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
350437 ns |
330959 ns |
1.06 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
445000 ns |
444916.5 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
32333 ns |
31417 ns |
1.03 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
23090.5 ns |
23162 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
739895.5 ns |
727875 ns |
1.02 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
782083 ns |
783542 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
1059958 ns |
1030146 ns |
1.03 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
105167 ns |
90750 ns |
1.16 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
212967.5 ns |
193002.5 ns |
1.10 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
3583 ns |
3583 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
3625 ns |
3709 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
3958 ns |
3625 ns |
1.09 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
3417 ns |
3375 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
17545 ns |
17634 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
4208 ns |
4291 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
4250 ns |
4208 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
4334 ns |
4333 ns |
1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
4167 ns |
4125 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
229598.5 ns |
200435.5 ns |
1.15 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
4125 ns |
3500 ns |
1.18 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
3875 ns |
4167 ns |
0.93 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4625 ns |
4375 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3729.5 ns |
3583 ns |
1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
170139.5 ns |
151437.5 ns |
1.12 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8667 ns |
8458 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8583 ns |
8583 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9000 ns |
8333 ns |
1.08 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8625 ns |
8458 ns |
1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
1096811 ns |
927946.5 ns |
1.18 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
206125 ns |
204583 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
212625 ns |
209000 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
210417 ns |
210500 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
202042 ns |
199084 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
34232 ns |
35183 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
659375 ns |
602833.5 ns |
1.09 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
633083 ns |
629209 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
628854 ns |
625584 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
633437 ns |
582250 ns |
1.09 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
323068 ns |
266930.5 ns |
1.21 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
1008292 ns |
990542 ns |
1.02 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
1015979 ns |
1053625 ns |
0.96 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
974167 ns |
954292 ns |
1.02 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
862937.5 ns |
901104 ns |
0.96 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
206883.5 ns |
206789.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
4567708 ns |
4511208 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
4726583 ns |
4854542 ns |
0.97 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
4602042 ns |
4490209 ns |
1.02 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
4286146 ns |
4299083.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
942340.5 ns |
930739 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3791.5 ns |
3084 ns |
1.23 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3500 ns |
3500 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4458 ns |
4083.5 ns |
1.09 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
2959 ns |
3000 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
207895.5 ns |
144120 ns |
1.44 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7583 ns |
7250 ns |
1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7167 ns |
7333 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7750 ns |
7500 ns |
1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6958 ns |
7041 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
944054 ns |
806482 ns |
1.17 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1657834 ns |
1636250 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1192833 ns |
1158208.5 ns |
1.03 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1368874.5 ns |
1368083 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2488875 ns |
2308063 ns |
1.08 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
213098 ns |
214505 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12377666.5 ns |
12270583 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9592479.5 ns |
9567750 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9354583 ns |
9243645.5 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18043145.5 ns |
18134146 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
1952123 ns |
1954133 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17432291.5 ns |
17281250 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14401833 ns |
14453375 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14480062.5 ns |
14325333 ns |
1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21148458 ns |
21045500 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
88125 ns |
85708 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
89791 ns |
91520.5 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
92791 ns |
93250 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
135604 ns |
87833.5 ns |
1.54 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
125691 ns |
126207 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2058625 ns |
2017958 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2024187.5 ns |
2050542 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2032813 ns |
2029834 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2026709 ns |
2026959 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
947506 ns |
841405 ns |
1.13 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
1771 ns |
1375 ns |
1.29 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
2833 ns |
1917 ns |
1.48 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
2625 ns |
3583.5 ns |
0.73 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
2583 ns |
2375 ns |
1.09 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
16378 ns |
16017 ns |
1.02 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
2792 ns |
2875 ns |
0.97 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
2834 ns |
2833 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
2875 ns |
2750 ns |
1.05 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2875 ns |
2792 ns |
1.03 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
180377 ns |
165765.5 ns |
1.09 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7416 ns |
7208 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5792 ns |
5333 ns |
1.09 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5250 ns |
5958 ns |
0.88 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9958 ns |
10084 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
33352 ns |
34231 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
242166 ns |
214458 ns |
1.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221042 ns |
220042 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
249104 ns |
221416 ns |
1.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
242479 ns |
235834 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
308082 ns |
263066.5 ns |
1.17 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3750 ns |
3708 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3750 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3791 ns |
3750 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3708 ns |
3708 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22858 ns |
22879.5 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14625 ns |
14459 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14541 ns |
14375 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14292 ns |
14541 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14375 ns |
14500 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
438745.5 ns |
399546.5 ns |
1.10 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
89917 ns |
94312.5 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
93875 ns |
95875 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
96334 ns |
97583 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
133562.5 ns |
94354.5 ns |
1.42 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
125482 ns |
125486.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1940020.5 ns |
1919437.5 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1879854.5 ns |
1938250 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1920999.5 ns |
1927084 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1923000 ns |
1803750 ns |
1.07 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
919331 ns |
794850 ns |
1.16 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) |
873021 ns |
875354.5 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) |
816145.5 ns |
802104.5 ns |
1.02 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) |
1190583 ns |
1225042 ns |
0.97 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) |
955292 ns |
970374.5 ns |
0.98 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA |
275062.5 ns |
273954 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) |
2847292 ns |
2714354 ns |
1.05 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) |
2517562.5 ns |
2504167 ns |
1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) |
3362125 ns |
3360375 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) |
3400500 ns |
3360334 ns |
1.01 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA |
1539380 ns |
1467965 ns |
1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
15250 ns |
17542 ns |
0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
15104.5 ns |
16937.5 ns |
0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
19333 ns |
18708 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17542 ns |
14584 ns |
1.20 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
131045 ns |
129735 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
263292 ns |
214709 ns |
1.23 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
222083 ns |
215958.5 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
257020.5 ns |
215562.5 ns |
1.19 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
238583 ns |
217958 ns |
1.09 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
609944 ns |
539139.5 ns |
1.13 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
221709 ns |
223375 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
219709 ns |
220958 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
223479 ns |
222645.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
223167 ns |
219625 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
249010 ns |
217203.5 ns |
1.15 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
510083 ns |
495895.5 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
516687.5 ns |
506625 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
556313 ns |
510958 ns |
1.09 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
559312.5 ns |
561375 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1300363 ns |
1153506.5 ns |
1.13 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
4250 ns |
3917 ns |
1.09 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
3625 ns |
4667 ns |
0.78 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
7542 ns |
4834 ns |
1.56 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
4833 ns |
4833 ns |
1.00 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
17278 ns |
17326 ns |
1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
7792 ns |
7520.5 ns |
1.04 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
7209 ns |
7625 ns |
0.95 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
7375 ns |
7458 ns |
0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
7916 ns |
7417 ns |
1.07 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
183913 ns |
176736 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17458 ns |
16646 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
17500 ns |
18500 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
21166 ns |
19625 ns |
1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
18688 ns |
18042 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
158861.5 ns |
133143.5 ns |
1.19 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
223667 ns |
213000 ns |
1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
212375 ns |
212916 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
221937.5 ns |
213667 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
224145.5 ns |
224895.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
899820.5 ns |
820129 ns |
1.10 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5000 ns |
4354.5 ns |
1.15 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
4500 ns |
4625 ns |
0.97 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5729.5 ns |
4917 ns |
1.17 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
4292 ns |
3875 ns |
1.11 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
211323 ns |
175343 ns |
1.21 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10417 ns |
10208 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10229.5 ns |
10333 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
11541 ns |
10834 ns |
1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10167 ns |
10208 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
989117 ns |
980341 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
4250 ns |
3250 ns |
1.31 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3333 ns |
3687.5 ns |
0.90 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4542 ns |
4292 ns |
1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3084 ns |
2917 ns |
1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
225450 ns |
215866 ns |
1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7833 ns |
7166 ns |
1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7333 ns |
7625 ns |
0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8167 ns |
7792 ns |
1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7375 ns |
7375 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
1035497 ns |
1015020 ns |
1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23624542 ns |
23687417 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
33815500 ns |
42666354 ns |
0.79 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
41859750 ns |
37344478.5 ns |
1.12 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
34867229.5 ns |
34948333.5 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1840044 ns |
1824017 ns |
1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
184225583 ns |
183871416 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
161816833 ns |
182812313 ns |
0.89 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
151611708 ns |
145975437.5 ns |
1.04 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
274515291 ns |
274277542 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
16515936 ns |
16507012 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
277196958 ns |
273782791 ns |
1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
248143166.5 ns |
257949042 ns |
0.96 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
235344583.5 ns |
231995083.5 ns |
1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
322828854 ns |
323882958.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
182875 ns |
183541 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
184979.5 ns |
184000 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
187209 ns |
185292 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
184500 ns |
182542 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
204829 ns |
191911.5 ns |
1.07 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
602729 ns |
629458.5 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
596541 ns |
587334 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
634500.5 ns |
587125.5 ns |
1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
602958 ns |
649291 ns |
0.93 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1013698 ns |
963628 ns |
1.05 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3901958 ns | 3851750 ns | 1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3918875 ns | 3983792 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3498187.5 ns | 3579833 ns | 0.98 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4548000 ns | 4612292 ns | 0.99 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 533439 ns | 531156 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17561917 ns | 17385812.5 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17872583 ns | 18439958.5 ns | 0.97 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16935250 ns | 16577084 ns | 1.02 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 19951708 ns | 20232667 ns | 0.99 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2633316 ns | 2638769 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 666 ns | 625 ns | 1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32257.5 ns | 32361 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9708 ns | 9312.5 ns | 1.04 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 8791 ns | 9604.5 ns | 0.92 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9291 ns | 9541 ns | 0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8875 ns | 8750 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 250188 ns | 248738 ns | 1.01 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 649952937 ns | 650277229.5 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 391354542 ns | 513797917 ns | 0.76 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 391183375 ns | 364513416 ns | 1.07 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 746940334 ns | 753229708 ns | 0.99 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12477772 ns | 11759811 ns | 1.06 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1887496208.5 ns | 1878034500 ns | 1.01 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1648979250 ns | 1671899375 ns | 0.99 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1564983479.5 ns | 1507608416.5 ns | 1.04 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2203608375 ns | 2202946667 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49211631 ns | 49516620 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1635500 ns | 1535958.5 ns | 1.06 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1195584 ns | 1179292 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1387333 ns | 1380729.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2470375 ns | 2368083 ns | 1.04 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215198 ns | 215337 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12724458 ns | 12730083 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9938666.5 ns | 9937625 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9731979 ns | 9659583.5 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18426875 ns | 18459917 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2016352 ns | 2010689 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17702917 ns | 17677292 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14721708 ns | 14810083 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14768458 ns | 14573229.5 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21449083 ns | 21483000 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 34500 ns | 26292 ns | 1.31 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26292 ns | 26250 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26458 ns | 26250 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26208 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24380 ns | 23665 ns | 1.03 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 75375 ns | 67166 ns | 1.12 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67208 ns | 66875 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67458 ns | 67250 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66958 ns | 66958 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 385848.5 ns | 367986.5 ns | 1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203667 ns | 204583 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210208 ns | 209292 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210125 ns | 210500 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199667 ns | 199625 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26371.5 ns | 26073 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 605375 ns | 613125 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 633250 ns | 625459 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 631166 ns | 633583 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 630666.5 ns | 632083 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 330545 ns | 320857.5 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 640083 ns | 592750 ns | 1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 639042 ns | 647000 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 637792 ns | 648834 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 677458.5 ns | 671792 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132723 ns | 131354 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2068854 ns | 2247291 ns | 0.92 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2257375 ns | 2303208 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2298667 ns | 2243604 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2236625.5 ns | 2314875.5 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1163408.5 ns | 1083962 ns | 1.07 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17542 ns | 16687.5 ns | 1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17750 ns | 18458 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19958 ns | 19770.5 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18417 ns | 18146 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 134481 ns | 132087.5 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225354.5 ns | 229375 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230375 ns | 262896 ns | 0.88 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 259167 ns | 231208 ns | 1.12 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257500 ns | 258624.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 992082 ns | 885149.5 ns | 1.12 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 625 ns | 0.93 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23827 ns | 23686 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9937.5 ns | 8708 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9709 ns | 10000 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10083 ns | 10000 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9334 ns | 9250 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 248833.5 ns | 241904 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5834 ns | 5417 ns | 1.08 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5458 ns | 5583 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7084 ns | 6417 ns | 1.10 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4791 ns | 4770.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 213713 ns | 194851.5 ns | 1.10 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7792 ns | 7667 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7416 ns | 7417 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7792 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6834 ns | 7250 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 743208.5 ns | 705733 ns | 1.05 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2208 ns | 2167 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2291 ns | 2208 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2709 ns | 2542 ns | 1.07 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2500 ns | 2208 ns | 1.13 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 18412 ns | 17804 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 7125 ns | 6541 ns | 1.09 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6500 ns | 6500 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6895.5 ns | 6875 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6417 ns | 6417 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 307058 ns | 294742 ns | 1.04 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 748959 ns | 746916 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746833.5 ns | 761333 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 750750 ns | 750541 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 761250 ns | 749459 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 21256 ns | 20924 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 778417 ns | 790875 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 792500 ns | 777375 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 789250 ns | 792500 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 813708.5 ns | 778250 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 344148.5 ns | 268681.5 ns | 1.28 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7459 ns | 7375 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5916 ns | 5250 ns | 1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5334 ns | 5875 ns | 0.91 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10167 ns | 10292 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33664 ns | 32725 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 233104 ns | 219208 ns | 1.06 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 238000 ns | 230937.5 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 266729 ns | 236625 ns | 1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 256437.5 ns | 214312.5 ns | 1.20 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 340384 ns | 332717.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11000 ns | 10291 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10667 ns | 10937.5 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11687.5 ns | 10625 ns | 1.10 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9875 ns | 9916 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 229936 ns | 219475.5 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24917 ns | 24416 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24458 ns | 25417 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25229.5 ns | 24875 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24750 ns | 24354.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1065422 ns | 1060762 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106345834 ns | 106190416 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 116679584 ns | 126215417 ns | 0.92 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 124892167 ns | 120200125 ns | 1.04 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 121182250 ns | 117655917 ns | 1.03 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2640967 ns | 2587994 ns | 1.02 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 372829625 ns | 395454916.5 ns | 0.94 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 365975167 ns | 372350083.5 ns | 0.98 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 359810083 ns | 355285895.5 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 540815125 ns | 542892500 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15188698.5 ns | 15209611 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 789897584 ns | 607219000 ns | 1.30 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 753438333 ns | 775694542 ns | 0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 749396312 ns | 743546708 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 605212292 ns | 606917208 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8333 ns | 6729.5 ns | 1.24 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6750 ns | 7458 ns | 0.91 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8625 ns | 8791 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6250 ns | 6084 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 219257.5 ns | 214170 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15042 ns | 14645.5 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13959 ns | 14167 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14791 ns | 14334 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13292 ns | 13417 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 990845 ns | 1010027 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6417 ns | 6042 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6708.5 ns | 0.91 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8395.5 ns | 6958 ns | 1.21 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5250 ns | 5166.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 213956 ns | 211003 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13208 ns | 12916 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12500 ns | 12979.5 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13583 ns | 13041 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12646 ns | 12375 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 727501.5 ns | 725511 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 5667 ns | 5792 ns | 0.98 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 5667 ns | 6084 ns | 0.93 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 6625 ns | 7166 ns | 0.92 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 6084 ns | 5979.5 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17881 ns | 16985 ns | 1.05 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 15583 ns | 16375 ns | 0.95 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 15791 ns | 15917 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 15833 ns | 15750 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 15833 ns | 15750 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 190525 ns | 184955.5 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 417 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 417 ns | 0.90 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23435 ns | 23469 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6583 ns | 6375 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6333 ns | 6292 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6666 ns | 6458 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6083 ns | 6020.5 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 231277 ns | 226513 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6041 ns | 5917 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5917 ns | 6000 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5958 ns | 6083 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5833 ns | 5833 ns | 1 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24889 ns | 24637 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21459 ns | 21375 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 20750 ns | 21083 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21708 ns | 21167 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 20792 ns | 20875 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 253486 ns | 248819 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 141812.5 ns | 144938 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 146833 ns | 147666 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 147959 ns | 147500 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 183791 ns | 144208 ns | 1.27 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167855.5 ns | 166863.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1329291.5 ns | 1328917 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1316334 ns | 1366916.5 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1364479.5 ns | 1323667 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1330958 ns | 1330125 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1258549 ns | 1231201 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22271 ns | 21917 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24583 ns | 23250 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25229.5 ns | 25417 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21708.5 ns | 24583 ns | 0.88 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 319594.5 ns | 261684.5 ns | 1.22 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 126312.5 ns | 126249.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 182459 ns | 132125 ns | 1.38 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 141771 ns | 180458 ns | 0.79 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 180167 ns | 182166 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1369996 ns | 1329052 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 334 ns | 1.12 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23454 ns | 23064 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6958 ns | 6417 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6500 ns | 6500 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6792 ns | 6583 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6125 ns | 6083 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 246852.5 ns | 241726 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4917 ns | 4583 ns | 1.07 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4959 ns | 4875 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5500 ns | 5062.5 ns | 1.09 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4604.5 ns | 4375 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 232810 ns | 230879.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10417 ns | 9792 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10167 ns | 10375 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10625 ns | 10333 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10500 ns | 10125 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1282482 ns | 1281938 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2083 ns | 1584 ns | 1.32 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1584 ns | 1625 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1584 ns | 1583 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23442 ns | 23016.5 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6208 ns | 5709 ns | 1.09 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5791 ns | 5750 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6042 ns | 6042 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5625 ns | 5625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 267763 ns | 260870.5 ns | 1.03 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6859729 ns | 6736854 ns | 1.02 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6395458 ns | 6358292 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6542021 ns | 6526333 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7519417 ns | 7511917 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214734 ns | 214549 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24102042 ns | 24072542 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21283520.5 ns | 21309271.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21067770.5 ns | 21010584 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29766375 ns | 29840125 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2099955 ns | 2110310.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48794625 ns | 37228250 ns | 1.31 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45571125 ns | 45827250 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45912708.5 ns | 45480416 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 38124542 ns | 38465479 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 5708 ns | 1.12 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5666 ns | 5708 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7208 ns | 6729.5 ns | 1.07 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5145.5 ns | 5208.5 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 213202 ns | 215925.5 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9417 ns | 8833 ns | 1.07 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8375 ns | 8417 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9250 ns | 8625 ns | 1.07 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7812.5 ns | 8145.5 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 998630 ns | 1004537.5 ns | 0.99 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1544562 ns | 1503813 ns | 1.03 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1262083 ns | 1243541.5 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1636479.5 ns | 1631312.5 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2152167 ns | 2004542 ns | 1.07 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 270829 ns | 280207 ns | 0.97 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7903687.5 ns | 7912062.5 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6594396 ns | 6650042 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7242104 ns | 7185875 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10442041 ns | 10076645.5 ns | 1.04 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1747104 ns | 1812720 ns | 0.96 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 370041.5 ns | 371770.5 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 370250.5 ns | 359708 ns | 1.03 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 454958 ns | 457000 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 23917 ns | 27125 ns | 0.88 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 42600 ns | 47414 ns | 0.90 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 736417 ns | 728042 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 807374.5 ns | 792916 ns | 1.02 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1085625 ns | 1060625 ns | 1.02 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 122042 ns | 122625 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 284873 ns | 280856 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397583 ns | 397666 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288083 ns | 213417 ns | 1.35 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 212416 ns | 288291 ns | 0.74 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 751666 ns | 754041 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43881 ns | 44363 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 665750 ns | 669875 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 530125 ns | 474875 ns | 1.12 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 473125 ns | 529792 ns | 0.89 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 973792 ns | 975625 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 189989 ns | 194646.5 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 540208 ns | 678312.5 ns | 0.80 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 668562.5 ns | 642583 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 641583 ns | 646625 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 681562.5 ns | 638374.5 ns | 1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131637 ns | 132515 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2469958 ns | 2433792 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2461917 ns | 2525125 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2504250 ns | 2458416 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2441083.5 ns | 2464167 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1190231 ns | 1286025 ns | 0.93 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 2666 ns | 4270.5 ns | 0.62 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 3625 ns | 2791 ns | 1.30 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 3667 ns | 4334 ns | 0.85 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 3812 ns | 3021 ns | 1.26 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16354 ns | 17018 ns | 0.96 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5542 ns | 5583 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5583 ns | 5542 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5666 ns | 5500 ns | 1.03 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5667 ns | 5584 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 185811 ns | 187936.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1466000 ns |
1463042 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1503166 ns |
1495875 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1495792 ns |
1503458 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1442209 ns |
1446334 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
39970 ns |
41308.5 ns |
0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5145583 ns |
5127000 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5007666.5 ns |
5300416.5 ns |
0.94 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5316458 ns |
5293458 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4993834 ns |
4725667 ns |
1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
195858 ns |
195229 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3709 ns |
3709 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3709 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3709 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3709 ns |
3708 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33333 ns |
33264.5 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15459 ns |
15250 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15334 ns |
15083 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15292 ns |
15417 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15250 ns |
15125 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
350828.5 ns |
350238 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
71333 ns |
71333 ns |
1 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
71250 ns |
71417 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
71292 ns |
71208 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
71375 ns |
71500 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
113509 ns |
112408 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
319417 ns |
318125 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
317625 ns |
327584 ns |
0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
331292 ns |
319500 ns |
1.04 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
317958 ns |
320333 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
193069 ns |
194166 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
1084 ns |
1000 ns |
1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
1000 ns |
1084 ns |
0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
1083 ns |
1125 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
1000 ns |
1000 ns |
1 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
23211 ns |
23803 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8292 ns |
8000 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7958 ns |
8417 ns |
0.95 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8208 ns |
8417 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7959 ns |
7708 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
246708.5 ns |
246141 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
509458.5 ns |
501979.5 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
482979.5 ns |
480104 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
563021 ns |
566979 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
212958 ns |
220416 ns |
0.97 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
129326 ns |
128980 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
1419062.5 ns |
1391667 ns |
1.02 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1446875 ns |
1479770.5 ns |
0.98 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1754146 ns |
1756604 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
865291.5 ns |
864792 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
277815 ns |
275170 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
416 ns |
375 ns |
1.11 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
417 ns |
417 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
333 ns |
292 ns |
1.14 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
31588 ns |
31717 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6750 ns |
6625 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6375 ns |
6542 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6687.5 ns |
6500 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6042 ns |
5958 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
248201.5 ns |
248251 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1720375 ns |
1776021 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1724916.5 ns |
1733687.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1768083 ns |
1727458 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1782333 ns |
1726125 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
168014.5 ns |
167904 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4374395.5 ns |
4363208 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4350708.5 ns |
4382750 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4408750.5 ns |
4374000 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4358000 ns |
4367334 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1090956.5 ns |
1079923 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
6500 ns |
6875 ns |
0.95 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
6583 ns |
6708 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7562.5 ns |
6792 ns |
1.11 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
6750 ns |
6666 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
20551 ns |
19517 ns |
1.05 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
48562.5 ns |
59895.5 ns |
0.81 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
32708 ns |
49208 ns |
0.66 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
69709 ns |
52583 ns |
1.33 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
70167 ns |
32417 ns |
2.16 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
195885 ns |
267079.5 ns |
0.73 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
17875 ns |
18084 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
17584 ns |
18292 ns |
0.96 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
19500 ns |
19709 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
18375 ns |
18292 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
18679 ns |
18390 ns |
1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
53375 ns |
53833 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
53375 ns |
53375 ns |
1 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
53708 ns |
53375 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
53500 ns |
53625 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
316824 ns |
319120 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
75125 ns |
75333 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
75229.5 ns |
75583 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
75458 ns |
75250 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
75417 ns |
75500 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46609 ns |
46304 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
322792 ns |
324291 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
324292 ns |
336479.5 ns |
0.96 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
338167 ns |
324708 ns |
1.04 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
324666 ns |
327458 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
209772.5 ns |
209708.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1489000 ns |
1487583 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1528333 ns |
1522083 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1521500 ns |
1529334 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1466041 ns |
1471333 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
51817 ns |
52335 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5117709 ns |
5126125 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5269375 ns |
5305125 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5316708 ns |
5295000 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4982625 ns |
4684000 ns |
1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
202268 ns |
202194.5 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
28333 ns |
28333 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
28209 ns |
28333 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
28291 ns |
28292 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
28167 ns |
28209 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
24489 ns |
24238 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
75000 ns |
66500 ns |
1.13 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
66417 ns |
66250 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
66791 ns |
66416 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66500 ns |
66625 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
495621 ns |
495044 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) |
1495291.5 ns |
1478812 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) |
1058459 ns |
933416.5 ns |
1.13 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) |
922042 ns |
1129625 ns |
0.82 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) |
2256500 ns |
2267917 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA |
575650.5 ns |
577563.5 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) |
3073042 ns |
3095187.5 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) |
2607125 ns |
2641125 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) |
2542479 ns |
2747417 ns |
0.93 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) |
3810041.5 ns |
3815833.5 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA |
1918518 ns |
1965829 ns |
0.98 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) |
7939729 ns |
7798041 ns |
1.02 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) |
7913958 ns |
8017625 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) |
7954000 ns |
7904083.5 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) |
4793791.5 ns |
4861812 ns |
0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
78584 ns |
119833.5 ns |
0.66 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
78479 ns |
81604 ns |
0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
89125 ns |
82000 ns |
1.09 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
80500 ns |
80604 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194194 ns |
193857.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2032166.5 ns |
2020000 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2014104 ns |
2021083 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2057917 ns |
2024292 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2018458.5 ns |
1749917 ns |
1.15 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
739819 ns |
744082.5 ns |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
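As a reading aid for the table above: the Ratio column appears to be the current timing divided by the previous one, so values above 1 mean the current commit was slower. The sketch below recomputes the ratio for three rows copied from the table and flags large slowdowns. The `THRESHOLD` value and the helper names are hypothetical and are not taken from this repository's workflow configuration.

```python
# Minimal sketch: recompute the Ratio column from the Current and Previous timings.
# Assumption: Ratio = Current / Previous, which matches the rows shown in the table.

def ratio(current_ns, previous_ns):
    """Return current/previous; values above 1 mean the current commit is slower."""
    return current_ns / previous_ns

# (benchmark name, current ns, previous ns), copied from the table above.
rows = [
    ("layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)", 540208, 678312.5),
    ("batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)", 2666, 4270.5),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)", 70167, 32417),
]

# Hypothetical cut-off for calling a row a slowdown worth a second look.
THRESHOLD = 1.5

def flag_slowdowns(rows, threshold):
    """Return (name, rounded ratio) for rows whose ratio exceeds the threshold."""
    return [
        (name, round(ratio(cur, prev), 2))
        for name, cur, prev in rows
        if ratio(cur, prev) > threshold
    ]

flagged = flag_slowdowns(rows, THRESHOLD)
for name, r in flagged:
    print(f"{r:.2f}  {name}")
```

Sub-microsecond and few-microsecond benchmarks in this table swing widely between thread counts (for example, the `bias_activation(512, act=relu)` zygote rows range from 0.66 to 2.16), so a single flagged row is probably better treated as a prompt to rerun than as proof of a regression.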
For #1143, we apply a manual patch to the optimizers and print a warning when an optimizer can't be patched yet.
2832a6c to 896675c
41ef46f to d6cd9c8
919da19 to 960a49a
test: try fixing load order
revert: load order change
960a49a to 10e11e1
Needs the following upstream changes: