Commit

fix: remove old patches around reactant bug (#1135)
avik-pal authored Dec 15, 2024
1 parent d5e96cd commit d962073
Showing 2 changed files with 1 addition and 7 deletions.
2 changes: 1 addition & 1 deletion Project.toml
@@ -1,7 +1,7 @@
name = "Lux"
uuid = "b2108857-7c20-44ae-9111-449ecde12c47"
authors = ["Avik Pal <[email protected]> and contributors"]
-version = "1.4.1"
+version = "1.4.2"

[deps]
ADTypes = "47edcb42-4c32-4615-8424-f2b9edc5f35b"
6 changes: 0 additions & 6 deletions ext/LuxReactantExt/patches.jl
@@ -1,7 +1 @@
# For some reason xlogx and xlogy with boolean inputs leads to incorrect results sometimes
# XXX: Once https://github.com/EnzymeAD/Reactant.jl/pull/278 is merged and tagged
LuxOps.xlogx(x::TracedRNumber{Bool}) = zero(x)

function LuxOps.xlogy(x::TracedRNumber, y::TracedRNumber)
return invoke(LuxOps.xlogy, Tuple{Number, Number}, float(x), float(y))
end
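
For context on why the removed `Bool` override returned `zero(x)`: under the usual convention that `x * log(x)` is taken to be zero at `x == 0`, both possible boolean inputs give zero. The sketch below is only an illustration of that convention. The names `ref_xlogx` and `ref_xlogy` are hypothetical stand-ins, not the `LuxOps` implementations, and nothing here involves Reactant tracing.

```julia
# Hypothetical reference helpers (assumed semantics, not the LuxOps code):
# x * log(x) and x * log(y), with the result defined as zero when x == 0.
function ref_xlogx(x::Number)
    result = x * log(x)
    # ifelse evaluates both branches, so the NaN from 0 * -Inf is simply discarded
    return ifelse(iszero(x), zero(result), result)
end

function ref_xlogy(x::Number, y::Number)
    result = x * log(y)
    return ifelse(iszero(x), zero(result), result)
end

# Both Bool values map to zero:
ref_xlogx(false)     # x == 0, so the zero branch is selected
ref_xlogx(true)      # 1 * log(1) == 0
ref_xlogy(0.0, 0.0)  # x == 0 masks the NaN produced by 0 * log(0)
```

With these semantics, a dedicated method for boolean inputs is only a shortcut for the generic definition, which is consistent with the patch being removable once the upstream tracing bug is fixed.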

3 comments on commit d962073

@avik-pal
Member Author

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/121434

Tip: Release Notes

Did you know you can add release notes too? Just add Markdown-formatted text underneath the comment after the text
"Release notes:" and it will be added to the registry PR. If TagBot is installed, it will also be added to the
release that TagBot creates. For example:

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here, just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface, or via:

git tag -a v1.4.2 -m "<description of version>" d96207396f7e32c118114c78b4a0f53c2cbdcd34
git push origin v1.4.2

@github-actions
Contributor

Lux Benchmarks

Benchmark suite Current: d962073 Previous: 59c0c69 Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3833 ns 4041 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4250 ns 5209 ns 0.82
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4666 ns 5333 ns 0.87
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4041.5 ns 3937.5 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10459 ns 10250 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10417 ns 11083 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10083 ns 11375 ns 0.89
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10625 ns 10542 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1125 ns 1167 ns 0.96
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1375 ns 1292 ns 1.06
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1375 ns 1416 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1208 ns 1167 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3958 ns 4020.5 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4125 ns 4250 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4208 ns 4000 ns 1.05
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3958 ns 4166 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57917 ns 70208 ns 0.82
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46459 ns 58667 ns 0.79
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46750 ns 64125 ns 0.73
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82708 ns 79750 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2047958 ns 2033104 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2090000 ns 2103708 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2093917 ns 2094916 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1976812.5 ns 2002834 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 146708 ns 184125 ns 0.80
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 182667 ns 189792 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145833 ns 186063 ns 0.78
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 143583 ns 185125 ns 0.78
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1151625.5 ns 1118896 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1117646 ns 1163979 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1124084 ns 1120500 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1165146 ns 1129854 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3500 ns 3375 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4083 ns 3917 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4042 ns 5041 ns 0.80
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3916 ns 3333.5 ns 1.17
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9083 ns 9166 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9166 ns 9125 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9125 ns 9042 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8854.5 ns 8625 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17334 ns 19084 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18542 ns 15375 ns 1.21
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17834 ns 18375 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 16333 ns 14625 ns 1.12
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214916.5 ns 225917 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214541 ns 214542 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213500 ns 215125 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220667 ns 213000 ns 1.04
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 791 ns 0.79
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 583 ns 750 ns 0.78
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 541 ns 1.16
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1458 ns 1417 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1750 ns 1542 ns 1.13
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1458 ns 1833 ns 0.80
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1625 ns 1667 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 6208 ns 8917 ns 0.70
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5958 ns 6417 ns 0.93
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 8042 ns 0.75
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10208 ns 10334 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221042 ns 233625 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228959 ns 230375 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229375 ns 230166 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 223854.5 ns 225083 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 4000 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16708 ns 17333 ns 0.96
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 17083 ns 17125 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16875 ns 18416 ns 0.92
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16584 ns 16542 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 570250 ns 602459 ns 0.95
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 577041 ns 612791 ns 0.94
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 576958 ns 611250 ns 0.94
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 573916 ns 609583 ns 0.94
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1424354 ns 1422458 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1421125 ns 1432875 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1417666 ns 1432708.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1422417 ns 1421250 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1082874.5 ns 1073292 ns 1.01
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 969583.5 ns 969125 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1345833 ns 1355229 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1275270.5 ns 1303542 ns 0.98
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5772500 ns 5773875 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4552375 ns 4524834 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4981312.5 ns 4956520.5 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5767584 ns 5616459 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 541 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2208 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2208 ns 2167 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2250 ns 2167 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2084 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4167 ns 4250 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4375 ns 4125 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4875 ns 4708 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4500 ns 3833 ns 1.17
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11291 ns 11375 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11292 ns 11750 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12000 ns 11875 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11375 ns 11292 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6458 ns 6292 ns 1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6833 ns 6750 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8000 ns 7166 ns 1.12
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6875 ns 6333 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17083 ns 18312.5 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 19250 ns 18083 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17791.5 ns 19833 ns 0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17875 ns 18125 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 541 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8792 ns 8959 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8875 ns 8917 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 8916.5 ns 9083 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8209 ns 8542 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64500 ns 96500 ns 0.67
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64583 ns 96458 ns 0.67
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64250 ns 96666.5 ns 0.66
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64750 ns 96458 ns 0.67
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 285625 ns 282542 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 283375 ns 294792 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 276208.5 ns 278250 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 297500 ns 274042 ns 1.09
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3402333 ns 3410792 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3060583 ns 2893584 ns 1.06
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3019687.5 ns 3043771 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4056229 ns 3950938 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7721750 ns 7640458 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7459709 ns 7363916.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7439375.5 ns 7444583 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8277625 ns 8213291 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17593999.5 ns 17504417 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17466354 ns 17685667 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17549604.5 ns 17570042 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 9302166.5 ns 14113396 ns 0.66
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23554916.5 ns 23914500 ns 0.98
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33592458 ns 43551541 ns 0.77
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37227500 ns 37461209 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35248104 ns 34611021 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 188482416 ns 313175916 ns 0.60
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 164033541 ns 178521083 ns 0.92
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 153090042 ns 195096687.5 ns 0.78
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 443063541 ns 279780167 ns 1.58
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 290580729 ns 273572625 ns 1.06
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 257093729.5 ns 278931729 ns 0.92
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 296199833.5 ns 256343958 ns 1.16
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 482390645.5 ns 474930271 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22750 ns 21875 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24645.5 ns 22459 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23792 ns 23250 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21958 ns 21334 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 103459 ns 111375 ns 0.93
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104709 ns 104624.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 103916.5 ns 104666 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103729.5 ns 103604.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5834 ns 5833.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6083 ns 6041 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6625 ns 6667 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6209 ns 5875 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14667 ns 14834 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15020.5 ns 15792 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16020.5 ns 16375 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15250 ns 14834 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3027500 ns 3078645.5 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2071021 ns 2149083 ns 0.96
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2285333.5 ns 2304458.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4820958 ns 4677166 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23646313 ns 23611208 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18048395.5 ns 18335958 ns 0.98
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16906125 ns 17863458.5 ns 0.95
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35430208 ns 35453375 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33437292 ns 33321333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27650521 ns 27967958 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27492875 ns 27533500 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42564979.5 ns 41461333 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72854.5 ns 72791 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73458 ns 73083 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 74021 ns 81187.5 ns 0.91
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75000 ns 73875 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 303958 ns 316333.5 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219312.5 ns 318437.5 ns 0.69
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219042 ns 323125 ns 0.68
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 319666.5 ns 308937.5 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11500 ns 11625 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11959 ns 12083 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12416 ns 12125 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12208 ns 11959 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26083.5 ns 26834 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26104.5 ns 26959 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27209 ns 27958 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26750 ns 26791.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12166.5 ns 12625 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12645.5 ns 12604.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13500 ns 13958 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12875 ns 12208 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25750 ns 25958 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26459 ns 26333 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26375 ns 26583 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 25833 ns 26541 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182125 ns 179625 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180500 ns 179458 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183000 ns 183083 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 180375 ns 188958 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 581750 ns 595770.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 590708.5 ns 595666 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 609500 ns 584792 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 594250 ns 582042 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5625 ns 5959 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5958 ns 6375 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6500 ns 7125 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6250 ns 6042 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13917 ns 14166 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13916 ns 14917 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14583 ns 15625 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14291 ns 14458 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1196250 ns 1239500 ns 0.97
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1251708 ns 1321583 ns 0.95
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1274542 ns 1360666.5 ns 0.94
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1013000 ns 1089687 ns 0.93
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4142875 ns 4119041 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4864958 ns 4588250 ns 1.06
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4545520.5 ns 4571375 ns 0.99
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3911541.5 ns 3710875 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1834 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4834 ns 4959 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4917 ns 4916 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5000 ns 4875 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4916 ns 4917 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5250 ns 5792 ns 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5917 ns 6167 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6333 ns 7042 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6042 ns 5875 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11000 ns 11792 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11458 ns 11125 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11292 ns 11250 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11000 ns 10584 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 333 ns 334 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 334 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2708 ns 3000 ns 0.90
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3041 ns 3083 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2792 ns 3041 ns 0.92
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2709 ns 2625 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11167 ns 11875 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11667 ns 11833 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12375 ns 13042 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12083 ns 11292 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25083 ns 24959 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25416 ns 24979.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25167 ns 27250 ns 0.92
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24583 ns 24458 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4250 ns 4291 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4250 ns 4250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4250 ns 4166 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16375 ns 16500 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16417 ns 16166 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16250 ns 16541 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16042 ns 16250 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5791 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5791 ns 5750 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5875 ns 5834 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5791 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20375 ns 20792 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20479.5 ns 20959 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21208 ns 21459 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20854.5 ns 20542 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 427021 ns 412875 ns 1.03
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 388041 ns 375208 ns 1.03
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 475333 ns 487209 ns 0.98
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 107750 ns 146584 ns 0.74
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 885834 ns 916708.5 ns 0.97
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 960667 ns 989792 ns 0.97
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1182208 ns 1196125 ns 0.99
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 375875 ns 476875 ns 0.79
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80125 ns 135084 ns 0.59
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80750 ns 81542 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82167 ns 141833 ns 0.58
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80791 ns 135750 ns 0.60
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1942937 ns 1911291.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1918166.5 ns 1946333 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1916333 ns 1928333 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1923604 ns 1910834 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 333 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1833 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1833 ns 1792 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6167 ns 6625 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6792 ns 6792 ns 1
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7333 ns 7792 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6667 ns 6666 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8791.5 ns 9667 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9416 ns 9291 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9292 ns 9333 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9167 ns 9417 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 119015458 ns 111820937.5 ns 1.06
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173560375 ns 181915979 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148104416 ns 143480208 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 104510604 ns 92143250 ns 1.13
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 611899646 ns 614702333 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 555362500 ns 582318312.5 ns 0.95
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 453017291 ns 456793479.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 632276917 ns 623509562.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 666765667 ns 796858958 ns 0.84
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 666371104 ns 687543333 ns 0.97
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 582119812.5 ns 619636833 ns 0.94
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 866159459 ns 745741417 ns 1.16
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57541 ns 62834 ns 0.92
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47708 ns 47791 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46875 ns 53250 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84375 ns 83083 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1944250 ns 1923354 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1980416 ns 1992584 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1976042 ns 1986708.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1906083 ns 1895062.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267917 ns 266916.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 268292 ns 267354.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 267937.5 ns 268666 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267625 ns 264979 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 703792 ns 664125 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 681124.5 ns 694604.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 595667 ns 650292 ns 0.92
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 697208 ns 699958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2209437.5 ns 2256583 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2173708 ns 2246021 ns 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2200062 ns 2238750 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2113875 ns 2261771 ns 0.93
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5503083 ns 5510583 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5488667 ns 5590125 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5509792 ns 5513333 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5568042 ns 5481479.5 ns 1.02
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 638000 ns 669750 ns 0.95
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 645667 ns 680333 ns 0.95
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 647187.5 ns 678166 ns 0.95
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 644709 ns 674417 ns 0.96
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1827583 ns 1816770.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1720833 ns 1665417 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1720291 ns 1717645.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2097125 ns 2082542 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59166 ns 70125 ns 0.84
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47625 ns 59875 ns 0.80
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45833 ns 52958 ns 0.87
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84209 ns 82666 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2051584 ns 2037917 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2075395.5 ns 2108146 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2040667 ns 2092292 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2021583 ns 2001334 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13373292 ns 13460541.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12436750 ns 12543854 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12559270.5 ns 12654167 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 14986208.5 ns 15261812.5 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47390625 ns 47280959 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41705020.5 ns 42008521 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40992438 ns 40839333.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58725208 ns 58419750 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 73938270.5 ns 97048750 ns 0.76
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 90830563 ns 91157167 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90514083 ns 90856333.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 76122334 ns 76444354 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59916 ns 72334 ns 0.83
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47541 ns 47292 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47458 ns 65375 ns 0.73
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83500 ns 82584 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1948584 ns 1929937 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1954250 ns 1984583.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1965437.5 ns 1983584 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1888625 ns 1888750 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 417 ns 0.80
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 5979.5 ns 6541 ns 0.91
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6584 ns 6458 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6583 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6187.5 ns 5958 ns 1.04
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2666 ns 2917 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2834 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2792 ns 2875 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2666 ns 2625 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 286733687.5 ns 279890375 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339568833 ns 347812250 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 314522187.5 ns 310658166.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 270045166 ns 261239625 ns 1.03
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1015582292 ns 994066791 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 953582875 ns 960267958 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 840575375 ns 837209229.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1282644084 ns 1129871667 ns 1.14
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1419694479.5 ns 1752205958 ns 0.81
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1672572375 ns 1693119292 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1620047667 ns 1650193041 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1358918958.5 ns 1306363020.5 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1454458 ns 1458375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1408583 ns 1463959 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1410041.5 ns 1465625 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1442292 ns 1459625 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5055625 ns 5012416 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5019625 ns 5066791 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5009458 ns 5033750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5053667 ns 5030375 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 171675979 ns 158175666 ns 1.09
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 126429812.5 ns 166759458.5 ns 0.76
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 106760875 ns 90721479 ns 1.18
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 165741833.5 ns 151859250 ns 1.09
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 622640208 ns 669929250 ns 0.93
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 492172500 ns 560789291 ns 0.88
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 462809167 ns 487588708 ns 0.95
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 660164833 ns 651112083 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8982250 ns 8927708.5 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8969792 ns 9111000 ns 0.98
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7891125 ns 7978437.5 ns 0.99
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9977959 ns 10091416 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36106959 ns 36693146 ns 0.98
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37109917 ns 39523229 ns 0.94
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33736459 ns 34135874.5 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 39159896 ns 59280958 ns 0.66
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47375 ns 47437.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47500 ns 47500 ns 1
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47645.5 ns 47542 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47500 ns 47292 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50417 ns 50417 ns 1
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50875 ns 50458 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 51729 ns 50708 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50333 ns 50333 ns 1
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6583 ns 7145.5 ns 0.92
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7208 ns 7292 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7646 ns 8084 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7333 ns 6667 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9292 ns 10417 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10209 ns 10166 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10333 ns 10250 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10167 ns 9833 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5854.5 ns 6333 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6292 ns 6375 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6834 ns 7479.5 ns 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6166 ns 5166 ns 1.19
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12667 ns 13709 ns 0.92
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13208.5 ns 13000 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13459 ns 13542 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12958 ns 13416.5 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 1125 ns 0.89
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 1041 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1083 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7770.5 ns 8333 ns 0.93
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8125 ns 8084 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7834 ns 8042 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8250 ns 7875 ns 1.05
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23417 ns 23791.5 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23375 ns 23209 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23500 ns 23291 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23458 ns 23041 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52292 ns 52583 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52667 ns 52625 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52667 ns 52833 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52417 ns 52417 ns 1
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1448145.5 ns 1458625 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1457021 ns 1464021 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1402542 ns 1466000 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1403042 ns 1454708 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5036750 ns 5020749.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5020979 ns 5048500 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5021708 ns 5032583 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5042708.5 ns 5015271 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3054459 ns 3133854.5 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2092750 ns 2152167 ns 0.97
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2302708.5 ns 2319584 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4935833 ns 4994354 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24359708.5 ns 24444667 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18879875 ns 19072896 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17805083 ns 19040875 ns 0.94
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36477083 ns 36840083 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 34112104.5 ns 34088208 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28352833 ns 28581417 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27995625 ns 28009625 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42341709 ns 41680458.5 ns 1.02
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 143179166 ns 141268000 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 147785458 ns 143350625 ns 1.03
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126873458.5 ns 120743271 ns 1.05
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 172641167 ns 188129709 ns 0.92
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1416291312.5 ns 2324854792 ns 0.61
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1304509479 ns 841095084 ns 1.55
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1238526750 ns 1147318167 ns 1.08
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 685736000 ns 833862645.5 ns 0.82
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76042 ns 84125 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 79459 ns 78250 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76687 ns 76312 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75124.5 ns 71667 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 189375 ns 290458 ns 0.65
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 278000 ns 292000 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 289166.5 ns 305208 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 193709 ns 288208.5 ns 0.67
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35548875.5 ns 35368791 ns 1.01
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36247291.5 ns 36524083.5 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32430687.5 ns 31361542 ns 1.03
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40776042 ns 38859354 ns 1.05
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148827666 ns 148171584 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 152471625 ns 157709333 ns 0.97
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 135828541 ns 137631188 ns 0.99
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 224259958 ns 150161812.5 ns 1.49
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120283062 ns 111509000 ns 1.08
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173757375 ns 181918104.5 ns 0.96
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148381833 ns 143432542 ns 1.03
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 100995854 ns 94189375.5 ns 1.07
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 468476625 ns 497837834 ns 0.94
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466581667 ns 512628166 ns 0.91
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 438033125 ns 440382167 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 758068771 ns 678623500 ns 1.12
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 656498666 ns 644936208 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 639464917 ns 676380021 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 572772729.5 ns 603539166.5 ns 0.95
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 867522166 ns 727707084 ns 1.19
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1241166.5 ns 1357667 ns 0.91
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 960584 ns 795375 ns 1.21
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 985604 ns 995750 ns 0.99
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2040750 ns 2104875 ns 0.97
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 3033584 ns 2829624.5 ns 1.07
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2618542 ns 2513417 ns 1.04
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2633875 ns 2616854 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3767750 ns 3785792 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5830292 ns 5815812.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5796375 ns 5906250 ns 0.98
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5804458 ns 5802125 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2978917 ns 2884250 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7500 ns 8084 ns 0.93
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 6333 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6209 ns 7042 ns 0.88
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10333 ns 10541 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212708 ns 213645.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220542 ns 255312.5 ns 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223542 ns 220667 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 208708 ns 205667 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 297468334 ns 293659209 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 215016959 ns 259757583 ns 0.83
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 193569000 ns 158085937.5 ns 1.22
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 311798792 ns 293331625 ns 1.06
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1238998917 ns 1087845916.5 ns 1.14
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 901957166.5 ns 950749875 ns 0.95
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 825878542 ns 812442750 ns 1.02
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1319998292 ns 1143172250 ns 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5542 ns 5542 ns 1
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5834 ns 6084 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6708 ns 7020.5 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 4958 ns 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7083 ns 7292 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7333 ns 7542 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7875 ns 7541 ns 1.04
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7042 ns 7042 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 583 ns 583 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 541 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9083 ns 9167 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 8666 ns 9250 ns 0.94
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9292 ns 9416 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8583 ns 9084 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351792 ns 380958 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351708 ns 352792 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352375 ns 352625 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 354000 ns 350834 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 827667 ns 832062.5 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 779562.5 ns 827458 ns 0.94
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 778208 ns 775062.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 824354.5 ns 823188 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 337833 ns 335250 ns 1.01
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 342521 ns 327208 ns 1.05
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 452875 ns 451729 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 11687.5 ns 12458 ns 0.94
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 713208.5 ns 711291 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 736500 ns 735541 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1010250 ns 1004041 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 27208.5 ns 26666 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 381792 ns 375354.5 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 354187 ns 336667 ns 1.05
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 441708 ns 439084 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 31083 ns 28875 ns 1.08
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 731646 ns 720625 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 785667 ns 804333.5 ns 0.98
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1027917 ns 1027667 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 91083 ns 104125 ns 0.87
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3542 ns 3500 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3458 ns 3833 ns 0.90
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3583 ns 3750 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3542 ns 3334 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4167 ns 4208 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4250 ns 4375 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4500 ns 4583 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4208 ns 4375 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3375 ns 3625 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3917 ns 3917 ns 1
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4084 ns 4667 ns 0.88
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3917 ns 3729 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8375 ns 8479.5 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8167 ns 8750 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8584 ns 8584 ns 1
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8500 ns 1
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204791 ns 206959 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210875 ns 212916 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211541 ns 214834 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 202083 ns 200708 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 600417 ns 649583.5 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 627875 ns 623333 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 630312 ns 622250 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583542 ns 613479 ns 0.95
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 1010270.5 ns 1236916 ns 0.82
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1015521 ns 1300167 ns 0.78
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 949979.5 ns 1184250 ns 0.80
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 909416 ns 1155667 ns 0.79
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4557687.5 ns 4569500 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4722959 ns 4789500 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4470333.5 ns 4471334 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 4443646.5 ns 4277000 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3334 ns 3500 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3500 ns 3708 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4125 ns 4500 ns 0.92
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3625 ns 3084 ns 1.18
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7292 ns 7542 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7167 ns 7500 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7167 ns 7458 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7458.5 ns 6750 ns 1.10
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1562000 ns 1661083 ns 0.94
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1179000 ns 1212459 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1346417 ns 1388375 ns 0.97
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2481104 ns 2367291.5 ns 1.05
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12361833 ns 12379333 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9575979 ns 9634187.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9245041 ns 9303250.5 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18149645.5 ns 17994791.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17389625 ns 17400125 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14446583 ns 14391542 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14298208.5 ns 14366500 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21068500 ns 20976166.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 88500 ns 134083 ns 0.66
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 99167 ns 134145.5 ns 0.74
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 91917 ns 140125 ns 0.66
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 90708.5 ns 133834 ns 0.68
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2074916 ns 2067833 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2029541 ns 2021792 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1761250 ns 2040375 ns 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2035041.5 ns 2038229.5 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 2084 ns 1250 ns 1.67
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 2666 ns 1542 ns 1.73
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3583.5 ns 3500 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 1916 ns 1041.5 ns 1.84
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2625 ns 2792 ns 0.94
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2875 ns 2791 ns 1.03
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2917 ns 2834 ns 1.03
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2834 ns 2687.5 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 8084 ns 0.91
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 6416 ns 0.94
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6083 ns 6916 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10583 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212333.5 ns 224958 ns 0.94
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220563 ns 230000 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223084 ns 220875 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 208417 ns 206709 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3667 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14709 ns 14625 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14625 ns 14250 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14541 ns 14625 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14292 ns 14458 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 94500 ns 145791 ns 0.65
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 93916.5 ns 141583 ns 0.66
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 96125 ns 142459 ns 0.67
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 95625 ns 141375.5 ns 0.68
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1950959 ns 1928792 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1918895.5 ns 1919959 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1651334 ns 1933062.5 ns 0.85
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1942375 ns 1928146 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 881833 ns 870875 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 830792 ns 819625 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1225417 ns 1235083 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 944312.5 ns 966104.5 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2742708 ns 2825084 ns 0.97
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2522750 ns 2525875 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3329959 ns 3358499.5 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3361458 ns 3396917 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15166.5 ns 14750 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17000 ns 15375 ns 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16583 ns 16875 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15667 ns 14958 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 214666 ns 261458 ns 0.82
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 224541.5 ns 259875 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216208 ns 216042 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217645.5 ns 220500 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 219500 ns 219729.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 220000 ns 220479 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 221167 ns 223250 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 220834 ns 221354.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 495958 ns 510541.5 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 507958 ns 506917 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 498625 ns 498667 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 506541 ns 512312.5 ns 0.99
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 4166.5 ns 3667 ns 1.14
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4312.5 ns 4833 ns 0.89
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 4583 ns 5167 ns 0.89
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 4625 ns 3979.5 ns 1.16
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7187.5 ns 7625 ns 0.94
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7292 ns 7375 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7229.5 ns 7333 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7625 ns 7250 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17417 ns 18792 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19292 ns 17542 ns 1.10
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18625 ns 19791 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18500 ns 18125 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219083.5 ns 252708 ns 0.87
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 211959 ns 213500 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213521 ns 214541 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213208 ns 214000 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4250 ns 4229.5 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4334 ns 4916 ns 0.88
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4750 ns 5417 ns 0.88
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4375 ns 4291.5 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10417 ns 10750 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10750 ns 11042 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10500 ns 10875 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10500 ns 10042 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 2958 ns 3292 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3417 ns 3625 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 3959 ns 4167 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3542 ns 3459 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7291 ns 7750 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7458 ns 7791 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583 ns 7666 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7625 ns 7292 ns 1.05
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23616833 ns 23600875 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34076542 ns 43903313 ns 0.78
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37648750 ns 37710791.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 35355896 ns 34490521 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 185118750 ns 191551625 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 161569416 ns 186643917 ns 0.87
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146021041.5 ns 145792667 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 274915208 ns 271888584 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 273527291 ns 292672562 ns 0.93
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 244066854 ns 266647854 ns 0.92
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 231262500 ns 299377291.5 ns 0.77
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 325681645.5 ns 325821396 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183916.5 ns 184041 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 184479.5 ns 182292 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 183709 ns 184917 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 185125 ns 183667 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 635250 ns 632125 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 590375 ns 596250 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 586375 ns 589146 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 586875.5 ns 634646 ns 0.92
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3912854 ns 3923584 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3922688 ns 4065250 ns 0.96
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3534875 ns 3605250 ns 0.98
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4683208 ns 4910271 ns 0.95
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17461333 ns 16427166.5 ns 1.06
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17877604 ns 17546270.5 ns 1.02
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16535333 ns 15424750 ns 1.07
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 20876542 ns 41363334 ns 0.50
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 583 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 541 ns 625 ns 0.87
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8875 ns 9500 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9458 ns 9500 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9167 ns 9792 ns 0.94
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9084 ns 9541 ns 0.95
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 653952292 ns 513820542 ns 1.27
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 393857103.5 ns 535432083 ns 0.74
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 328714250 ns 355647999.5 ns 0.92
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 759532875 ns 672007125 ns 1.13
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1886540417 ns 1968156417 ns 0.96
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1638767625 ns 1778975000 ns 0.92
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1505416479 ns 1508167229 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2232982666.5 ns 2144133562.5 ns 1.04
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1645500 ns 1659562.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1196083 ns 1222625 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1372166 ns 1402292 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2490500 ns 2420750 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12742021.5 ns 12714958 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9937333.5 ns 10033625 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9670291 ns 9669250 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18551458 ns 18444395.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17729729 ns 17720021 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14747250 ns 14836625 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14539958 ns 14593959 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21491875 ns 21470916.5 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26250 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26292 ns 26291 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26250 ns 26208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67416 ns 67292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67167 ns 67166 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 68042 ns 67437.5 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66917 ns 67125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203916 ns 206334 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209500 ns 212084 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 208375 ns 211708 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199583 ns 200042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 615979 ns 652229 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 622458.5 ns 673167 ns 0.92
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 625042 ns 623750.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 628771 ns 594625 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 654750 ns 689375 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 648792 ns 686646 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 639250 ns 603125.5 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 553000 ns 595854 ns 0.93
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2255292 ns 2275292 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2216833.5 ns 2318250 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2230625 ns 2234167 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2261625 ns 2258041 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17479.5 ns 17208 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18166 ns 16708.5 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 18334 ns 18542 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18542 ns 26209 ns 0.71
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 230250 ns 233041 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218666.5 ns 238708 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220145.5 ns 220895.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 225083.5 ns 247479 ns 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 541 ns 583 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9625 ns 9917 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9500 ns 9916.5 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9583 ns 10166 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9583 ns 9709 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5166 ns 5375 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5667 ns 5667 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6291.5 ns 7208 ns 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5625 ns 5166 ns 1.09
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6959 ns 7750 ns 0.90
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7709 ns 7417 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7125 ns 7750 ns 0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7208 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2292 ns 2459 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2125 ns 2375 ns 0.89
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2333 ns 2250 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2167 ns 2042 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6354.5 ns 6667 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6500 ns 6625 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6583.5 ns 6667 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6459 ns 6500 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 748750 ns 781188 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746708 ns 762250 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 749375 ns 746542 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 749125 ns 746084 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 794125 ns 815833 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 775500 ns 816958.5 ns 0.95
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 775812.5 ns 775937.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 794500.5 ns 810604.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7458 ns 8042 ns 0.93
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6084 ns 6417 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5583 ns 6958 ns 0.80
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10541 ns 10625 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 231542 ns 265000 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 231875 ns 268728.5 ns 0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229604 ns 229125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215187.5 ns 217229 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10166.5 ns 10375 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10416 ns 10646 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10479 ns 11292 ns 0.93
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10417 ns 10125 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25083.5 ns 24542 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23916 ns 25354.5 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24625 ns 25500 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25000 ns 24333 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106424375 ns 106479791.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117279208.5 ns 126041750 ns 0.93
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120424354 ns 120943833 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117916208 ns 117512916.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 397131541.5 ns 384219250 ns 1.03
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 366183958 ns 372791166.5 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 355277020.5 ns 338002625 ns 1.05
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 545563875.5 ns 471273750 ns 1.16
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 609770291 ns 803612958.5 ns 0.76
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 756955334 ns 771462084 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 745569813 ns 812264500 ns 0.92
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 607706416.5 ns 607987313 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6875 ns 7042 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9229 ns 7208 ns 1.28
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8833 ns 8166.5 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7500 ns 6750 ns 1.11
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14375 ns 14333 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13750 ns 14750 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14667 ns 14000 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13542 ns 13792 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5959 ns 6209 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6354.5 ns 6417 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7083 ns 7500 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6042 ns 5792 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12666 ns 12875 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12917 ns 12583 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12916 ns 13125 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12292 ns 12333 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5875 ns 5042 ns 1.17
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 5937.5 ns 5625 ns 1.06
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 5812.5 ns 6250 ns 0.93
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 6000 ns 5958 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15375 ns 15666 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 18229.5 ns 15709 ns 1.16
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 15625 ns 15583 ns 1.00
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15834 ns 15458 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 334 ns 417 ns 0.80
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 417 ns 375 ns 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 333 ns 1.25
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6291 ns 6542 ns 0.96
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6541 ns 6542 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6375 ns 6542 ns 0.97
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6042 ns 6208 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5958 ns 5875 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5917 ns 5916 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6083 ns 5833 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5833 ns 5834 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 20895.5 ns 22729 ns 0.92
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21084 ns 21625 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21334 ns 21667 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 20875 ns 20854.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 145167 ns 192437 ns 0.75
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145333 ns 194875 ns 0.75
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 147791 ns 190958 ns 0.77
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 146250.5 ns 198042 ns 0.74
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1351583 ns 1364250 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1324833.5 ns 1373333.5 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1269708 ns 1330458 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1342020.5 ns 1326229.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24854 ns 23125 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24750 ns 23000 ns 1.08
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24083.5 ns 24041 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23041.5 ns 21667 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 130333 ns 131208 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 131875 ns 183125.5 ns 0.72
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 120583 ns 118667 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 127250 ns 180917 ns 0.70
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6833 ns 0.93
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6750 ns 6667 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6167 ns 6833 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6166 ns 6417 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4250 ns 4542 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4583 ns 5229.5 ns 0.88
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5000 ns 5125 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4666 ns 4666 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9917 ns 10334 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10000 ns 10625 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10458 ns 10375 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10250 ns 10375 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1584 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5625 ns 6042 ns 0.93
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6000 ns 6000 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5792 ns 5959 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5666 ns 5625 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6809750 ns 6837750 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6375834 ns 6418708 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6505250 ns 6547416.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7653125.5 ns 7628667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24098271 ns 24126020.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21313750 ns 21396208 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21034292 ns 20992000 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29936333.5 ns 29707541 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37354916.5 ns 48614958 ns 0.77
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45524125 ns 45739708 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45728625 ns 45440458 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 38256604.5 ns 38260167 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5708 ns 5917 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5916 ns 6083 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6542 ns 7041 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5958 ns 5708 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8792 ns 8583 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8375 ns 8959 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8792 ns 8417 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8042 ns 8125 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1544521 ns 1564625 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1274291.5 ns 1276958 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1619792 ns 1632792 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2113874.5 ns 2147187.5 ns 0.98
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7917042 ns 7938667 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6631541 ns 6675417 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7090646 ns 7179229.5 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10525708 ns 10466792 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 363667 ns 375979.5 ns 0.97
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 373917 ns 356791.5 ns 1.05
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 456000 ns 453958 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 24312 ns 31791.5 ns 0.76
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 737791.5 ns 724250 ns 1.02
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 796895.5 ns 820708 ns 0.97
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1063396 ns 1064167 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 91145.5 ns 93125 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397459 ns 413500 ns 0.96
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287666 ns 220417 ns 1.31
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287958 ns 305958 ns 0.94
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 751208 ns 758417 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 667375 ns 664291 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 532500 ns 464750 ns 1.15
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 533459 ns 524625 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 974250 ns 971875 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 677250 ns 660125 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 646333 ns 688833 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 555812.5 ns 599208.5 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 589334 ns 676041 ns 0.87
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2506042 ns 2465396 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2452187.5 ns 2549750 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2421083 ns 2454750 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2509083.5 ns 2436396 ns 1.03
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 3042 ns 2084 ns 1.46
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 3500 ns 2500 ns 1.40
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 3709 ns 4584 ns 0.81
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 2834 ns 2000 ns 1.42
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5458 ns 5541 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5625 ns 5625 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5625 ns 5541 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5583 ns 5459 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1459917 ns 1479917 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1499291 ns 1515750 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1501417 ns 1523083 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1439583 ns 1448834 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5106812.5 ns 5170937.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5286437.5 ns 5319792 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5284041.5 ns 5296208 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4996333.5 ns 4989229.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3708 ns 3666 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3666 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15250 ns 15458 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15417 ns 15292 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15416 ns 15500 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15000 ns 15250 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71500 ns 96375 ns 0.74
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71333 ns 104834 ns 0.68
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 70542 ns 94000 ns 0.75
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71250 ns 92875 ns 0.77
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 319958 ns 319291 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 318333 ns 326792 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 318208 ns 317083 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 321834 ns 317375 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 1083 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 959 ns 1000 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7916 ns 8458 ns 0.94
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8208 ns 8167 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 8500 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7667 ns 8000 ns 0.96
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 514834 ns 536458.5 ns 0.96
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 490208 ns 514770.5 ns 0.95
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 567542 ns 583167 ns 0.97
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 218520.5 ns 177291.5 ns 1.23
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1371833 ns 1430708 ns 0.96
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1457062.5 ns 1491625 ns 0.98
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1755667 ns 1790583 ns 0.98
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 909250 ns 862187.5 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6166 ns 6750 ns 0.91
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6708 ns 6458 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6125 ns 6666 ns 0.92
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6083 ns 6584 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1721334 ns 1721104 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1725146 ns 1775187.5 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1724500 ns 1796833.5 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1728229.5 ns 1760583 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4358375 ns 4395271 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4376792 ns 4422959 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4335333 ns 4375792 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4390375 ns 4339937.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6750 ns 16708.5 ns 0.40
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6625 ns 7042 ns 0.94
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6875 ns 8000 ns 0.86
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6542 ns 7125 ns 0.92
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 32500 ns 52520.5 ns 0.62
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 50895.5 ns 74791 ns 0.68
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 32875 ns 33083 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 49729 ns 43000 ns 1.16
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17937.5 ns 17333 ns 1.03
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 18042 ns 17875 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 18125 ns 18229.5 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 18458 ns 17708 ns 1.04
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53208 ns 53541.5 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53250 ns 53500 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53250 ns 53500 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53562.5 ns 53542 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75709 ns 102541.5 ns 0.74
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75291 ns 109541 ns 0.69
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75208 ns 99500 ns 0.76
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75250 ns 97875 ns 0.77
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 330270.5 ns 328250 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 328625 ns 333084 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 325083 ns 324125 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 329042 ns 324041 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1486375 ns 1504750 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1526375 ns 1541208 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1527375 ns 1549666 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1464666 ns 1472416.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5175375 ns 5156854.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5310021 ns 5311833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4950479 ns 5311062.5 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5010146 ns 4595917 ns 1.09
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28208 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28375 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28292 ns 28125 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28250 ns 28208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66292 ns 66917 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66375 ns 66542 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66459 ns 67750 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66459 ns 66500 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1396208.5 ns 1505459 ns 0.93
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1137042 ns 959542 ns 1.18
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1061959 ns 1085458.5 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2245417 ns 2196437.5 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 2966209 ns 3106250 ns 0.95
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2741250 ns 2641667 ns 1.04
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2597667 ns 2753084 ns 0.94
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3844125 ns 3807583 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7918709 ns 7926875 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7905417 ns 8046333.5 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7547354 ns 7926812.5 ns 0.95
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4916042 ns 4419125 ns 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80583 ns 134333 ns 0.60
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81458 ns 140333 ns 0.58
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81541 ns 135750 ns 0.60
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80709 ns 136000 ns 0.59
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2026042 ns 2042250 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2026125.5 ns 2053604 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1719750 ns 2031125 ns 0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2018208 ns 2012625 ns 1.00

This comment was automatically generated by a workflow using github-action-benchmark.
