forked from uxlfoundation/oneDNN
[DRAFT]Dnn38 arm #285
Draft: alvoron wants to merge 2,088 commits into v3.6_for_ie_master from dnn38_arm.
Commits (2,088)
ce085ff  cpu: x64: jit_reorder: add verbose messages (dzarukin)
4832ccc  benchdnn: self: replace temporary "const char *" with "std::string" (dzarukin)
67c3042  cpu: x64: fixed memory leak in jit_uni_ncsp convolution impl (dzarukin)
607a318  ngen: update PVC WAR bug workaround (petercad)
caa770a  benchdnn: inputs: graph: fix test cases related to int8/f8 add (TaoLv)
5d7ed69  Yobodovs/amx blocking heuristics fixes (#2938) (yair-obodovsky)
7d418f5  xe: ocl: fix gemm_with_po verbose dispatch message (petercad)
931cc27  ngen: downstream nGEN (rjoursler)
068b775  cpu: aarch64: add brgemm bwd data support for block size 8 and 16 (rpushkarr)
6bc6597  graph: backend: dnnl: introduce internal dnnl_sdpa op
71f1837  build: removed -fcf-protection build option for old GCC (vpirogov)
114cd34  benchdnn: add per test case timer (dzarukin)
1a2f7c3  benchdnn: add message for not found files (dzarukin)
6ffa939  benchdnn: add summary execute timer (dzarukin)
5f6951b  benchdnn: add summary create_pd and create_prim timers (dzarukin)
4107fb9  riscv64: update intrinsics (zhangfeiv0)
32f71c2  riscv64: fix clang-format error (zhangfeiv0)
37b4972  riscv64: fix clang-format error (zhangfeiv0)
190bf99  riscv64: fix clang-format error (zhangfeiv0)
f8c3ab0  riscv64: update cmake (zhangfeiv0)
5de25f3  riscv64: update cmake (zhangfeiv0)
2a35904  gpu: intel: sycl: workaround failing atomics support (mgouicem)
3e40d66  gpu: intel: sycl: level zero query fixup (mgouicem)
201b16a  xe: jit: gemm: reduce grf consumption for fp4 strategies (dyoussif)
5652be3  common, xe: sycl: improve logging when OpenCL install is missing (rjoursler)
5171c2a  gpu: intel: ocl: add s32 support for binary (yehudaorel)
e1c08d7  gpu: intel: ocl: enable s32 for binary primitive (yehudaorel)
4e44b16  tests: benchdnn: gpu: enable s32 dt in binary (yehudaorel)
6a7f559  xe: ocl: prevent double -> float literal conversion (yehudaorel)
4d61ba4  xe: ocl: prevent double -> float literal conversion fix (yehudaorel)
df1bbe2  xe: ocl: prevent double -> float clang fix (yehudaorel)
eaaf1c0  third_party: ngen: prepare for SYCL generator usage (echeresh)
a575629  gpu: enable SYCL generator for nGEN kernels (echeresh)
c667c39  xe, sycl: eliminate intermediate kernel binary (echeresh)
699de16  xe: jit: remove outdated comments (echeresh)
4c3596d  xe: jit: codegen: prevent IR -> nGEN assembly functional changes (rjoursler)
1b3890f  xe: jit: move require_signal_header into exec_cfg (rjoursler)
ff05a37  x64: brgemm: split brgemm_blocking function for tmm and vmm (ankalinin)
1498c83  x64: brgemm: update code for tmm brgemm blocking (ankalinin)
762e317  generic: sycl: Adding support for RNN FWD r2l, sum & concat (ShanoToni)
9bef39e  ngen: fix missing field initialization warning (rjoursler)
7dc74a9  cmake: limit host compiler dpcpp warning divergence (rjoursler)
dc9eca4  graph: backend: dnnl: add reshape pass to support 5D GQA
df65f8c  graph: backend: dnnl: refine check and layout propagation for gqa
879eefd  benchdnn: inputs: graph: add gqa v2 case
396fdfc  all: clean graph compiler backend (TaoLv)
61821f1  xe: ir: add 4-bit types (atkassen)
2205a94  xe: jit: codegen: remove unused parameter (atkassen)
0b17a99  xe: jit: codegen: use offset-based interfaces (atkassen)
aa20318  xe: jit: ir: adjust sizes/offsets for packed types (atkassen)
42aaded  xe: jit: ir: add assertion for sub-byte type packing (atkassen)
91c60af  xe: jit: reorder: remove hf8 workarounds (atkassen)
797b867  xe: jit: reorder: prevent scalar mov in 2d impl (atkassen)
11a8548  xe: jit: reorder: enable fp4 up-convert (atkassen)
425ae14  xe: jit: reorder: enable fp4 down-convert (atkassen)
bb6819e  xe: jit: codegen: fix dst width for asm dumping (atkassen)
633978a  xe: jit: address clang-tidy complaints (atkassen)
333ef80  graph: interface: op: matmul supports mixed data types (TaoLv)
697db52  graph: interface: op: softmax supports mixed data types (TaoLv)
60c6d5e  graph: interface: op: binary ops support mixed data types (TaoLv)
27f277c  examples: graph: sdpa: define with f32 intermediate data type (TaoLv)
fe974de  graph: backend: dnnl: ukernel sdpa only supports f32 intermediates (TaoLv)
00dbb2a  graph: backend: dnnl: pattern: sdpa: remove xf16 check from gpu pattern (TaoLv)
50e3ea3  benchdnn: inputs: graph: add sdpa cases with f32 intermediate type (TaoLv)
f19ecf9  benchdnn: inputs: graph: test f32 intermediates for implicit mask (TaoLv)
9b54354  doc: graph: op: update supported data types (TaoLv)
8a04bcc  graph: backend: dnnl: support intermediate data type in decomp kernel
aeaa73f  cpu: aarch64: default num_threads to max for acl_threadpool (Sqvid)
11aa54d  doc: graph: fusion patterns restructure (#2952) (ranukund)
c6fab66  x64: conv: add f16 jit dw conv for avx512_core_fp16 (tczeszun)
a03a5bb  xe: jit: gemm: db: reinfo a BOS and SOS strategy (dyoussif)
bebdb6b  xe: jit: gemm: handle data type alignment requirements more strictly (dyoussif)
d1c9a1b  xe: jit: gemm: db: fixup strategy alignments (dyoussif)
3bfdbfa  xe: jit: gemm: db: fixup out of regs (dyoussif)
616cbaa  xe: avoid copies (atkassen)
a6b3c47  xe: add missing ctors/dtors/assignment operators (atkassen)
71387b6  xe: remove unnecessary/dangerous moves (atkassen)
46b79d4  xe: remove unused code (atkassen)
7c65b29  xe: jit: codegen: remove dead code (atkassen)
5d9f7ad  xe: jit: ir: avoid overflow (atkassen)
3ff9c05  xe: jit: ir: use `type.packing()` interface (atkassen)
c2f590d  xe: jit: codegen: gracefully handle bad float division (atkassen)
d54fe49  xe: jit: address clang-tidy complaints (atkassen)
4d9a89c  doc: fixup fp8 support documentation (kealan-barbieri)
b04fcca  xe: jit: conv: reduce min hw for fp8 support (kealan-barbieri)
e20f9b5  doc: Add Xe2 architectures (kealan-barbieri)
d467687  xe: ocl: reorder: allow more type combinations (atkassen)
224bb94  tests: benchdnn: adjust reorder fill range for hf8 (atkassen)
28b4c27  xe: sdpa: Update configs for head sizes of 128 and 256 (umar456)
f46c8a3  xe: sdpa: move all 64 head size configurations to xe2 for LNL and BMG (umar456)
ca053d7  xe: sdpa: Refactor condition macros in init function (umar456)
e4e4818  xe: sdpa: Add new configurations for head size of 256 on xe2 (umar456)
1521345  xe: sdpa: additional sdpa config updates from expanded configuration (umar456)
d13fd38  cpu: x64: fix invalid immediate encoding in cpu_reducer (rjoursler)
b552ce7  cpu: x64: fix invalid immediate offset in avx 1x1 convolution (rjoursler)
2d41f6f  cpu: binary: disable broadcast check for select op (avmanerikar)
55d6ac5  graph: backend: dnnl: verbose log enhancement (rongzha1)
f5baa4b  scripts: verbose converter: strengthen type-hinting for attributes (atkassen)
84c7f3f  scripts: verbose converter: simplify attribute formatting (atkassen)
2074cf4  scripts: verbose converter: use "any" as default binary po tag (atkassen)
5cf9db9  cpu: remove extra size checks (ankalinin)
0b6e3ba  x64: update has_large_size function (ankalinin)
bef2e40  cpu: conv_list: add x8:s8:f16 combination (dzarukin)
132703e  benchdnn: prim_ref: use tag::any for binary and f32 for sum po (dzarukin)
7b39578  ngen: workaround for SYCL + GCC 12.3 compiler bug (petercad)
48e6b97  xe: ocl: enable ref fp4 conv (kealan-barbieri)
d30d609  src: common: limit fp4 convs to even dims (kealan-barbieri)
1dcf33e  tests: benchdnn: enable fp4 conv tests, inputs (kealan-barbieri)
5f3a9dc  xe: jit: conv: enable fp4 support (kealan-barbieri)
55b000c  src: common: add convolution pack scratchpad tag (kealan-barbieri)
c37b483  xe: ocl: ref convolution dst fp4 support (kealan-barbieri)
ef5e699  benchdnn: graph: support op kind rewrite for binary/eltwise (wzt1997)
a27c348  benchdnn: graph: improve case log with new knob (wzt1997)
a67c2ed  benchdnn: graph: inputs: use op-kind rewrite for scale in SDPA (wzt1997)
1495519  benchdnn: graph: improve doc for --op-kind knob (wzt1997)
bc555fd  benchdnn: graph: inputs: use op kind rewrite for binary op testing (wzt1997)
cba91c3  benchdnn: graph: inputs: use op kind rewrite for eltwise op testing (wzt1997)
7388893  cpu: x64: matmul: correct blocked_B layout initialization (#3007) (xuxinzen)
a474e3a  benchdnn: graph: separate mem filling and create mem from graph path (wzt1997)
06a7e82  benchdnn: graph: remove useless value for reduction (wzt1997)
dde3af7  benchdnn: graph: use default value from benchdnn for no ref mem (wzt1997)
5994eb7  benchdnn: graph: remove check for no_ref_mem (wzt1997)
4fc4e5a  fixup: build: bumped version to v3.8.0 (vpirogov)
64f78da  github: workflows: bump KyleMayes/install-llvm-action (dependabot[bot])
7584871  xe: sdpa: add configs for head_size of 512 (syurkevi)
077763a  tests: sdpa: add complex_fusion tests for head size 512 (syurkevi)
b818143  xe: sdpa: enable 32-wide block loads for DG2 (syurkevi)
ce928e6  xe: sdpa: refactor config selection to separate header (syurkevi)
bfc4cac  xe: sdpa: update configs for xe2 granularity (syurkevi)
fce8dda  xe: sdpa: address coverity issues (syurkevi)
814d0e9  xe: sdpa: enable head size 576 for f16 (syurkevi)
642d110  xe: jit: gemm: handle sub-byte 'any' tags (dyoussif)
8bbf199  xe: jit: gemm: fixup out of reg (dyoussif)
fdefc68  benchdnn: matmul: remove invalid int4 zp cases (dyoussif)
f6ed545  github: workflows: bump lukka/get-cmake from 3.31.5 to 3.31.6 (dependabot[bot])
a1e553e  scripts: verbose converter: allow post-op duck typing (atkassen)
f840512  graph: backend: dnnl: fix genindex build on NV GPU (Jiexin-Zheng)
decb08c  gtests: graph: fix incorrect layout expectation (Jiexin-Zheng)
19bfa32  graph: backend: dnnl: disable binary+sqrt fusion on NV GPU (Jiexin-Zheng)
910e36d  gtests: graph: unit: add binary+sqrt case (Jiexin-Zheng)
032bc7a  gtests: graph: unit: add compile option for ptx (Jiexin-Zheng)
41ef402  graph: backend: dnnl: fix sdpa build on NV GPU (Jiexin-Zheng)
5243796  benchdnn: graph: fix emplace (TaoLv)
69545e2  benchdnn: graph: fix naming style of deserialized_lt_t (TaoLv)
d727bbe  benchdnn: graph: fix naming style of sycl_deletor_t (TaoLv)
af78dcf  graph: utils: pm: check pointer before dereference (TaoLv)
9554374  graph: backend: dnnl: avoid unnecessary copy (TaoLv)
c960e96  src: gpu: intel: jit: gemm: add dual (src+wei) vector zero points (hidefromkgb)
b073921  gpu: intel: sycl: l0: remove dependency to OCL for atomics query (mgouicem)
cea5462  cmake, doc: add GROUP_NORMALIZATION value for ONEDNN_ENABLE_PRIMITIVE (mzhukova)
4795c31  xe: sdpa: Add support for bottom right causal mask type (umar456)
9674952  xe: sdpa: Use append instead of set for opencl argument assignment (umar456)
b08cd59  xe: sdpa: pass attn_mask_type as int using compiler definitions in ocl (umar456)
03fafc7  [FORK][FEATURE] Enable jit sse41 NxN convolution for grayscale input
8b35c43  [FORK][FEATURE] Support of strided blobs for [de]convolution and simp… (luweizhou2016)
9adee10  [FORK][FEATURE] Updated sse41 jit convolutions to support padded chan…
1e093be  [FORK][FEATURE] Introduced Depthwise and Quantization post ops
6afdaff  [FORK][FEATURE] TBB_AUTO was enabled (alexey-varyzgin)
fff5ef7  [FIX] nchw_pooling dense fix (alexey-varyzgin)
ecee1ae  [FORK][FEATURE] Enabled BWD (JIT/GEMM) FP32/BF16 Convolutions + Depthw…
17482fa  [FIX] Fixes for MKLDNN to enable LTO (ilya-lavrenov)
d3b9514  [FIX] [MSVC] Enabling SIMD functionality for VS2019
133bbd8  [FIX] Add several uni instruction wrappers into jit_generator (AlexPeskov)
26b541b  [FIX] Fix name matching with system struct 'user' in llvm-android too… (AlexPeskov)
177e84c  [FORK][FEATURE] Added JIT FP32/BF16 Softmax for arbitrary inner_size
3195eab  [FORK][FEATURE] Added support of hsigmoid, round_half_to_even, round_… (a-sidorova)
6fa3d46  [FIX] Limit applicability of is_1stconv logic for JIT FP32/BF16 AVX51…
7f90108  [FIX] [WA] Removed kernel_outside_src condition on JIT FP32/BF16 Conv…
1d9879c  [FORK][FEATURE] Added custom version of JIT DW FP32/BF16 Convolution …
cc7cdc3  [FORK][FEATURE] Asymmetric quantization for activations
48f1985  [FORK][FEATURE] Added 3D DW case support for JIT INT8 Convolutions
33c8075  [FORK][FEATURE] Added JIT AVX512/AVX2 FP32 Planar Convolution impleme…
a98a81c  [FORK][FEATURE] Binary networks support
8692be6  [FIX] Accommodating oneTBB (with hybrid cores support) that (myshevts)
e30973e  [FIX] [WA] Fixed fallback on ref conv in case exceeding scratchpad limit
805bfb2  [FORK][FEATURE] Returned old behavior for fp32 avx2 1x1 conv with dw … (antonvor)
20e65b5  [FIX] Updated SoftPlus (a-sidorova)
8d96cec  [FIX] Disable reorder JIT if both inputs and outputs are batch-strided. (IvanNovoselov)
e69c895  [FIX] Include TBB headers as system (AlexPeskov)
c9fa057  [FORK][FEATURE] nspc layout support for convolutions (luweizhou2016)
88fffb8  [FIX] set scale = 1.f in case of signed input on platforms without vnni (antonvor)
776be72  [FIX] Memory descriptor dynamism related changes (maxnick)
2320b7e  [FORK][FEATURE] Added prelu as binary post op (antonvor)
5ef9946  [FORK][FEATURE] Depthwise and Quantization post ops for Gemm Convolut… (antonvor)
ff5f753  [FORK][FIX] perf fixes for quantization post ops (antonvor)
3850908  [FIX] todo: fix assert(idx < max_idx) (antonvor)
610db3e  [FIX] [1D] Enlarge support (alexey-varyzgin)
79d3c93  [FIX] Hash utility functions were extracted to a separate module for … (maxnick)
3fe61e6  [FIX] Desc similar_to routine consider start stride (maxnick)
0df98e3  [FIX] Desc similar_to routine use stride cmp mask (maxnick)
f3fb464  [FIX] added some legacy parallel methods to fix perf issues (antonvor)
d3f0e18  [FORK][FEATURE] Migrate legacy post ops and zero points on runtime da… (luweizhou2016)
589807f  [FIX] fix ci error (luo-cheng2021)
292d3ba  [FIX] [WA] stride=2, left pad =1, kw*kh=1*1 may crash (luo-cheng2021)
cc2d3ef  [FORK][FEATURE] gemm_conv support binary post ops (luo-cheng2021)
b245a31  [FORK][FIX] prelu post ops fix (EgorDuplensky)
86f9ea2  [FORK][FEATURE] fork dw conv support binary postops (luo-cheng2021)
039b72c  [FORK][FEATURE] gemm bf16 support binary postops & sse4.1 1x1 binary t… (luo-cheng2021)
ada2d14  [FORK][FEATURE] avx512 fork bf16 dw support binary postops (luo-cheng2021)
6541345  [FIX] fork dw conv may overflow on width tail (luo-cheng2021)
5657cb5  [FORK][FEATURE] gemm int8 support binary postops (luo-cheng2021)
c11d0e9  [FORK][FEATURE] Add log to jit dump code (luweizhou2016)
7b20140  [FIX] [WA] Disabled weights md transpose in FC to prevent perf degrad…
83d2dc5  [FIX] Remove average pooling exclude padding limitation (EgorDuplensky)
5b8d7f0  [FIX] Added support for exceptions during code generation (lohika-denis-kotov)
e8746ba  [FIX] fix cpu convolution qdq testcase fail issue when using scratchpad (liubo-intel)
1bdec8d  [FIX] CPU: x64: fix issue in eltwise post ops to allow multi-instance… (usstq)
eff8e6d  [FORK][FEATURE] Add jit debug trace log dump in GCC debug mode (usstq)
4d9f59b  [FIX] Fix seg fault in parallel function with ITT build (EgorDuplensky)
6befa4d  [FIX] Add option to explicitly disable XBYAK_NO_EXCEPTION (EgorDuplensky)
68504db  [FIX] Extend AMX deconv to support oscale+eltwise+eltwise post ops. (luweizhou2016)
bc63709  [FIX] Fixed compilation for 32bits (ilya-lavrenov)
f7265c6  [FORK][FEATURE] jit_uni_reorder: relaxed isa condition to enable FP16… (antonvor)
436d88b  [FORK][FEATURE] cpu: Unify oc_block for inner product with heuristic (luweizhou2016)
0a0c847  [FIX][WA] Apply recorder WA caused by compiler issue on AVX2 windows … (luweizhou2016)
5c7d70c  [FORK][FIX][x64] Refactor avx2 binary PReLU and fix reg conflicts (maxnick)
823850e  [Fork][Fix] Deconv update the limitation. (luweizhou2016)
ccf8276  [FIX] Fix warning caused by missing header file. (luweizhou2016)
a7d0867  [FORK][FEATURE] Cc based on master (#135) (zhwuwuwu)
58dbced  [FORK] [FEATURE] cpu: add inner product with sparse packed weights (jianan-gu)
fcf8998  [FORK][FEATURE] InnerProduct primitive: squashed weight decompression (luweizhou2016)
3d9f274  [FORK][FEATURE] Support (f32,bf16,f32) inner-product
3523e9e  [FORK][FEATURE] Enable avx2 jit reorder for bf16 data type
36a18db  [FORK][FEATURE] IP weights compression: mxfp4 (wei=f4e2m1, scales=f8e…
51a48c6  [X64] Fixed need_mask_register for eltwise injectors (a-sidorova)
f2b408c  [FORK][FIX] Fixed debug assert in jit_io_helper_t
4f0f1f3  temp fix (azhai219)
cd20975  [CPU][fix] fix matmul decompress test case for migration v3.8 (#1) (tiger100256-hu)
40d9d5c  cpu: x64: matmul: correct LDD when M == 1 (#2) (tiger100256-hu)
e4d97f3  cpu: x64: guard macro definitions to avoid potential Wundef hits (dzarukin)
54e4db8  [FORK][FIX] Fix missing 'map' include introduced by xbyak debug logic (tiger100256-hu)
5d91309  [FORK][FIX] IP weights compression: max bcast blocking computation
2f9e73e  [FORK][FEATURE] DQ IP: performance enhancements
0f9b7fb  [FORK][FIX] fix IP compress test case after migration v3.8 on avx2 (tiger100256-hu)
cd0b8d8  [FORK][FIX] fix args checking issue (tiger100256-hu)
9ad2016  [FORK][FIX] add missing override (tiger100256-hu)
fb38e19  [FORK][Fix] Fix condition compilation (tiger100256-hu)
38c7c03  [FORK][FIX] fix LLM FP16 Failed on avx512 and avx2 (tiger100256-hu)
2577b22  [FORK][FIX] fix riscv cmake issue (tiger100256-hu)
902df2e  [FORK][FIX] fix crash of convolution 1x1 int8 model on SPR (#9) (tiger100256-hu)
a64d9a0  [ARM] Hide x64 dependent implementation under macro (alvoron)
c764191  [ARM] ARM 32bits support for oneDNN (alvoron)
d8054b4  [ARM] Added ARM32 ACL kernels calls (alvoron)
328e942  [MERGE THIS INTO ANOTHER COMMIT] brgemm_matmul_matrix_B_reorder_t fix (alvoron)
ed34759  [FORK][FEATURE][ARM] Enable f16 ACL post-op (alvoron)
301e94f  [FORK][ARM][FIX] Fix ACL configuration and skip failed tests (alvoron)
1a918af  [ARM] New heuristic for winograd and gemm (ACL) (allnes)
634c2db  [ARM][FORK] Resolve float32_t type on 32-bit platforms (alvoron)
7eb6272  [ARM][FORK][FIX] Set CMAKE_CXX_STANDARD to 20 on Android (alvoron)
9b9b876  [ARM][FORK][FIX] Use FORCE_INLINE for load_float_value (aobolensk)
Review comment:
I would assume this is done for a performance improvement. A better alternative is probably to change the loading method in the kernel/implementation of interest: the switch below kills most of any benefit, since the function was never designed to be performant.
Feel free to resolve; it's just a general observation.
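The comment concerns a generic, runtime-dispatched element loader (the final commit in the list touches load_float_value). The sketch below is a minimal illustration of the point being made, with hypothetical type and function names rather than oneDNN's actual implementation: a per-element switch on the data type tends to swamp the cost of the load itself, while hoisting the dispatch out of the hot loop leaves a plain, vectorizable access.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical data-type tag (illustrative, not oneDNN's API).
enum class data_type { f32, bf16, s8, u8 };

// Generic loader: one call per element with a runtime switch on the type.
// Inside a kernel's innermost loop, the switch and the non-inlined call
// dominate the cost of the load itself.
float load_float_value_generic(data_type dt, const void *ptr, std::size_t idx) {
    switch (dt) {
        case data_type::f32: {
            float v;
            std::memcpy(&v, static_cast<const float *>(ptr) + idx, sizeof(v));
            return v;
        }
        case data_type::bf16: {
            // bf16 holds the upper 16 bits of an f32.
            const uint16_t raw = static_cast<const uint16_t *>(ptr)[idx];
            const uint32_t bits = static_cast<uint32_t>(raw) << 16;
            float v;
            std::memcpy(&v, &bits, sizeof(v));
            return v;
        }
        case data_type::s8: return static_cast<const int8_t *>(ptr)[idx];
        case data_type::u8: return static_cast<const uint8_t *>(ptr)[idx];
    }
    return 0.f;
}

// Kernel-side alternative: resolve the type once, outside the loop, so the
// per-element load compiles down to a plain, vectorizable memory access.
template <typename T>
float sum_typed(const T *src, std::size_t n) {
    float acc = 0.f;
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<float>(src[i]);
    return acc;
}

float sum(data_type dt, const void *src, std::size_t n) {
    switch (dt) { // one dispatch per buffer, not per element
        case data_type::f32: return sum_typed(static_cast<const float *>(src), n);
        case data_type::s8: return sum_typed(static_cast<const int8_t *>(src), n);
        case data_type::u8: return sum_typed(static_cast<const uint8_t *>(src), n);
        default: { // fall back to the generic per-element path (bf16 here)
            float acc = 0.f;
            for (std::size_t i = 0; i < n; ++i)
                acc += load_float_value_generic(dt, src, i);
            return acc;
        }
    }
}
```

The second form pays for the type dispatch once per buffer instead of once per element, which is the kind of kernel-side change the comment suggests; forcing inlining of the generic helper only removes the call overhead, not the per-element branch.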