The low-precision kernel selected by default is not optimal for the platform described below.
We see only a 30% performance gain for the low-precision (int8) kernel versus fp16 in multithreaded mode. Can you please confirm whether these are the results you expect? We expected a 2x performance gain.
Platform:
system_profiler SPHardwareDataType
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: Mac15,6
Chip: Apple M3 Pro
Total Number of Cores: 12 (6 performance and 6 efficiency)
Memory: 18 GB
What is the data layout used for these workloads when calling into ACL? It would help if you could build ACL with logging=1 so that we can know more details about these workloads.
Issues:
Platform:
Operating System:
Command line
Single thread:
cppthreads=0
Multithread:
cppthreads=1
Results fp16, default selected kernel: a64_hybrid_fp16_mla_6x32
Single thread, shapes: 4096x128 * 128x4096:
Multithread, shapes: 4096x128 * 128x4096:
Results int8, default selected kernel: a64_hybrid_s8s32_mmla_6x16
Single thread, shapes: 4096x128 * 128x4096:
Multithread, shapes: 4096x128 * 128x4096:
Results int8, manual selected kernel: a64_interleaved_s8s32_mmla_8x12
Single thread, shapes: 4096x128 * 128x4096:
Multithread, shapes: 4096x128 * 128x4096:
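For the 4096x128 * 128x4096 shape used in all runs above, a rough cost estimate may help frame the expected speedup. The sketch below counts arithmetic operations and the minimum bytes moved for one GEMM. The helper `gemm_cost` is hypothetical, and the element sizes are assumptions rather than measurements: fp16 inputs with fp16 output, and int8 inputs with int32 output (suggested by the `s8s32` kernel names, but not confirmed in this thread).

```python
def gemm_cost(m, k, n, in_bytes, out_bytes):
    """Return (operations, minimum bytes moved) for one (m x k) * (k x n) GEMM.

    Hypothetical helper for illustration only; it ignores caching,
    packing/interleaving, and any requantization of the output.
    """
    ops = 2 * m * n * k  # one multiply and one add per inner-product term
    traffic = (m * k + k * n) * in_bytes + m * n * out_bytes
    return ops, traffic


M, K, N = 4096, 128, 4096

# Assumed element sizes: fp16 in / fp16 out, int8 in / int32 out.
fp16_ops, fp16_bytes = gemm_cost(M, K, N, in_bytes=2, out_bytes=2)
int8_ops, int8_bytes = gemm_cost(M, K, N, in_bytes=1, out_bytes=4)

print(f"ops per GEMM:         {fp16_ops}")
print(f"fp16 bytes moved:     {fp16_bytes}")
print(f"int8 bytes moved:     {int8_bytes}")
print(f"fp16 ops per byte:    {fp16_ops / fp16_bytes:.1f}")
print(f"int8 ops per byte:    {int8_ops / int8_bytes:.1f}")
```

Under these assumptions both cases perform the same number of operations, but with K as small as 128 the output matrix dominates the memory traffic, and an int32 output is twice the size of an fp16 one. If that holds for the actual workload, it could be one reason the multithreaded gain falls short of 2x; a log from a `logging=1` build would be needed to confirm the real data types and layout.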