Running x86 AVX2 binary on Apple Silicon with Rosetta 2 (macOS 15) #5707

twoplan · 2024-12-06T11:08:52Z

Describe the issue

zsh: illegal hardware instruction ./stockfish-macos-x86-64-avx2 compiler

Expected behavior

AVX2 binary should run on macOS 15 with Rosetta 2

Steps to reproduce

Run macOS x86-AVX2 Stockfish 17 in terminal on an Apple Silicon Mac with macOS 15.x
./stockfish-macos-x86-64-avx2 compiler

Anything else?

With Rosetta 2 it should now be possible to use x86 binaries with AVX2 instructions on Apple Silicon Macs.

But on an M4 Mac running macOS 15.1.1 I get this in the terminal:

zsh: illegal hardware instruction ./stockfish-macos-x86-64-avx2 compiler

Operating system

macOS

Stockfish version

official Stockfish 17 x86-AVX2 for macOS

Disservin · 2024-12-06T11:18:51Z

I think this is expected, see https://developer.apple.com/documentation/apple-silicon/about-the-rosetta-translation-environment#What-Cant-Be-Translated
AVX2 is not supported?

Disservin · 2024-12-06T11:19:24Z

Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions.

Disservin · 2024-12-06T11:24:10Z

Why are you even doing this? Does our M1 release not work on your M4?

Disservin · 2024-12-06T11:34:38Z

Ah, I see there are some articles about macOS Sequoia's Rosetta being able to support AVX2. It would be good to know whether it crashes on an AVX2 instruction or on something else.

twoplan · 2024-12-06T19:03:36Z

The provided arm64 version runs perfectly and fast on macOS!

I was curious about the speed of x86 binaries under Rosetta 2, and expected that the AVX2 version would be the fastest x86 binary of the three (as on Intel or AMD).

The single core bench gives these results for Stockfish 17 on my mac:

arm64:       1830739 Nodes/second
x86:          647787 Nodes/second
x86_popcnt:  1085328 Nodes/second
x86_avx2:    <illegal instruction>

Disservin · 2024-12-07T14:11:59Z

Can you run it with lldb and get a stack trace?

twoplan · 2024-12-07T15:01:47Z

lldb ./stockfish-macos-x86-64-avx2                        
(lldb) target create "./stockfish-macos-x86-64-avx2"
Current executable set to '/Users/max/Downloads/stockfish/stockfish-macos-x86-64-avx2' (x86_64).
(lldb) run
Process 37175 launched: '/Users/max/Downloads/stockfish/stockfish-macos-x86-64-avx2' (x86_64)
warning: libobjc.A.dylib is being read from process memory. This indicates that LLDB could not read from the host's in-memory shared cache. This will likely reduce debugging performance.

Process 37175 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x000000010003d831 stockfish-macos-x86-64-avx2`___lldb_unnamed_symbol332 + 3089
stockfish-macos-x86-64-avx2`___lldb_unnamed_symbol332:
->  0x10003d831 <+3089>: blsrq  %rsi, %rsi
    0x10003d836 <+3094>: je     0x10003d7a0    ; <+2944>
    0x10003d83c <+3100>: jmp    0x10003d825    ; <+3077>
    0x10003d83e <+3102>: popq   %rbx
Target 0: (stockfish-macos-x86-64-avx2) stopped.

Disservin · 2024-12-07T15:06:26Z

Ah, I kinda expected this. The faulting blsrq is a BMI1 instruction, and we have added bmi1 to the compiler flags for avx2 since #4202. Pretty much any platform that has AVX2 also has this instruction, but since Apple is translating from x86 to ARM, Rosetta apparently doesn't support it.
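
For context on the faulting instruction in the trace above: blsr is a BMI1 instruction that clears the lowest set bit of its operand, which is the same as computing x & (x - 1). The following Python sketch is illustrative only. Stockfish itself is C++, and the function names blsr and set_bit_indices are made up for this example. It shows the operation and the kind of bit-iteration loop where a compiler may emit it when -mbmi is enabled.

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit register


def blsr(x):
    """Clear the lowest set bit of a 64-bit value, like BMI1 BLSR."""
    return x & (x - 1) & MASK64


def set_bit_indices(bitboard):
    """Return the indices of all set bits, lowest first.

    Each iteration reads the lowest set bit and then clears it with
    blsr, which is the typical pattern for walking a bitboard.
    """
    indices = []
    while bitboard:
        lowest = bitboard & -bitboard          # isolate lowest set bit
        indices.append(lowest.bit_length() - 1)
        bitboard = blsr(bitboard)              # drop it and continue
    return indices


if __name__ == "__main__":
    print(bin(blsr(0b1100)))          # 0b1000
    print(set_bit_indices(0b101001))  # [0, 3, 5]
```

Without -mbmi the compiler has to produce the same result from ordinary subtract and AND instructions, which is why removing the flag avoids the illegal instruction without changing behavior.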

Disservin · 2024-12-07T15:14:08Z

Since we distribute ARM binaries for Mac, I don't feel like removing this. If you still want to test the speed, you can remove -mbmi from the Makefile, recompile, and run your test.

twoplan · 2024-12-07T15:32:26Z

Thanks for looking into it!

Just compiled it on my Intel Mac.
Kind of disappointing that popcnt is faster than avx2 with Rosetta 2.

./stockfish compiler
Stockfish 17 by the Stockfish developers (see AUTHORS file)

Compiled by                : clang++ 16.0.0 on Apple
Compilation architecture   : x86-64-avx2
Compilation settings       : 64bit AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : Apple LLVM 16.0.0 (clang-1600.0.26.4)

./stockfish bench > /dev/null
===========================
Total time (ms) : 1729
Nodes searched  : 1484730
Nodes/second    : 858721
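
To put the bench figures quoted in this thread side by side, here is a small Python sketch that expresses each one as a fraction of the native arm64 result. It assumes the rebuilt AVX2 binary was benchmarked under Rosetta 2 on the same M4 as the earlier runs, which the thread implies but does not state outright. The labels and the relative_speed helper are made up for this example.

```python
# Nodes/second figures quoted in this thread (single-core bench, Stockfish 17).
NPS = {
    "arm64 (native)": 1830739,
    "x86-64 (Rosetta 2)": 647787,
    "x86-64-popcnt (Rosetta 2)": 1085328,
    # Assumption: rebuilt without -mbmi and run on the same machine.
    "x86-64-avx2 rebuilt (Rosetta 2)": 858721,
}


def relative_speed(nps, baseline):
    """Return each entry's nodes/second as a fraction of the baseline entry."""
    return {name: value / nps[baseline] for name, value in nps.items()}


if __name__ == "__main__":
    for name, ratio in relative_speed(NPS, "arm64 (native)").items():
        print(f"{name:34s} {ratio:6.1%} of native")
```

By these numbers the translated builds reach roughly 35% (plain x86-64), 59% (popcnt) and 47% (rebuilt AVX2) of native speed, so the AVX2 build lands at about 79% of the popcnt build. A single bench run per binary is a rough comparison at best.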

Disservin · 2024-12-07T15:41:53Z

Yeah, Apple's translation isn't the best, nor is it even correct; see https://github.com/carsongoodwin32/rosetta2_avx_dive (12% slower and a wrong result).

RogerThiede · 2024-12-07T20:03:20Z

> Yeah apple's translation isn't the best nor is it even correct

I don't know whether the final release behaves differently, but it should be pointed out that this reference analyzed a pre-release (beta) version of the translation. It would certainly be noteworthy to claim that a final release produces wrong results, but I haven't discovered anyone claiming that yet.

Disservin · 2024-12-09T13:51:57Z

> It would certainly be noteworthy to claim that a final release produces wrong results, but I haven't discovered anyone claiming that yet.

@RogerThiede I just ran the test code from the linked repo on my M1 with macOS 15.1.1 (24B91). The AVX2 code path always gave different results, which did not happen on my reference AMD system.

Run 1:
SSE2 Int Sum Result: -874044994 Time: 1.94254 seconds
AVX Int Sum Result: -874044994 Time: 1.91181 seconds
AVX2 Int Sum Result: -1425102895 Time: 1.13537 seconds

Run 2:
SSE2 Int Sum Result: 1517038122 Time: 1.80133 seconds
AVX Int Sum Result: 1517038122 Time: 2.0612 seconds
AVX2 Int Sum Result: -1635806147 Time: 1.80573 seconds

Run 3:
SSE2 Int Sum Result: 1760641091 Time: 2.07865 seconds
AVX Int Sum Result: 1760641091 Time: 2.12589 seconds
AVX2 Int Sum Result: 300975694 Time: 0.843802 seconds

Run 4:
SSE2 Int Sum Result: 2004229182 Time: 1.09681 seconds
AVX Int Sum Result: 2004229182 Time: 0.455289 seconds
AVX2 Int Sum Result: -2057221549 Time: 0.526167 seconds

Run 5:
SSE2 Int Sum Result: 295671414 Time: 0.802427 seconds
AVX Int Sum Result: 295671414 Time: 0.471281 seconds
AVX2 Int Sum Result: 939717340 Time: 0.535566 seconds

Run 6:
SSE2 Int Sum Result: 1857177385 Time: 1.61133 seconds
AVX Int Sum Result: 1857177385 Time: 0.514061 seconds
AVX2 Int Sum Result: 2054205932 Time: 0.510901 seconds

Run 7:
SSE2 Int Sum Result: -973955045 Time: 1.18097 seconds
AVX Int Sum Result: -973955045 Time: 0.456183 seconds
AVX2 Int Sum Result: -1656341324 Time: 0.537099 seconds

Run 8:
SSE2 Int Sum Result: 1612452885 Time: 1.80013 seconds
AVX Int Sum Result: 1612452885 Time: 0.549086 seconds
AVX2 Int Sum Result: -806898896 Time: 0.651567 seconds

Run 9:
SSE2 Int Sum Result: -193770186 Time: 1.14623 seconds
AVX Int Sum Result: -193770186 Time: 0.471176 seconds
AVX2 Int Sum Result: -487530476 Time: 0.527587 seconds

Run 10:
SSE2 Int Sum Result: -1902318602 Time: 1.17754 seconds
AVX Int Sum Result: -1902318602 Time: 0.457959 seconds
AVX2 Int Sum Result: 361933310 Time: 0.519264 seconds

---------------------------Average of 10 Runs-------------------------------
SSE2 vs AVX: 40.361% runtime difference
SSE2 vs AVX2: 48.2491% runtime difference
AVX vs AVX2: 2.29604% runtime difference

I haven't checked the code in depth and maybe it relies on some undefined behavior, idk.

EDIT: There is an open issue on the repository stating that the test code has UB anyway.

mstembera · 2024-12-09T21:58:41Z

FYI, no idea if this could be related, but we currently have a bug where search makes use of random TT data:
#5503

Disservin · 2024-12-09T22:11:55Z

not related in any way

twoplan mentioned this issue Dec 7, 2024

Running AVX2 binaries Gcenx/macOS_Wine_builds#115

Closed

Disservin added the build label Dec 9, 2024

Disservin closed this as not planned Dec 9, 2024

Disservin mentioned this issue Dec 9, 2024

Is it possible to use AVX2 instruction set on Apple devices? / Windows 11 ARM? #5672

Closed
