Running x86 AVX2 binary on Apple Silicon with Rosetta 2 (macOS 15) #5707

Closed
twoplan opened this issue Dec 6, 2024 · 15 comments

@twoplan

twoplan commented Dec 6, 2024

Describe the issue

zsh: illegal hardware instruction ./stockfish-macos-x86-64-avx2 compiler

Expected behavior

The AVX2 binary should run on macOS 15 under Rosetta 2.

Steps to reproduce

Run macOS x86-AVX2 Stockfish 17 in terminal on an Apple Silicon Mac with macOS 15.x
./stockfish-macos-x86-64-avx2 compiler

Anything else?

With Rosetta 2 it should now be possible to run x86 binaries that use AVX2 instructions on Apple Silicon Macs.

But on an M4 Mac running macOS 15.1.1 I get this in the terminal:

zsh: illegal hardware instruction ./stockfish-macos-x86-64-avx2 compiler

Operating system

macOS

Stockfish version

official Stockfish 17 x86-AVX2 for macOS

@Disservin
Member

Disservin commented Dec 6, 2024

Rosetta translates all x86_64 instructions, but it doesn’t support the execution of some newer instruction sets and processor features, such as AVX, AVX2, and AVX512 vector instructions.

@Disservin
Member

Why are you even doing this? Does our M1 release not work on your M4?

@Disservin
Member

Ah, I see there are some articles about macOS Sequoia's Rosetta being able to support AVX2. It would be good to know whether it crashes on an AVX2 instruction or on something else.
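
One way to narrow this down without a debugger is to ask which extensions the CPU (or the CPUID that Rosetta emulates) advertises. The following is a minimal sketch, assuming GCC or Clang on an x86 build; `advertised_features` is a hypothetical helper, not part of Stockfish, and what Rosetta actually reports here has not been verified in this thread.

```cpp
#include <string>

// Hypothetical helper (not Stockfish code): lists which x86 extensions the
// possibly emulated CPU advertises through CPUID.
// Assumption: GCC or Clang, which provide __builtin_cpu_supports on x86.
std::string advertised_features() {
    std::string out;
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();  // harmless if the runtime already initialized it
    out += __builtin_cpu_supports("popcnt") ? "popcnt=1 " : "popcnt=0 ";
    out += __builtin_cpu_supports("avx2")   ? "avx2=1 "   : "avx2=0 ";
    out += __builtin_cpu_supports("bmi")    ? "bmi=1 "    : "bmi=0 ";
    out += __builtin_cpu_supports("bmi2")   ? "bmi2=1"    : "bmi2=0";
#else
    out = "not an x86 build";  // e.g. the native arm64 binary
#endif
    return out;
}
```

If a translated x86 binary printed `avx2=1` but `bmi=0`, that would point at a non-AVX2 instruction as the cause of the crash.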

@twoplan
Author

twoplan commented Dec 6, 2024

The provided arm64 version runs perfectly and fast on macOS!

I was curious about the speed of x86 binaries under Rosetta 2, and expected that the avx2 version could be the fastest of the three x86 binaries (as on Intel or AMD).

The single-core bench gives these results for Stockfish 17 on my Mac:

arm64:		1830739 Nodes/second
x86:		 647787 Nodes/second
x86_popcnt:	1085328 Nodes/second
x86_avx2:	<illegal instruction>

@Disservin
Member

Can you run it with lldb and get a stack trace?

@twoplan
Author

twoplan commented Dec 7, 2024

lldb ./stockfish-macos-x86-64-avx2                        
(lldb) target create "./stockfish-macos-x86-64-avx2"
Current executable set to '/Users/max/Downloads/stockfish/stockfish-macos-x86-64-avx2' (x86_64).
(lldb) run
Process 37175 launched: '/Users/max/Downloads/stockfish/stockfish-macos-x86-64-avx2' (x86_64)
warning: libobjc.A.dylib is being read from process memory. This indicates that LLDB could not read from the host's in-memory shared cache. This will likely reduce debugging performance.

Process 37175 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0)
    frame #0: 0x000000010003d831 stockfish-macos-x86-64-avx2`___lldb_unnamed_symbol332 + 3089
stockfish-macos-x86-64-avx2`___lldb_unnamed_symbol332:
->  0x10003d831 <+3089>: blsrq  %rsi, %rsi
    0x10003d836 <+3094>: je     0x10003d7a0    ; <+2944>
    0x10003d83c <+3100>: jmp    0x10003d825    ; <+3077>
    0x10003d83e <+3102>: popq   %rbx
Target 0: (stockfish-macos-x86-64-avx2) stopped.

@Disservin
Member

Ah, I kind of expected this. We have been adding BMI1 to the compiler flags for avx2 builds since #4202, and blsr, the faulting instruction in your trace, is a BMI1 instruction. Pretty much any CPU that has AVX2 also has BMI1, but since Apple only translates x86 to ARM, Rosetta evidently doesn't cover it.
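
For context, BLSR (the `blsrq` in the trace above) resets the lowest set bit of its operand, which is the same as computing `x & (x - 1)`. With BMI1 enabled, GCC and Clang typically compile that expression to a single BLSR; without it they emit a subtract and an AND. The sketch below uses hypothetical names and is not Stockfish's code; it only illustrates the kind of bit-iteration loop in which such an instruction usually appears.

```cpp
#include <cassert>
#include <cstdint>

// Clears the lowest set bit of b, the operation BLSR performs in one
// instruction. Portable: compiles with or without -mbmi.
inline std::uint64_t clear_lowest_bit(std::uint64_t b) {
    return b & (b - 1);
}

// Walks over the set bits of a 64-bit mask, clearing one per iteration,
// and returns how many bits were set.
inline int count_bits_by_clearing(std::uint64_t b) {
    int n = 0;
    while (b != 0) {
        b = clear_lowest_bit(b);
        ++n;
    }
    return n;
}
```

Because the source is identical in both cases, only the compiler flags decide whether the binary contains BLSR.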

@Disservin
Member

Since we distribute ARM binaries for Mac, I don't feel like removing this. If you still want to test the speed, you can remove -mbmi from the Makefile, recompile, and run your test.
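
After editing the Makefile, it can be worth confirming that the compiler really no longer targets BMI1, since other flags (for example a `-march` setting) can enable it as well. A minimal sketch, assuming GCC or Clang, which predefine `__BMI__` when BMI1 code generation is active; `compiled_isa` is a hypothetical helper and not part of Stockfish.

```cpp
#include <string>

// Hypothetical helper (not Stockfish code): reports which instruction-set
// extensions this translation unit was compiled for, based on the macros
// GCC and Clang predefine for the corresponding -m flags.
std::string compiled_isa() {
    std::string s;
#ifdef __AVX2__
    s += "AVX2 ";
#endif
#ifdef __BMI__
    s += "BMI1 ";   // set by -mbmi
#endif
#ifdef __BMI2__
    s += "BMI2 ";   // set by -mbmi2
#endif
#ifdef __POPCNT__
    s += "POPCNT ";
#endif
    if (s.empty())
        s = "baseline";
    return s;
}
```

A rebuilt binary intended for Rosetta should list AVX2 but not BMI1.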

@twoplan
Author

twoplan commented Dec 7, 2024

Thanks for looking into it!

I just compiled it on my Intel Mac.
Kind of disappointing that popcnt is faster than avx2 under Rosetta 2.

./stockfish compiler
Stockfish 17 by the Stockfish developers (see AUTHORS file)

Compiled by                : clang++ 16.0.0 on Apple
Compilation architecture   : x86-64-avx2
Compilation settings       : 64bit AVX2 SSE41 SSSE3 SSE2 POPCNT
Compiler __VERSION__ macro : Apple LLVM 16.0.0 (clang-1600.0.26.4)

./stockfish bench > /dev/null
===========================
Total time (ms) : 1729
Nodes searched  : 1484730
Nodes/second    : 858721
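
To put this figure next to the earlier single-core results, the arithmetic can be sketched as follows. The helper names are hypothetical, and the integer division is an assumption about how the printed Nodes/second relates to the two lines above it (it does reproduce the number shown).

```cpp
#include <cstdint>

// Nodes per second from a bench run: nodes divided by elapsed time,
// with the time given in milliseconds (integer division, as assumed above).
constexpr std::uint64_t nodes_per_second(std::uint64_t nodes, std::uint64_t ms) {
    return nodes * 1000 / ms;
}

// Throughput of build `a` relative to build `b`, in percent.
constexpr double relative_percent(std::uint64_t a, std::uint64_t b) {
    return 100.0 * static_cast<double>(a) / static_cast<double>(b);
}

// 1484730 nodes in 1729 ms reproduces the reported 858721 nodes/second.
static_assert(nodes_per_second(1484730, 1729) == 858721,
              "matches the bench output");
```

With the numbers from this thread, `relative_percent(858721, 1085328)` is about 79.1 and `relative_percent(858721, 1830739)` is about 46.9, so the self-compiled avx2 build reaches roughly 79% of the popcnt binary and 47% of the native arm64 binary. The avx2 binary was built locally without -mbmi while the others are official releases, so the comparison is only approximate.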

@Disservin
Member

Yeah, Apple's translation isn't the best, nor is it even correct. See https://github.com/carsongoodwin32/rosetta2_avx_dive (12% slower and a wrong result).

@RogerThiede

> Yeah apple's translation isn't the best nor is it even correct

I don't know whether the release behaves differently, but it should be pointed out that this reference analyzed a pre-release (beta) version of the translation. It would certainly be noteworthy to claim that a final release produces wrong results, but I haven't found anyone claiming that yet.

Disservin added the build label Dec 9, 2024

@Disservin
Member

Disservin commented Dec 9, 2024

> It would certainly be noteworthy to claim that a final release produces wrong results, but I haven't discovered anyone claiming that yet.

@RogerThiede I just ran the test code from the linked repo on my M1 with macOS 15.1.1 (24B91). The AVX2 code path always gave a result that differs from the SSE2 and AVX paths; this did not happen on my reference AMD system.

Run 1:
SSE2 Int Sum Result: -874044994 Time: 1.94254 seconds
AVX Int Sum Result: -874044994 Time: 1.91181 seconds
AVX2 Int Sum Result: -1425102895 Time: 1.13537 seconds

Run 2:
SSE2 Int Sum Result: 1517038122 Time: 1.80133 seconds
AVX Int Sum Result: 1517038122 Time: 2.0612 seconds
AVX2 Int Sum Result: -1635806147 Time: 1.80573 seconds

Run 3:
SSE2 Int Sum Result: 1760641091 Time: 2.07865 seconds
AVX Int Sum Result: 1760641091 Time: 2.12589 seconds
AVX2 Int Sum Result: 300975694 Time: 0.843802 seconds

Run 4:
SSE2 Int Sum Result: 2004229182 Time: 1.09681 seconds
AVX Int Sum Result: 2004229182 Time: 0.455289 seconds
AVX2 Int Sum Result: -2057221549 Time: 0.526167 seconds

Run 5:
SSE2 Int Sum Result: 295671414 Time: 0.802427 seconds
AVX Int Sum Result: 295671414 Time: 0.471281 seconds
AVX2 Int Sum Result: 939717340 Time: 0.535566 seconds

Run 6:
SSE2 Int Sum Result: 1857177385 Time: 1.61133 seconds
AVX Int Sum Result: 1857177385 Time: 0.514061 seconds
AVX2 Int Sum Result: 2054205932 Time: 0.510901 seconds

Run 7:
SSE2 Int Sum Result: -973955045 Time: 1.18097 seconds
AVX Int Sum Result: -973955045 Time: 0.456183 seconds
AVX2 Int Sum Result: -1656341324 Time: 0.537099 seconds

Run 8:
SSE2 Int Sum Result: 1612452885 Time: 1.80013 seconds
AVX Int Sum Result: 1612452885 Time: 0.549086 seconds
AVX2 Int Sum Result: -806898896 Time: 0.651567 seconds

Run 9:
SSE2 Int Sum Result: -193770186 Time: 1.14623 seconds
AVX Int Sum Result: -193770186 Time: 0.471176 seconds
AVX2 Int Sum Result: -487530476 Time: 0.527587 seconds

Run 10:
SSE2 Int Sum Result: -1902318602 Time: 1.17754 seconds
AVX Int Sum Result: -1902318602 Time: 0.457959 seconds
AVX2 Int Sum Result: 361933310 Time: 0.519264 seconds

---------------------------Average of 10 Runs-------------------------------
SSE2 vs AVX: 40.361% runtime difference
SSE2 vs AVX2: 48.2491% runtime difference
AVX vs AVX2: 2.29604% runtime difference

I haven't checked the code in depth; maybe it relies on some undefined behavior, I don't know.

EDIT: There is an open issue on the repository stating that the test code has UB anyway.
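
The thread does not say which undefined behavior the test code contains. As an illustration only, and assuming the problem is of the common signed-overflow kind, a sum benchmark can be made well defined by accumulating in unsigned integers, which wrap modulo 2^32. Wrapping addition is associative and commutative, so a scalar loop and any lane-wise (vectorized) order must then agree, and a mismatch would point at the translation rather than at the test. The sketch below uses hypothetical names and emulates the 8 lanes of a 256-bit register in plain C++ rather than using intrinsics.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference sum. Unsigned arithmetic wraps modulo 2^32, so this is
// well defined for any input.
std::uint32_t sum_scalar(const std::vector<std::uint32_t>& v) {
    std::uint32_t s = 0;
    for (std::uint32_t x : v)
        s += x;
    return s;
}

// Emulates an 8-lane vector sum (8 x 32-bit, the width of one 256-bit
// register): accumulate per lane, then reduce the lanes and add the tail.
std::uint32_t sum_lanes8(const std::vector<std::uint32_t>& v) {
    std::uint32_t lanes[8] = {};
    std::size_t i = 0;
    for (; i + 8 <= v.size(); i += 8)
        for (std::size_t l = 0; l < 8; ++l)
            lanes[l] += v[i + l];
    std::uint32_t s = 0;
    for (std::uint32_t lane : lanes)
        s += lane;
    for (; i < v.size(); ++i)  // elements that do not fill a whole vector
        s += v[i];
    return s;
}
```

For any input, `sum_scalar(v) == sum_lanes8(v)` holds, which gives a benchmark a reference value that does not depend on the summation order.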

@mstembera
Contributor

FYI, no idea if this could be related, but we currently have a bug where search makes use of random TT data: #5503

@Disservin
Member

not related in any way

4 participants