FEX-2211
Read the blog post at FEX-Emu's Site!
A lot of good changes this month for our users. Both performance and compatibility improvements to be had!
Segment register index optimization
This optimization has been a long time coming, sitting in pull-request limbo since back in April. It caches segment register base
addresses so the JIT can generate more efficient memory accesses. While segment registers are mostly gone with x86-64, 32-bit code uses
them fairly commonly, with some instructions using them completely implicitly. Without the cache, each of those accesses pays the overhead
of fetching the LDT and GDT entries for something that typically doesn't change very often.
With this optimization in place, we get an average of 4.3% uplift in 32-bit Bytemark. This performance improvement will be directly felt when
running 32-bit applications.
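To illustrate the idea, here is a minimal Python sketch of caching a segment base. It is not FEX's implementation: the `SegmentState` class, its fields, and the dictionary standing in for the GDT/LDT are all hypothetical names made up for this example. It only shows the general pattern of resolving the descriptor once when the selector is written and reusing the result for every memory access afterwards.

```python
from dataclasses import dataclass


@dataclass
class SegmentState:
    """Toy model of one segment register with a cached base address."""

    descriptor_table: dict  # selector -> base address (stand-in for GDT/LDT)
    selector: int = 0
    cached_base: int = 0
    table_lookups: int = 0  # counts how often the slow path runs

    def write_selector(self, selector: int) -> None:
        # Slow path: resolve the descriptor once, at the moment the guest
        # writes the segment register, and remember the base address.
        self.selector = selector
        self.table_lookups += 1
        self.cached_base = self.descriptor_table[selector]

    def effective_address(self, offset: int) -> int:
        # Fast path: every memory access just adds the cached base.
        # 32-bit guest addresses wrap at 4 GiB.
        return (self.cached_base + offset) & 0xFFFFFFFF


seg = SegmentState({0x33: 0x1000, 0x3B: 0x2000})
seg.write_selector(0x33)
addresses = [seg.effective_address(off) for off in range(100)]
```

In this sketch, one hundred accesses cost a single table lookup, which is the kind of saving the optimization is after.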
48-bit Proton Experimental fixes
For a while now FEX has worked with Proton 7.0 and older, but we have had issues running Proton Experimental in some cases.
This was a tricky problem to nail down but we had some good leads. If your ARM device was running its kernel with a 48-bit virtual address (VA) space enabled, then Proton
Experimental wouldn't work. On the other hand, if your kernel was compiled using a 36-bit VA, then it would run fine. After a few days of debugging, it
turned out that Proton/Wine allocates the lowest 32MB of its stack space, and the kernel by default allocates a 128MB space for the application.
When an application is run natively, the stack is allocated at a fixed location in memory. FEX was failing to allocate the stack at the correct
location. By the time Wine's preloader eventually ran, FEX had already allocated JIT code at that fixed location, which Wine would then map over, zeroing the
memory and breaking the FEX JIT. The preloader has done this for a long time and it was by pure chance that we weren't breaking older versions of Wine
and Proton.
With this problem fixed in FEX, we are now able to run triple-A games on AArch64, such as God of War running on a Snapdragon
888.
Even more IR changes preparing for AVX emulation
Once again this month we have an absolute ton of commits from Lioncash working on getting our JIT ready for AVX emulation: around 25 commits
towards this, with only about four more IR vector operations left to update for AVX support.
Once the JITs support 256-bit operations, we can start working towards emulating the instructions themselves.
Fix thunk crashing due to insufficient stack space
When FEX starts, we potentially need to allocate all of the memory in the 48-bit VA space that x86-64 cannot address, since x86-64 only has a 47-bit VA space.
This intersected with our stack space, which is supposed to grow automatically, but we were allocating the memory it would grow into instead. Now we give the full 128MB stack space to
FEX so it won't crash anymore.
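The arithmetic behind the 47-bit versus 48-bit mismatch can be sketched as follows. This is only an illustration under the assumption that the region to hide from the guest is everything from the end of the 47-bit range up to the end of the 48-bit range; the function name and constants are made up for this example and are not taken from FEX.

```python
GUEST_VA_BITS = 47  # user-space address bits on x86-64
HOST_VA_BITS = 48   # user-space address bits on the AArch64 hosts discussed here


def range_to_reserve(host_bits=HOST_VA_BITS, guest_bits=GUEST_VA_BITS):
    """Return the [start, end) host range that a guest with fewer VA bits
    must never be handed, or None if the host is no larger than the guest."""
    if host_bits <= guest_bits:
        return None
    return (1 << guest_bits, 1 << host_bits)


start, end = range_to_reserve()
size_tib = (end - start) // (1 << 40)
print(f"reserve {start:#x}..{end:#x} ({size_tib} TiB)")
```

Under these assumptions, half of the host's address space has to be taken out of circulation, which is why any interaction with the stack's grow region matters.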
Implements support for remaining BCD instructions
Thanks to @wannacu for implementing the remaining handful of 32-bit BCD instructions. DAA, DAS, AAA, AAS, AAM,
AAD were all missing in FEX's implementation. While BCD is rarely used these days, they still managed to find an application that uses
these instructions. With these in place, FEX should finally support all of the BCD instructions.
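For readers unfamiliar with these instructions, here is a small Python model of the two simplest ones, AAM and AAD, following their architectural definitions. It is a sketch for illustration, not FEX's code: the function names are made up, and the SF, ZF and PF updates that the real instructions perform on AL are left out.

```python
from typing import Tuple


def aam(al: int, base: int = 10) -> Tuple[int, int]:
    """ASCII Adjust AX After Multiply: split AL into two unpacked digits.

    Returns (AH, AL). Hardware raises #DE when the immediate is zero.
    """
    if base == 0:
        raise ZeroDivisionError("AAM with an immediate of 0 raises #DE")
    al &= 0xFF
    return al // base, al % base


def aad(ah: int, al: int, base: int = 10) -> Tuple[int, int]:
    """ASCII Adjust AX Before Division: fold two unpacked digits into AL.

    Returns (AH, AL); AH is always cleared.
    """
    return 0, (al + ah * base) & 0xFF


digits = aam(63)          # AL = 63 becomes AH = 6, AL = 3
packed = aad(*digits)     # and back again
```

The default base of 10 corresponds to the usual encoding of these instructions; the immediate byte can select other bases.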
Implement gpuvis timeline profiler support
While not majorly important for users, this is a very good interface for developers wanting to see why a game stuttered and how long code
took to compile. This lets us take advantage of the same interface that GPU profiling events are using to see why a game missed a vsync.
This isn't enabled by default out of concern for taking too much CPU time, so it needs to be enabled with the ENABLE_FEXCORE_PROFILER CMake
option.
Fix ROR OF flag calculation
This is a fairly minor bug since not many things rely on the OF flag specifically, but in our testing of new Proton games, we found out that Denuvo
Anti-Tamper relies on this edge-case behaviour and we messed it up. While this gets Denuvo running slightly further, it still doesn't quite work
under FEX.
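As a reference for what the flag should be, the sketch below models ROR in Python using the architectural rule that OF is the XOR of the two most-significant bits of the result. The function is an illustration only and is not taken from FEX. Architecturally OF is only defined for 1-bit rotates; this sketch applies the same formula for any non-zero count, which is an assumption made here for simplicity.

```python
def ror(value: int, count: int, width: int = 32):
    """Rotate right and return (result, cf, of).

    cf and of are None when the masked count is zero, because the
    instruction then leaves the flags untouched.
    """
    mask = (1 << width) - 1
    value &= mask
    count &= 0x3F if width == 64 else 0x1F  # hardware masks the count
    if count == 0:
        return value, None, None
    shift = count % width
    result = ((value >> shift) | (value << (width - shift))) & mask
    msb = (result >> (width - 1)) & 1
    next_msb = (result >> (width - 2)) & 1
    # CF is the bit rotated into the top; OF is the XOR of the top two bits.
    return result, msb, msb ^ next_msb


print(ror(0x00000001, 1))
print(ror(0x80000001, 1))
```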
Fixes FPREM1 C2 flag calculation
FPREM1 sets the C2 flag if the number was too large to calculate in one step, which is usually not the case. Since we are calculating the full
remainder, we should never say that we returned a partial remainder. This solves an infinite loop in Mono applications that are using SIN/COS math
operations.
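The interaction between the flag and guest code can be sketched in Python. This is a simplified model, not FEX's implementation: it uses double precision via `math.remainder` instead of the 80-bit x87 format, omits the quotient bits reported in C0/C1/C3, and the function names are made up. The second function mimics the usual guest pattern of repeating the instruction until C2 is clear, which never terminates if an emulator wrongly leaves C2 set.

```python
import math


def fprem1(st0: float, st1: float):
    """Return (remainder, c2) for an emulator that always finishes in one step.

    math.remainder implements the IEEE 754 round-to-nearest remainder.
    It raises ValueError for cases such as st1 == 0, where hardware
    would produce an invalid-operation result instead.
    """
    remainder = math.remainder(st0, st1)
    c2 = 0  # 0 means "reduction complete", 1 means "partial remainder"
    return remainder, c2


def reduce_until_complete(st0: float, st1: float, step=fprem1):
    """Guest-style loop: keep calling the step until C2 reads as clear."""
    iterations = 0
    while True:
        st0, c2 = step(st0, st1)
        iterations += 1
        if c2 == 0:
            return st0, iterations


print(reduce_until_complete(10.0, 3.0))
```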
Claim X87 transcendental ops are in range
X87 will set a flag if a program tries to operate on a value that is out of range for transcendental SIN/COS/TAN operations.
FEX-Emu doesn't actually detect these for performance reasons, so instead we claim these are always in range. While not always true, if they are out of
range then we weren't detecting them anyway. This fixes an issue where glibc would do some fixups to try and bring the value in range, resulting in invalid
results.
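The difference between the two behaviours can be sketched as below. Both functions are illustrative models with made-up names, not FEX's code. The hardware-like version follows the documented x87 rule that FSIN accepts operands strictly between -2^63 and +2^63, and otherwise sets C2 and leaves the operand unchanged. The second version shows the "always claim in range" approach described above. Double precision stands in for the 80-bit x87 format.

```python
import math

X87_TRIG_LIMIT = 2.0 ** 63  # FSIN/FCOS/FPTAN operand magnitude limit


def fsin_hardware_like(x: float):
    """Return (st0, c2): out-of-range operands are left alone with C2 set."""
    if abs(x) >= X87_TRIG_LIMIT:
        return x, 1
    return math.sin(x), 0


def fsin_always_in_range(x: float):
    """Return (st0, c2) while always reporting the operand as in range."""
    return math.sin(x), 0


big = 2.0 ** 64
print(fsin_hardware_like(big), fsin_always_in_range(big))
```

A caller that checks C2 and then applies its own range reduction only takes that fixup path in the first model.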
Add missing thunk library versions
This fixes an issue where FEX thunks would try to dlopen development libraries, which are missing on most users' devices.
Fixes indirect thunks with 8+ arguments
This fixes a fairly bad crash with OpenGL and Vulkan thunking where every function with 8 or more arguments was likely to break.
This fixes thunks for a bunch of games.
Add support for disabling thunks in application configurations
This is useful for narrowing down thunk compatibility issues in certain applications. While it is still not recommended to enable thunks globally,
this allows more flexibility when tinkering with them.
Implements four more auxv values
FEX implements most of these values for applications to pull, but in some cases we didn't have them set up. Specifically, AT_PLATFORM is required
so ldconfig can work correctly. AT_HWCAP/AT_HWCAP2 are used by an application to check for CPU features, and AT_RANDOM points to a 128-bit
random number that the kernel provides.
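To show what an application sees, here is a short Python sketch that decodes an auxiliary vector as a sequence of (type, value) pairs. The `parse_auxv` helper is a made-up name for this example, and the sketch assumes a little-endian layout. The numeric constants are the standard Linux values for these entry types.

```python
import struct

AT_NULL = 0       # terminator entry
AT_PLATFORM = 15  # pointer to a platform name string
AT_HWCAP = 16     # CPU feature bits
AT_RANDOM = 25    # pointer to 16 random bytes
AT_HWCAP2 = 26    # more CPU feature bits


def parse_auxv(raw: bytes, word_size: int = 8) -> dict:
    """Decode raw auxv bytes into a {type: value} dictionary.

    Assumes little-endian words; stops at the AT_NULL terminator.
    """
    fmt = "<QQ" if word_size == 8 else "<II"
    entries = {}
    for key, value in struct.iter_unpack(fmt, raw):
        if key == AT_NULL:
            break
        entries[key] = value
    return entries


# Synthetic example data; the values are arbitrary placeholders.
sample = struct.pack("<6Q", AT_HWCAP, 0xBFEBFBFF, AT_RANDOM, 0x7FFF0000, AT_NULL, 0)
decoded = parse_auxv(sample)
```

On a Linux system the same function could be fed the contents of `/proc/self/auxv`, read in binary mode, to inspect what the running process was given.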
Misc
Quite a few more things changed this month, but this report has been going on long enough.
Raw Changes
- 32bit
  - Fixes Debug build of VDSO (cf91ab9)
- Allocator
  - Expand stack space when stealing virtual address space (000677a)
- Arm64
  - BranchOps
    - Remove unused std::vector (74e18f4)
  - ConversionOps
    - Eliminate use of temporary in Vector_FToF (b1d98f4)
  - MemoryOps
    - Merge if statement into switch in ParanoidLoadMemTSO (199649b)
    - Remove lingering unnecessary ptrue instances (40d820f)
  - VectorOps
    - Make use of MOVPRFX where applicable (639d6e6)
    - Simplify SVE VSXTL/VSXTL2/VUXTL/VUXTL2 implementations (136f1e2)
- ELFCodeLoader
  - Fixes Proton Experimental on 48-bit VA systems (2fa1a64)
  - Implement four more auxv values (fa5322d)
- External
  - Update vixl submodule (fc6de5f)
- FEXCore
  - Adds support for a timeline profiler interface (62a24bd)
- FEXServer
  - Be robust against invalid packets. (64eb87e)
- Flags
  - Refine _Bfe's shift (abb44d3)
- IR
  - Handle 256-bit VInsElement (9e7daf6)
  - Handle 256-bit LoadContextIndexed/StoreContextIndexed (03f0edc)
  - Handle 256-bit StoreMem/StoreMemTSO/ParanoidStoreMemTSO (70a91ee)
  - Handle 256-bit LoadMem/LoadMemTSO/ParanoidLoadMemTSO (8a14f87)
  - Handle 256-bit LoadContext/StoreContext (d475b0b)
  - Handle 256-bit VTBL1 (2332c41)
  - Handle 256-bit VInsGPR (b7d9c00)
  - Handle 256-bit VExtractToGPR (b3ee5db)
  - Handle 256-bit Vector_FToI (7e81023)
  - Check for invalid conversion masks in Float_FromGPR_S (7291b10)
  - Handle 256-bit Vector_FtoF (b8f7e4c)
  - Handle 256-bit Vector_FToZS/Vector_FToS (13003da)
  - Handle 256-bit Vector_SToF (cb17ee9)
  - Handle 256-bit VDupElement (27b022d)
  - Handle 256-bit VUnZip/VUnZip2 (780e3c7)
  - Handle 256-bit VZip/VZip2 (ab45db1)
  - Handle 256-bit VUShrNI/VUShrNI2 (bb38bcb)
  - Handle 256-bit VSQXTUN/VSQXTUN2 (6cc2912)
  - Handle 256-bit VSQXTN/VSQXTN2 (1c7d416)
  - Handle 256-bit VSMull/VSMull2 (3ac5e04)
  - Handle 256-bit VUMull/VUMull2 (78a0773)
  - Handle 256-bit VUXTL/VUXTL2 (e9f3a5b)
  - Handle 256-bit VUABDL (ebc45df)
  - Handle 256-bit VSXTL/VSXTL2 (f14a5ff)
- Interpreter
  - MiscOps
    - Remove unused StopThread() function (d2e0dc9)
- OpcodeDispatcher
  - Fixes ROR imm OF calculation (eca9353)
  - Fixes flag calculation on ROR and ROL by immediate (b1e475d)
  - Fixes FPREM1 C2 flag calculation (4a09a43)
- Syscalls
  - Fixes 64-bit mmap and munmap (d1b235d)
- Thunks
  - Add support for disabling thunks in config (004c323)
  - Fixes missing thunk librarie so versions (2272b30)
  - Fixes indirect thunks with 8+ arguments (6b3d888)
  - Update Vulkan thunk to v1.3.231 (ddc1027)
  - Guest
    - Enable SSE2 on thunks and set fpmath to sse (671f3e7)
  - X11
    - Reorder and sort X11 interface by headers included. (8d373c1)
  - libX11
    - Fix recursive initialize (d838612)
- ThunksDB
  - Fixes Thunks loaded boolean pointer check (b5fb1cb)
- Utils
  - 64BitAllocator
    - Minor cleanups and optimization for munmap (70a3ceb)
- X87
  - Claim incoming float was in the range for trancendental ops (aa5e92b)
- Misc
  - Segment register index optimization (ecf4891)
  - Implements DAA, DAS, AAA, AAS, AAM and AAD instruction (f26eccd)
  - Ensure Arm64Emitter uses FEX allocator (ffb4de9)
- unittests
  - Amend mm register usage in H0F38/66_04.asm test (cada0d5)
  - asm
    - Adds more extensive FPREM/FPREM1 tests (b726f60)
  - gvisor
    - Adds a bunch of tests to flakes (5bef13d)