Releases: FEX-Emu/FEX
FEX-2402
Read the blog post at FEX-Emu's Site!
Welcome back everyone! After last month's cancelled release and this month's being a bit late, we have a lot of changes to cover.
More JIT performance improvements
A lot of the work these past two months has gone into optimizing our JIT further. We ran Geekbench and Bytemark against these changes, and both showed a marginal
performance improvement, with Bytemark showing the biggest improvement of 16% in one sub-benchmark. Many of the performance
improvements target real-world applications rather than benchmarks, so games see more of an improvement than these numbers suggest.
As usual, explaining each individual optimization would take too long, so we're going to spam out a bunch in a list.
- Removes a vtable indirection for syscalls
- Fix RCL/RCR wraparound behaviour
- Remove process-wide lock in JIT
- Fixes syscall rcx/r11 state
- Optimize SIB address calculation from three instructions to one
- Optimize TST instruction with -1
- Optimize TST more
- Improve XCHG instructions
- Optimize rotates
- Optimize CDQ
- Optimize shifts
- Optimize PTEST, VTESTP, PDEP
- Optimize SHA256 instructions to remove spilling
- Optimize CMPXCHG
- Stop zero extending a bunch of instructions where it doesn't matter
- Optimize ANDN
- Optimize a bunch of instructions using NZCV flags
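As a small illustration of the SIB item in the list above: an x86 SIB operand computes base + index * scale + displacement, where the scale is 1, 2, 4 or 8. The sketch below only shows that arithmetic. The function name is hypothetical and this is not FEX's code. The point is that the base + (index << shift) part corresponds to what a single ARM64 `add` with a shifted register operand computes, which is presumably the kind of folding the optimization relies on.

```cpp
#include <cstdint>

// Hypothetical helper, not FEX's implementation: computes the effective
// address of an x86 SIB operand. scale_log2 is the 2-bit SIB scale field
// (0..3), meaning a multiplier of 1, 2, 4 or 8.
constexpr std::uint64_t SibAddress(std::uint64_t base, std::uint64_t index,
                                   unsigned scale_log2, std::int32_t disp) {
  // base + (index << scale) is what one ARM64
  // "add xd, xn, xm, lsl #scale" instruction computes.
  const std::uint64_t scaled = base + (index << scale_log2);
  // The displacement is sign-extended to 64 bits and added with wraparound.
  return scaled + static_cast<std::uint64_t>(static_cast<std::int64_t>(disp));
}
```

For example, `SibAddress(0x1000, 4, 3, 0)` addresses element 4 of an array of 8-byte values starting at 0x1000.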
Fix glibc clone usage of CLONE_CLEAR_SIGHAND
glibc 2.38 and newer use this clone flag when executing a program. We already fixed this in 2312.1, but can now make
a note of it.
Fix VDSO symbol fetching on ARM64
This is a fairly minor change, but the bug behind it could cause a big performance hit. When FEX was querying for VDSO interface functions we were using the wrong names on
ARM64. Since the wrong names were used, we always fell back to the slower glibc implementation of these functions. This in particular fixes a
performance hit when games call clock_gettime excessively.
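The post doesn't say which names were wrong, but the Linux vdso(7) man page documents that the exported symbol names differ per architecture, which makes this an easy mistake. A minimal sketch of that difference for clock_gettime follows. The enum and function names are hypothetical and only for illustration.

```cpp
#include <string_view>

// Hypothetical enum for illustration only.
enum class HostArch { X86_64, AArch64 };

// vDSO symbol names for clock_gettime as documented in vdso(7).
// Looking up the x86-64 name on an AArch64 host finds nothing, so the
// caller would have to fall back to a slower path.
constexpr std::string_view VdsoClockGettimeName(HostArch arch) {
  switch (arch) {
    case HostArch::X86_64:
      return "__vdso_clock_gettime";
    case HostArch::AArch64:
      return "__kernel_clock_gettime";
  }
  return {};
}
```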
Fix Proton again
Sometime in December there were some changes to Valve's Proton layer which broke it under FEX. This has now been fixed.
Expose Linux 6.6
With some relatively minor changes we now support reporting kernel version 6.6 to the guest application. This gives us a range from v5.0 to v6.6 now.
Workaround hang when process is forking
A long-standing bug in FEX is that sometimes a process can hang when it is forking, usually to execute another program. We have now worked around this
issue to an extent that lets the application continue. It's not a full fix because the process can still crash, but a crash is easier to notice than a
program hanging forever in the background.
Commonize some WOW64 code to share with ARM64ec
In preparation for sharing some code with FEX built for ARM64ec, some Windows code has been moved to a common location.
An absolute ton of work went in to thunking
Over the past couple months this has been one of the more active projects within FEX. Today FEX has support for thunking 64-bit x86 libraries across
to ARM64. A significant portion of this work is doing analysis of API interfaces in order to allow thunking 32-bit x86 libraries over to ARM64
libraries with data repacking. This isn't yet complete but since a ton of work has gone in to this, we wanted to call it out.
NOTE: Memory leak on long-running processes like Steam
We have found a memory leak when a process shuts down a thread that has been around for quite a while. We only identified this leak during the last month and it hasn't been fixed yet.
We are hoping to fix this bug for the next release, but be aware that long-running processes like Steam have a relatively aggressive memory leak. This is exacerbated by how Steam spins up threads for
doing work, which makes this application hit the leak particularly hard.
Raw Changes
FEX Release FEX-2402
-
Arm64
-
Removes a vtable indirection in syscalls (743df8d)
-
BranchOps
-
Fix unused-variable warning (f515b1e)
-
ArmEmitter
-
Support single use forward labels (8c31630)
-
CPUID
-
Removes Init and just uses constructor (f5997a0)
-
Config
-
Fixes JSON parsing of "ArgumentHandler" types (bd1e029)
-
Dispatcher
-
Convert GetCompileBlockPtr to using PMF helper (82ce76b)
-
Removes unused asserting CompileBlock function (5b4e9c6)
-
Externals
-
Update xbyak to v7.02 and switch away from fork (9da08b4)
-
FEX
-
Removes legacy kernel 32-bit allocator (de2cd46)
-
FEXConfig
-
Initialize paths before trying to read configuration files (79526b9)
-
FEXCore
-
Fix RCL/RCR shift wraparound behaviour (c0be974)
-
Use TMP1-4 for values that need preserving across spills (0e97f8f)
-
Decompose some std::function usage to regular pointers (615cfe0)
-
Pass thread object to HandleUnalignedAccess (d488592)
-
Removes SRA option, it's now permanently enabled (5467c3e)
-
Removes context wide and map lookup (eea2e7b)
-
Optimize HostFeatures and CPUID feature calculation (0071c1b)
-
Warn if MDWE is set (00669a1)
-
Changes ParentThread ownership from the CTX to the frontend, take 2 (b4b8e81)
-
Describe exit function linking object with a structure (93ec676)
-
Removes stale references to x86 JIT (8665490)
-
Removes old InternalThreadState header (12b72f9)
-
Moves OS thread creation to the frontend (0a4e064)
-
Moves XID check to the frontend (7524029)
-
FEXLinuxTests
-
Fix build warnings (d34302a)
-
FEXLoader
-
Moves thread management to the frontend (5e26b77)
-
Temporarily disable CLONE_CLEAR_SIGHAND (26c9d5d)
-
Fix incorrect format strings (3a5ac39)
-
GdbServer
-
Fixes crash on gdb detach (a1cf14f)
-
HostFeatures
-
Supports runtime disabling of preserve_all (235f32c)
-
InstCountCI
-
Fixes test to not use relative data (3db31a6)
-
JIT
-
Fixes broken register in VTBX1 (9841983)
-
Jitarm64
-
Implements spin-loop futex for JIT blocks (750b0b7)
-
Linux
-
Decouple thread object creation and tracking (5e5984a)
-
Implements a fault safe memcpy routine (8e3d4a3)
-
Adjust when clone allocates stack memory (bd13052)
-
OpcodeDispatcher
-
Fixes syscall rcx/r11 generation (56d8080)
-
Initial support for runtime long-mode switch (4b37921)
-
Fixes flags generation in imul (3d2cbc5)
-
Optimize SIB addr calculation (81c85d7)
-
PassManager
-
Removes unused exit handler (12923ba)
-
Scripts
-
More changes to InstallFEX script (b613576)
-
Updates InstallFEX with supported Ubuntu versions (eb5cf1a)
-
SpinLockWait
-
Fixes unexpected lock success (472a701)
-
SpinWaitLock
-
Removes unused variable in spin-loop fallback (a4a1d60)
-
TestHarnessRunner
-
Move to its own tool folder (https://github.com/FEX-Emu/FEX/commit/...
FEX-2312
Read the blog post at FEX-Emu's Site!
We're back with another month of changes. After last month being a bit slower, we're back in the swing of implementing more optimizations and bug fixes. No dilly dallying, let's get right in to it!
More optimizations this month!
Once again this month has a whole bunch of optimizations, which is very exciting! We will lightly go over what changed.
Keep guest SF/ZF/CF/OF flags resident in host NZCV
This is one of the bigger optimizations this month, and it needs a bit of backstory. x86 has a flags register called EFLAGS which contains quite a few assorted bits of information. The flags we care about here are SF, ZF, CF, and OF. These are typically set by ALU operations to describe the result. So something like an integer Add will set ZF if the result was zero, SF if the result has the sign bit set, CF if a carry occurred, and OF if the operation overflowed. These are usually quite cheap for the CPU to calculate by itself, but manually calculating the flags usually takes a few additional instructions each.
The original implementation inside of FEX would spend the additional instructions and calculate each flag manually. This
would usually end up with a dozen or so additional instructions for calculating flags. While FEX would typically optimize out the calculations if
they weren't used, they would still add CPU time when we couldn't.
Luckily ARM also has a flags register called NZCV which maps almost perfectly to x86's EFLAGS. This lets us optimize these instruction implementations
to instead use the ARM flags directly. This has a couple of effects, not only does it remove the instructions from our code generation, it has
knock-on effects that the flags are now stored inside of the NZCV which reduces memory accesses. A multi-hit combo for improving performance.
While not all x86 instructions map their flags registers 1:1, this has a fairly significant performance uplift in most situations!
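To make the cost concrete, here is a minimal sketch of what calculating these four flags by hand looks like for a 32-bit add. This is illustrative code with hypothetical names, not FEX's implementation. An ARM64 `adds` instruction produces N, Z, C and V with these same meanings for an addition, so all of this work collapses into the add itself. Subtraction is one case that is not 1:1, since ARM's carry flag after a subtract is the inverse of x86's borrow-style CF.

```cpp
#include <cstdint>

// Hypothetical result type for illustration.
struct AddFlags {
  bool sf;  // sign flag: top bit of the result
  bool zf;  // zero flag: result is zero
  bool cf;  // carry flag: unsigned overflow
  bool of;  // overflow flag: signed overflow
};

// Manually derives the x86 flags for a 32-bit add.
constexpr AddFlags FlagsForAdd32(std::uint32_t a, std::uint32_t b) {
  const std::uint32_t r = a + b;  // wraps on unsigned overflow
  return AddFlags{
      (r >> 31) != 0,
      r == 0,
      // A carry out happened if the wrapped result is smaller than an input.
      r < a,
      // Signed overflow: both inputs have the same sign and the result's
      // sign differs from it.
      (((a ^ r) & (b ^ r)) >> 31) != 0,
  };
}
```

For example, `FlagsForAdd32(0x7FFFFFFF, 1)` sets SF and OF but not CF, while `FlagsForAdd32(0xFFFFFFFF, 1)` sets ZF and CF but not OF.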
Dedicate registers for PF/AF
Related to the previous change, x86 has two flags stored inside of EFLAGS, PF and AF, that don't have a direct equivalent on ARM CPUs. These two
flags are fairly uncommonly used, but instructions will still generate them. These flags have the additional problem that they are fairly costly to calculate,
with one of them requiring a GPR population count instruction, which ARM doesn't even support without the new CSSC instruction extension. While in
most cases the result of these flags isn't used, the overhead of calculating them can add up a bit. This is why we are now dedicating two registers to
these flags to reduce their overhead as much as possible!
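For reference, this is what the two flags mean, written as a minimal sketch with hypothetical names. PF is set when the low 8 bits of the result contain an even number of set bits, and AF is set when an add carries out of bit 3. The parity below is done with an XOR fold instead of a population count. How FEX actually generates these is not shown here.

```cpp
#include <cstdint>

// PF: set when the low byte of the result has an even number of 1 bits.
constexpr bool ParityFlag(std::uint32_t result) {
  std::uint32_t low = result & 0xFF;
  // Fold the byte onto itself so bit 0 ends up holding the XOR of all 8 bits.
  low ^= low >> 4;
  low ^= low >> 2;
  low ^= low >> 1;
  return (low & 1) == 0;
}

// AF: set when an add carries out of bit 3 into bit 4.
constexpr bool AuxCarryForAdd(std::uint32_t a, std::uint32_t b) {
  // Bit 4 of a ^ b ^ (a + b) is exactly the carry that entered bit 4.
  return ((a ^ b ^ (a + b)) & 0x10) != 0;
}
```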
Misc optimizations
- Optimize BT/BTC/BTS/BTR
- Optimize shifts/rotates
- Optimize selects & branches & more nzcv goodies
- Optimize three sha instructions
- Make "not" not garbage
- Optimize memcpy and memset when direction is compile time constant
With all these optimizations in place this month we have a fairly significant performance uplift!
<-- Geekbench and bytemark graphs -->
While Geekbench is showing a fairly modest 17.6% performance uplift, bytemark is showing up to a 60% performance uplift! Over the course of
the last three months we have had benchmarks that have improved by over 100%! These improvements can be seen in games as well, with some CPU heavy
games having their FPS improve by over 2x. A lot of the games we tested have even changed from being CPU limited to GPU limited on our Lenovo X13s
laptops! We are looking forward to new laptops based on Snapdragon X
Elite being released in the middle of next year!
Various bug fixes
In addition to performance improvements, we have some bug fixes this month.
- Fixed corruption in the JIT
  - Caused corruption with x87 heavy games
- Fixes integer multiply corrupting results
  - Corrupted some register state, which was breaking the game Dungeon Defenders
Support extracting erofs images
One of the features that FEXRootFSFetcher was missing was the ability to extract erofs images once downloaded. This was because we didn't know that
erofs-utils provided an application for extracting these images without FUSE. Turns out the developers put an extractor inside of their fsck
application that we had completely missed! Now if a user wants to extract an x86 rootfs image for lower overhead, they can do this directly from our
FEXRootFSFetcher tool.
Preparation for improving gdbserver
GDBServer is a socket interface that GDB supports for remotely debugging applications. One of the harder things about working on FEX-Emu is that
debugging an application running under it is usually quite hard. GDBServer is a way to improve this situation so that GDB can remotely connect to a FEX process.
There's a bunch of work this month towards cleaning up this interface and getting it to work correctly. While it is still not quite usable for
debugging, we are working towards this so applications can actually be debugged!
Improvements to WOW64 compatibility for newer WINE
Newer versions of WINE have changed some behaviour around WOW64 support. So this month we have added support for some of this newer behaviour. Thanks
again to Bylaws for implementing this!
FEX rootfs image updates
This month we are updating our rootfs images to incorporate the latest Mesa 23.3.0 release that
occurred a few days ago. We have updated our Ubuntu 22.04, 23.04, 23.10, ArchLinux, and Fedora 38 images with this latest version of Mesa. As usual, if
there are any issues, let us know so we can sort them out.
Raw Changes
FEX Release FEX-2312
-
Arm64Emitter
-
Dedicate registers for PF/AF (9b64674)
-
Fixes warning (bf147f4)
-
Config
-
Removes Threads option (3c73357)
-
Dispatcher
-
Fixes corruption when spilling SRA registers (f6b1434)
-
EmulatedFiles
-
Stop relying on O_TMPFILE (2e24f34)
-
FEX
-
Only pass CPU tunables to FEXCore and FEXLoader (b4eeb96)
-
FEXCore
-
Removes GetProgramStatus (c8ef77c)
-
Removes InitializeContext API (b35fadf)
-
Fixes passing arguments to ABI helpers (d0f54bc)
-
Removes Get/SetCPUState (7de66ac)
-
Optimize memcpy and memset when direction is compile time constant (85a1c1f)
-
Removes FEX_PACKED from CPUState (8e892ec)
-
Moves debug strings to gdbserver (3f02d7c)
-
Start changing how thread creation works (f328fca)
-
Moves more SignalDelegator functions to the frontend (6e8af29)
-
Removes x86 DebugInfo table (8015ce2)
-
Removes GetExitReason (bdf4089)
-
Disables RPRES until AFP is audited and enabled (b027113)
-
Fixes imul returning garbage data (389c6b1)
-
Work around broken preserve_all support in Windows clang (98f9a65)
-
FEXLoader
-
Wire up gdbserver in the frontend (efc5eb2)
-
FEXRootFSFetcher
-
Supports extracting erofs images (aa1344a)
-
GdbServer
-
Switch over to a unix domain socket (0357bb2)
-
IR
-
Moves remaining NZCV operations to use DestSize (b619f38)
-
IRDumper
-
Fixes missing conditional name (8892580)
-
InstCountCI
-
Moves Sonic Mania code to 32-bit file (470615b)
-
InstructionCountCI
-
Remove Optimal flags (4a31b61)
-
JIT
-
Fixes crash in TestNZ (a8ab8bb)
-
OpcodeDispatcher
-
Optimize three sha instructions (0e1e4c1)
-
Make "not" not garbage (3767f36)
-
ScopedSignalMask
-
Clean up API and use std::unique_lock/shared_lock (8726c8fb7377509bb569826f1...
FEX-2311
Read the blog post at FEX-Emu's Site!
Another month gone by and another FEX release out the door! This last month was a bit less busy as most of our team spent a week in Spain
to take part in XDC 2023! We did still have the rest of the month to do some work though, so let's get
to the changes!
Small bug fixes
This month we fixed a couple of bugs which could have caused spurious crashes! In fact, while testing some upcoming performance optimizations, we fixed
a few unrelated bugs that were crashing Steam periodically! Always nice to see a bunch of little fixes that just improve the software, even if they
aren't a single big fix.
- Fix register corruption when jumping out of JIT
- Fixes double munmap which would cause spurious pointer unmaps
- Fixes crashes when a program would shut down a thread
- Implements RPRES support and Fix implementation issue with ARM's new RPRES feature
  - RPRES gives us the ability to do reciprocals in one instruction instead of using ARM's divide instruction.
  - The bug would have caused invalid data to be returned
  - No CPU supports this yet luckily
- Fixed issue with *at syscalls not working with absolute paths
  - Broke Proton's pressure-vessel in weird and unique ways
- Fixes bug with named enum argument parser
  - This is used to override CPU features with the FEX_HOSTFEATURES option so typically not hit
32-bit thunking infrastructure
While 32-bit thunking is not yet in place, and this month it still isn't fully integrated, some of the code has been landing to work towards this
goal. In order to do 32-bit thunking the right way we are spending a bunch of time ensuring that we have a proper data layout analysis system in place,
based on clang, that does a couple of things. This analysis will let us do automatic translation of data structures from 32-bit to 64-bit and
also alert us if something needs to be manually translated. This needs to be in place because otherwise we can end up in a situation where we
unknowingly corrupt data, and it would be a nightmare to find. So this month we now have the ability to annotate our thunk definitions and start having
clang work for us. While not complete, some of this work has already gotten thunking working for 32-bit Super Meat Boy! It's getting there!
NZCV usage preparation
A big performance improvement that FEX is working on is to use the CPU's flags to directly emulate the x86 flags when possible. This is a long and
arduous task but the performance improvements will be huge once the code lands! A bunch of prep work this month has landed to start down this path but
we're going to need to let this sit in the oven for a bit longer. Check back next month to see if we get there!
Minor optimizations
With XDC being in the middle of the month, it caused most of the bigger work to be delayed so we have a bunch of smaller things this month!
- Minor optimization to bfi/bfxil
  - Removes one or two instructions for some instruction translations
- Optimize atomic fetch operations in to atomic if the result isn't used
  - Removes a couple of instructions if the resulting fetch data isn't used.
- Implements support for ARM's new AFP extension
  - Currently disabled until we can audit the codebase to ensure we aren't corrupting anything
  - Lets us remove an insert after every scalar operation to match SSE behaviour
- Optimize palignr that behaves like a move
  - Compilers shouldn't use this, but now we optimize it to a move
- Optimize pblendw
  - A fairly uncommon instruction but now its implementation is basically as fast as it can be
- Optimize blendps
  - We had already optimized blendpd last month, so this time was to optimize the 32-bit version
  - Fairly commonly used so should improve perf in some games
- Optimize dpps and dppd
  - These instructions do a dot product and a broadcast of their result but we couldn't find a game using it heavily
  - So while this is now optimal, this is unlikely to affect any real game
- Optimize some 3DNow! instructions
  - 3DNow! is a really old instruction extension that is basically only used in some really old games
  - All of these instruction implementations are basically as fast as we can make them now, which is good!
- Optimize direction flag pointer offset calculation
  - This converts a three instruction calculation down to one and stops using a ternary selection
  - This happens with x86's repeat instructions, which typically happens for memcpy and memset
  - Used a lot but is a minimal improvement.
- A few other random bits and bobs!
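To illustrate the direction flag item in the list above: x86 string instructions step their pointers forwards or backwards by the element size depending on DF. The sketch below contrasts a select-based calculation with a select-free one. The second form rests on an assumption that is not taken from the post, namely that the flag is kept as +1 or -1 rather than as a 0/1 bit. The function names are hypothetical, and FEX's real implementation may work differently.

```cpp
#include <cstdint>

// Ternary form: pick the sign with a select on the DF bit.
// size_log2 is log2 of the element size in bytes (0..3).
constexpr std::int64_t StepWithSelect(bool df, unsigned size_log2) {
  const std::int64_t size = std::int64_t{1} << size_log2;
  return df ? -size : size;
}

// Select-free form, ASSUMING the flag is stored as +1 (forward) or
// -1 (backward). The step is then a single scale of that value.
constexpr std::int64_t StepFromSignedDf(std::int64_t df_sign,
                                        unsigned size_log2) {
  return df_sign * (std::int64_t{1} << size_log2);
}
```

Both forms give the same step, for example -4 for a backwards 32-bit `rep movs`.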
AVX optimizations!
While nothing supports our AVX implementation today, we have optimized a handful of instruction translations so they are ready once hardware
supports what we need.
- 256-bit VExtr, VFCADD, VURAvg, VFDiv, VSMax, VSMin, VUMax, VUMin
- Removes a bunch of truncating moves
  - If we know an AVX instruction is operating at 128-bit width, we can remove a redundant move which speeds things up!
Raw Changes
FEX Release FEX-2311
-
ARMEmitter
-
Fix GPR fill mask in FillStaticRegs (3702e51)
-
Arm64
-
Minor optimization to bfxil and bfi (8181e53)
-
ArmEmitter
-
Adds sized Scalar 1 source and 2 source helpers (2e1389b)
-
CPUID
-
Adds some missing cpu core names (ff3f734)
-
Config
-
Fixes string enum parser with multiple arguments (bbd20b4)
-
External
-
Remove a spurious license (26ee63c)
-
FEXCore
-
Removes gdb pause check handler (d4a6b03)
-
Fixes bug in vector ZextAndMaskingElimination pass (a261d99)
-
Removes a warning about assume discarding side-effects (462fff2)
-
Renames raw FLAGS location names to signify they can't be used directly (8dab35c)
-
Implements support for RPRES (b2a8b0c)
-
Support crypto extensions in HostFeatures override (fc70fc3)
-
FileLoading
-
Updates helper to load file that is backed by memory (190f7c2)
-
IR
-
Changes over to automated IR dispatch generation (6543a80)
-
FEXLinuxTests
-
Adds a unittest for eflags and signals around a inlined syscall (b92e716)
-
Compile tests with masm=intel (b45023b)
-
Temporarily limit thunk test execution to 64-bit guests (1ea40ae)
-
FEXLoader
-
Query runtime page size (65e8d09)
-
GDBServer
-
Preparation work to get this moved to the frontend (e91c5ff)
-
GdbServer
-
Fixes returning thread names (1806519)
-
IR
-
Print assert code for IR EmitValidation (5431aa5)
-
Optimize unused result atomic fetch mop to just atomic mop (14e5ea1)
-
Adds scalar vector insert operations (6253f4f)
-
InstCountCI
-
Update rounds{s,d} classification (a287f2a)
-
Adds two missing variants of movd/movq (8538f5b)
-
Support disabling flagm extensions (6db2125)
-
Adds some multi instruction tests (4834236)
-
Support multiple instructions in the tests (f036a0b)
-
Adds missing atomic tests (a5f82a5)
-
Fixes recursive tests with same filename (9c36d10)
-
Support overriding AFP features (5a3cc7b)
-
JIT
-
Implements Print support for vixl sim (5b70209)
-
JITArm64
-
Fixes double munmap issue that was causing crashes (8ee5b5c)
-
Fixes bug in rpres scalar operations (8f8f376)
-
Linux
-
Fixes issue with *at syscalls with absolute paths not working (cf9c2aa)
-
Fixes warning in 32-bit clock_settime (https://github.com/FEX-Emu/FEX/commit...
FEX-2310
Read the blog post at FEX-Emu's Site!
Welcome back to another monthly release for FEX-Emu. You might be thinking that after last month's optimizations that we wouldn't have much to show
for this month. Well you would be wrong! We optimized even more! Let's get in to it!
More instruction optimizations!
As stated last month, we introduced Instruction Count CI which has allowed us to do targeted optimizations of our code. Once again we have optimized so
many instructions that it would be impossible to go through each individual change. Check our detailed change log if you want to see all the
instructions optimized. Let's just look at the final benchmark numbers compared to last month.
<- Geekbench 5 versus last month ->
<- Bytemark versus last month ->
Let's talk about the Geekbench 5.4 results first since they don't look very
impressive at first glance. While we are only showing a ~13% performance improvement, this number is an
aggregate of multiple smaller benchmarks. Looking at the breakdown of all the subtests, there are some that have improved by up to 66%! This is of
course because some benchmarks take advantage of instructions that we optimized more heavily than others. Luckily this improvement also carries over to
video games.
The Bytemark improvements are a bit hard to make out. Some numbers are hardly changed at all while a couple stand out as huge improvements. This
mostly comes down to some very specific instruction optimizations that significantly improved performance in a couple of tests, while the rest don't show
up as much.
With this month's optimizations and last month's combined, the results end up being significantly more interesting. Some
Geekbench results are showing an average of 50% to 65% higher performance,
sometimes even higher. Some benchmark results are showing nearly 2x the performance compared to before! These numbers translate very well to gaming
performance, where some games have more than doubled their FPS over the past couple of months.
We're not slowing down either, we still have a ton of optimizations to go on our march to get our emulation close to native performance.
Support preserve_all for interpreter fallbacks
We're calling out this particular optimization for three reasons.
- It improves performance of x87 heavy code
- It only works with the super recently released Clang 17
- wine packages in FEX's rootfs use x87 heavily in some instances.
Let's talk about what this optimization is and how it improves performance. In Clang 17 they added support for a new function calling ABI called
preserve_all. x86 has supported this ABI for a very long time but it is a new addition for Arm64. This ABI breaks convention from the regular AAPCS64
ABI in that a called function must first save pretty much any register it wants to use, unlike AAPCS64 where a function has a bunch of
registers free to use. This is beneficial for FEX's JIT since we can save significant time by not saving any state when we need to jump out of the
JIT and execute x87 softfloat code.
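As a rough sketch of how a function opts in to this ABI, Clang exposes it through the `preserve_all` function attribute. The macro and function names below are hypothetical and this is not FEX's code. The feature check is only there so the sketch still compiles on compilers or targets without the attribute, where it falls back to the normal calling convention.

```cpp
#include <cstdint>

// Use the attribute only where the compiler reports support for it.
#if defined(__has_attribute)
#  if __has_attribute(preserve_all)
#    define SKETCH_PRESERVE_ALL __attribute__((preserve_all))
#  endif
#endif
#ifndef SKETCH_PRESERVE_ALL
#  define SKETCH_PRESERVE_ALL  // fallback: regular calling convention
#endif

// Hypothetical stand-in for a helper that JIT code would call. With
// preserve_all, this function is responsible for saving any register it
// uses, so the caller doesn't have to spill its state around the call.
SKETCH_PRESERVE_ALL std::uint64_t FallbackHelperStub(std::uint64_t value) {
  return value + 1;
}
```

The trade-off is that the cost moves into the callee, which pays off when the caller is holding a lot of live state across the call, as described above.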
In particular this manifests to upwards of a 200% performance improvement in some microbenchmarks around x87 code! While this advantage is quite
significant, the only way to take advantage of it is to compile FEX with Clang 17. Since this compiler release came out only last month, pretty much
no distros have adopted it so it is unlikely to be used soon. In a few months time, or years depending on distro, they should naturally upgrade their
compiler stack and free performance improvements will happen.
As a fairly major side note to this excursion, we have found that the 32-bit wine packages compiled in Canonical's repository use x87
heavily in some instances. This causes some really bad performance issues with some 32-bit games and installers. It is recommended to use Proton where
you can here since it compiles its 32-bit libraries with SSE optimizations instead, which work significantly better.
FEX-Emu may look to provide its own wine packages in the future with this same optimization in place to help alleviate some of this burden. Until then
it is recommended to use FEX's x87 reduced precision mode to try and alleviate some of the overhead.
Fixes a bug when chrooting in to rootfs
Quite a few months ago FEX-Emu changed some behaviour around chrooting in to the FEX rootfs.
While chrooting isn't generally advised, if a user wants to modify the rootfs then it's the only option. While we provide some scripts inside of our
rootfs images to facilitate this, it has been broken for a few months.
We have now fixed this bug in both FEX-Emu and the scripts inside of our rootfs images. So if you want to modify packages inside of the image you will
now be able to do so again. Make sure to update your image to get the new scripts!
Remove x86-64 JIT and Interpreter
This has been a long time coming in the FEX-Emu project. We have had support for an IR interpreter and x86-64 host JIT for compatibility testing since
the project's inception. It has always been the case that if these CPU backends get in the way of the ARM64 JIT that they would get removed.
That time has finally come. Due to some upcoming changes around how flags are represented in FEX's JIT, and the general burden of implementing
FEX's IR operations three times, often undoing an x86->Arm64 translation to go back to x86, keeping them has been deemed too much of a burden and these have
been removed. This is a necessary step for our ARM64 JIT to gain the additional performance that we will be working towards in the coming months!
We are looking forward to future ARM platforms that can take Radeon GPUs through PCIe slots to regain a platform which can test RADV directly, but
until that point we will have to make do with our current devices.
Instruction Count CI on x86-64 hosts
While we removed our x86-64 JIT, we do have a fun addition to our instruction count CI. Now developers that don't have an Arm64 device handy can still
run the Instruction Count CI and attempt to optimize implementations. This is as simple as building
FEX on an x86-64 device with the Vixl disassembler and simulator enabled, and you will be able to optimize to your heart's content!
We've got a need for JIT speed! Let's go fast!
Implement first optimizations using 128-bit SVE
This is a fairly minor change but previously FEX was not using any 128-bit SVE instructions. This is primarily because there aren't really any SVE
supporting devices in the consumer market, even though Snapdragon hardware theoretically supports it. 128-bit SVE adds a couple of optimizations that
we can use.
- Wide-element shifts
- Index instruction for generating simple index masks
While these are fairly simple initially, they take some translations from six instructions down to one or two, depending on the case. This is a fairly
minor change, but it is good to note that FEX is now taking advantage of SVE if it is available!
Adds WOW64 frontend
This has been a long time coming, with us adding initial mingw support back in FEX-2305. FEXCore now supports being built with a brand new WOW64 WINE
frontend. While currently not being utilized, this will allow WINE to integrate FEX directly in to its WOW64 layer for running both x86 and x86-64
applications on Arm64 host devices.
This is a very substantial change to how WINE integrates with FEX, since today FEX-Emu just runs the full x86-64 WINE process and eats the overhead of
emulating everything WINE needs to do. With the WOW64 layer now implemented, a bunch of the WINE code can now be Arm64 native code and when it needs
to execute application code it just jumps back to the emulator. This is similar to how Windows natively handles its emulation through its "XTA" layer.
Sadly today this is only wired up to work through the 32-bit x86 part of the layer. We still need to get set up to support Wine when it inevitably supports
Wow64 for x86_64->Arm64.
Big shout out to ByLaws for implementing support for this! We look forward to future Wine integration work landing!
Implement thunking support for wayland-client and zink
We have some improvements to thunking this month! As we are working towards supporting thunking more code, we implemented some features to get
wayland-client thunking wired up. While this support is early, it is enough to get Super Meat Boy up and running using wayland and zink overrides
within a Wayland environment. We look forward to additional thunking improvements going forward so that performance can be improved everywhere.
Raw Changes
FEX Release FEX-2310
-
AppConfig
-
Removes Steam config (02da6d6)
-
Arm64
-
Fixes inline syscalls (4e9a114)
-
Optimize wide shifts slightly for 64-bit OpSize (f5c4e28)
-
Recover two unused vector vector temporary registers (90f7937)
-
ALUOps
-
Remove spills in PEXT (4604c01)
-
VectorOps
-
Elide moves where applicable in 128-bit VSQXTUN2 (fd1b639)
-
Improve handling of 128-bit vector VInsElement (950a8db)
-
Elide moves in ASIMD VUShrNI2 if possible (b3269f2)
-
Assert VTMP1 and VTMP2 are sequential in VTBL2 (8168a49)
-
Fix SVE aliasing-path move in VSShr (ffb5876)
-
CI
-
Run tests with <30s runtime first (e1eb151)
-
CPUID
-
Enabled Enhanced REP MOVSB/STOSB (6fe643d)
-
Config
-
Fixes core sanitization (da3e172)
-
ConstProp
-
Fixes unscaled signed 9-bit range (72d092e)
-
DeadContextStoreElimination
-
Silence unused function warning (773e946)
-
ELFCodeLoader
-
Expose FSGSBase in getauxval HWCAP2 (fbc4bda)
-
FEX
-
Moves Linux u...
FEX-2309.1
Hotfix patch to fix a bug around accessing files!
FEX-2309
Read the blog post at FEX-Emu's Site!
Last month we hinted that we didn't get all optimizations in that we wanted. There's more of that this month but we have also had an entire month to
push optimizations in. This month was a whirlwind of optimizations improving performance all over the place because of one feature that landed:
Instruction Count continuous integration! Let's dive in to what this is.
Instruction Count CI
This is a major feature that we added last month that doesn't directly affect users but is such a huge quality of life improvement to our developers
that we need to discuss what it is. At its core, InstCountCI is a database (actually JSON) of x86 instructions that FEX-Emu supports, showing how
each instruction gets converted to Arm64 instructions. This is in a textual format so these instruction implementations are easy to read and quick
to update when the implementation changes. This has had a profound effect on our developers, who can't help but look at poor instruction
implementations and find ways to optimize them.
<- Optimized versus non-optimized picture ->
As you can see in the example, one very complex instruction that was not optimal before has now translated in to something much more reasonable.
So far this has nerdsniped at least half a dozen developers in to finding more optimal implementations of these instruction translations!
Some design considerations must be understood when looking at FEX's instruction implementations, though. The most important thing to remember
is that these implementations are looking at the instruction in a vacuum. These are translated as only single instruction entities, so any sort of
multi-instruction optimization is not going to be visible in this CI system. Additionally this isn't getting run on hardware in our CI, so
implementations that are close on instruction count may have wildly different performance characteristics depending on the hardware. So while it is a
good guide for getting eyes on the assembly, there still needs to be some knowledge as for what the translation is doing to ensure it's both fast and
correct.
This CI system was used heavily this last month for what our next topic is.
Optimization Extravaganza!
With InstCountCI in place, we can now quantify optimizations going in to the FEX CPU JIT without accidentally compromising performance of other
instructions. With this in-place we have had an absolute ton of CPU optimizations land in our JIT, enough that if we went through them all it would
take longer than all of our previous progress reports!
Instead of going through each individual change, let's just discuss the main optimizations that have landed. The bulk of optimizations has
been making sure the translation between SSE instructions to Arm64's ASIMD instructions is more optimal. This is because reasoning about vector
optimizations is easier in this instance, and also because games more heavily abuse vector instructions than regular desktop applications. There were
other optimizations like some flag generation instructions becoming more optimal and eliminating redundant move instructions as well!
Let's take a look at the bytemark results.
<- Bytemark graphs ->
There's some surprising uplift in numbers here! Even more so since bytemark shouldn't heavily utilize SSE instructions, so this is mostly coming
from general optimizations that occurred. Let's take a look at another benchmark for fun.
<- Geekbench 5.4.0 graph ->
Whoa, that is a surprising uplift in one month! Geekbench actually has some
benchmarks that use vector operations, so they can get more improvements than expected. We should expect even more performance once we
start optimizing more non-vector instruction translations!
As for gaming benchmarks, we're not going to do any in this blog post, but we have been told that due to various optimizations this month
Portal performance has gained 30% and Oblivion has gained 50%. Big improvements towards making games feel better when playing them. The main concern here is that the
Adreno 690 in our Lenovo X13s test systems is actually quite unstable during testing, so finding suitable games that are CPU bound without crashing
the kernel driver is surprisingly difficult. Most of the lighter games that don't crash the MSM kernel driver are already running at hundreds of FPS
anyway so it isn't interesting.
A fun quirk of optimizing vector operations this month, we have finally landed our first optimizations that use ARM's SVE instruction set when
operating at 128-bit width. Turns out there are a few optimizations that can be done here aside from implementing AVX with the 256-bit version! I'm
sure we will see more of these as we continue optimizing.
Remove most implicit sized IR operations
Continuing from the last topic, this is one of the main changes that allows us to start working on non-vector instruction optimizations. FEX's IR
around general purpose ALU operations has a history of using implicit sized IR operations. This means we would check the size of the incoming data
sources and make an assumption for what the operating size of the whole thing should be. While this worked, it has been an absolute thorn in our side
for years at this point. Any time we would make a seemingly innocuous change it would subtly change the behaviour of some IR operations as a new size
propagates through the stack. Now that all of these operations explicitly state their operating size at generation time there is less room for error.
This follows how our vector operations worked, where all of these were explicitly sized from the start and have had significantly fewer issues over
time.
With this change in place we can start optimizing general purpose ALU operations with less worry about breaking the world.
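To illustrate the difference, here is a minimal sketch. It is not FEX's actual IR: `Value`, `Truncate`, `AddImplicit`, and `AddExplicit` are hypothetical names chosen for this example. It shows how an implicitly sized add changes behaviour when one of its sources changes size, while an explicitly sized add does not.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical IR value: a payload plus the size (in bytes) it was produced at.
struct Value {
  uint64_t Data;
  uint8_t Size;
};

// Truncate a result to the requested operating size.
static uint64_t Truncate(uint64_t Data, uint8_t Size) {
  assert(Size == 1 || Size == 2 || Size == 4 || Size == 8);
  return Size == 8 ? Data : Data & ((1ULL << (Size * 8)) - 1);
}

// Implicit sizing: the operating size is inferred from the sources, so a
// change in how either source is produced silently changes this op too.
Value AddImplicit(Value A, Value B) {
  const uint8_t Size = std::max(A.Size, B.Size);
  return {Truncate(A.Data + B.Data, Size), Size};
}

// Explicit sizing: the op states its operating size when it is generated.
Value AddExplicit(uint8_t Size, Value A, Value B) {
  return {Truncate(A.Data + B.Data, Size), Size};
}
```

With two 4-byte sources, `AddImplicit` wraps at 32 bits. If an unrelated change makes one source 8 bytes wide, the same call no longer wraps, whereas `AddExplicit(4, ...)` keeps wrapping regardless.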
Mingw work
Some more work landed this month towards getting WINE WOW64 support wired up. This includes adding a toolchain file to help
facilitate cross compiling, no longer saving and restoring the x18 platform register, and various other things. While full support isn't yet merged, there's
a lot of preliminary work landing so we can support this. While this work is very early, it is already showing significant performance improvements
for Windows native games. A game like Bioshock Infinite is already running faster than FEX emulating x86 WINE fully! Look forward to future
improvements and integrations as this gets wired up!
Raw Changes
FEX Release FEX-2309
-
ARM64
-
Optimize vector zeroing (eaed5c4)
-
ARMEmitter
-
Handle SVE load and broadcast quadword groups (a9dea29)
-
Handle SVE load and broadcast element group (710a392)
-
Handle load/store multiple structures (scalar plus scalar) groups (eda67eb)
-
Handle SVE ADR (139dd4c)
-
Handle SVE CPY (immediate) (72357e5)
-
Migrate off vixl float utils (5a0a6dd)
-
Handle SVE FCPY (predicated) (8fce133)
-
Remove resolved TODO comment (5de7eee)
-
Handle contiguous first fault load (scalar plus scalar) group (0109e88)
-
Handle SVE FP multiply-add long groups (6f4a23d)
-
Arm64
-
Only allocate vixl::Decoder if enabled (2d78b1f)
-
Optimize AES operations by caching a zero register (7f99738)
-
Optimize AESKeyGenAssist (02b891c)
-
Optimize VFMin/VFMax (0819338)
-
Optimize SVE VInsElement (1f2c5fc)
-
Stop abusing orr in LoadConstant (6d562f8)
-
Optimize non-optimal BFI move case (1029bb1)
-
Optimize CacheLine{Clear,Clean} (1343c14)
-
Adds stats to the disassembly (53ac8ab)
-
Implement first SVE-128bit optimization (fe35135)
-
Remove erroneous LoadConstant (c4c7620)
-
ConversionOps
-
Remove redundant moves in AdvSIMD VInsGPR (6e4765d)
-
Add missing half-precision conversions to scalar functions (172c8f3)
-
Add scalar support to Vector_FToI (a62ba75)
-
EncryptionOps
-
Use MOVI reg, #0 to zero vectors (4286d44)
-
VectorOps
-
Remove redundant moves in SVE VExt...
FEX-2308
Read the blog post at FEX-Emu's Site!
Whoa jeez, another month already? We've had our heads down working hard this last month, trying to make FEX-Emu the greatest x86/x86-64 emulator on
Linux. A huge focus this month is optimizations because of course what we want is to go fast. We're all cats and we've got the zoomies.
Every day we're optimizing
As said before, this month has been an absolute mess of optimizations as we've been optimizing the project as thoroughly as possible. We could spend
another month talking about the optimizations that we did this last month, so let's blast through what we did. First let's show a graph for how much
FEX has improved over this last month.
Look at those numbers! Some benchmarks from bytemark have cracked the 200% mark! While a couple benchmarks do have regressions, we're pretty sure that
we know what they are and they will be rectified soon. These are the sorts of optimizations that can be felt in real games though.
So let's quickly run through some of the optimizations we ran in to this last month.
Switch to using half-barriers for memory accesses
When ARM hits an unaligned atomic memory access, we previously wrapped that load or store in two slow barrier instructions. We can now safely only use
one barrier on one half of the instruction! This makes unaligned accesses quite a bit quicker.
Optimize x87 memory accesses
This removes a couple instructions when we access 80-bit floats.
Only clear icache for code
Some large code blocks can generate a decent amount of metadata that don't need an icache clear. Can remove a bit of stutter.
Const prop BFI operation
Sometimes when a BFI instruction has constants in it, we can remove the BFI instruction.
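As a rough sketch of the idea (not FEX's actual pass; `Bfi` and `FoldBfi` are hypothetical names), a bitfield insert whose inputs are both known constants can be evaluated at compile time and replaced by a single constant:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Semantics of a bitfield insert: copy the low `Width` bits of Src into Dst
// starting at bit `Lsb`, leaving all other bits of Dst untouched.
uint64_t Bfi(uint64_t Dst, uint64_t Src, unsigned Lsb, unsigned Width) {
  assert(Width >= 1 && Lsb + Width <= 64);
  const uint64_t FieldMask = (Width == 64) ? ~0ULL : ((1ULL << Width) - 1);
  return (Dst & ~(FieldMask << Lsb)) | ((Src & FieldMask) << Lsb);
}

// Hypothetical constant-propagation hook: an empty optional means "value not
// known at compile time". If both inputs are known, fold the BFI away.
std::optional<uint64_t> FoldBfi(std::optional<uint64_t> Dst, std::optional<uint64_t> Src,
                                unsigned Lsb, unsigned Width) {
  if (Dst && Src) {
    return Bfi(*Dst, *Src, Lsb, Width);
  }
  return std::nullopt;
}
```

When `FoldBfi` returns a value, the JIT only needs to materialize that constant. When it returns `std::nullopt`, the BFI has to be emitted as usual.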
Optimize vector TSO loadstores
Vector operations typically need an additional add on their address if the offset can't fit in the instruction encoding as an immediate. We missed the
optimization for the case in which the immediate offset CAN actually fit. This commonly removes an instruction per vector loadstore.
Use TST instead of CMN
Sometimes these instructions hit a slow path on Cortex-A57 so a minor win there.
Optimize xor reg, reg
x86's universally agreed upon instruction for generating zero in a register is xor. This instruction isn't actually optimal in ARM hardware. We now
emit a move of constant zero which gets optimized to register rename on most ARM hardware.
More instructions optimized
These mostly just make the implementations use fewer instructions, which makes them faster. There will be way more of this in the coming month.
- rotate flag calculations
- phsubsw/phaddsw
- cmpxchg8b/16b
- psad*
- 8-bit, 16-bit rcr
- fcmov
- shld/shrd
- movss
- maskmovdqu
- maskmovq
- phminposuw
- fild
- PF flag calculation optimization
- Optimizing packing RFLAGS
- Optimize ADD/ADC OF flag packing
Fixes bug in SSE4.2 pcmpestri
This was causing Java applications to crash. Now that we fixed a different bug last month, we now have Java working to an extent. It still crashes on
shutdown which is interesting and not all games are expected to work. But good luck testing random Java games!
Pack NZCV flags
This is the first step towards FEX generating x86 flags in a more optimal way. These flags match the ARM flags fairly closely and can be emulated in a
more optimal way if we pack them together. This is likely what causes the regression in bytemark, but since this is an intermediate step it is
expected to go away with the next optimization after this. Look forward to future optimizations that make this faster!
Remove weak symbol declarations in thunks
A bug that cropped up in thunks has been a crash that occurs when trying to use thunks from Ubuntu's PPA system. This has been a major thorn in FEX's
side for months because once you rebuild the project locally, it would never reproduce. The problem stems from the fact that clang would decide that
it can inline a "weak" symbol if its implementation is visible. This would only occur on Canonical's ARM builders, potentially due to whatever device
they use to compile the code on. This would cause our thunks to crash almost immediately if a user tried them from the PPA system. We have now worked
around this clang quirk and this will now fix thunks when enabled from the PPA system.
Mingw build work
As part of FEX's effort towards supporting running as a WINE dll, we have been slowly adding support for compiling FEXCore as a Windows DLL.
This month we have removed a bunch of Linux assumptions and API usages from FEXCore and moved it to the frontend FEXInterpreter application. In doing
so, FEXCore can now be compiled using llvm-mingw as a WINE specific DLL. This is completely unusable for users today but sets the groundwork towards
what will eventually become a WoW64 integration in the future. We have also added mingw building of FEXCore to our CI so we ensure it doesn't get
broken.
To be clear, even though this work allows us to compile as a Windows DLL, this doesn't allow us to run under Windows. FEX still does a bunch of things
that are Linux specific inside of the code.
ARMEmitter cleanups
Another improvement that doesn't affect our users, but it is good to shout out the improvement for our developers. @Lioncache has spent a good
amount of time this last month adding missing instructions and aliases to our AArch64 code emitter. While our code emitter covers a
decent amount of the AArch64 instruction space, it takes time to ensure full coverage. Whenever we're writing code for our JIT and an instruction is
missing, it slows down whatever we are working on. So kudos for improving our coverage because it makes everyone's lives easier.
Implement missing accept4, recvmmsg, sendmmsg for 32-bit socketcall
In a recent Steam client update, it started using accept4 for some background thing. This would cause it to spam a bunch of logs when failing to
accept some connection. With a simple fix for a few missing system calls, Steam is no longer complaining loudly.
Fix variadic packing in X11 thunking
WINE had broken X11 thunking for all of FEX's history without any indication as to why. We never had time to look in to this but this last month we
finally hit a game that crashed which made this easier to debug. This bug occurred because WINE is one of the few applications that pass more than
seven arguments through a few variadic API calls. This triggered a bug in FEX's variadic repacking code once we started packing the arguments on to
the stack. With this fixed, WINE X11 thunking now works in significantly more games. This means that both OpenGL and Vulkan applications can be
thunked under WINE.
Fixes dead context store elimination pass
This optimization pass removes redundant stores to FEX's CPU context state. While this usually doesn't save much, it can improve performance for some
edge cases in FEX's JIT. While this is a performance optimization, it likely won't affect many things.
Fix 16-bit POPA instruction
This instruction was accidentally zero extending the 16-bit value in to the 32-bit register. We now insert the 16-bits as expected. This fixes an
issue with OpenAL in some cases.
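The difference between the two behaviours can be sketched as follows. The function names are hypothetical and this is plain C++ rather than FEX's IR; it only illustrates zero extension versus a 16-bit insert.

```cpp
#include <cassert>
#include <cstdint>

// Buggy behaviour described above: the 16-bit value replaces the whole
// 32-bit register, so the upper half ends up zeroed.
uint32_t WriteReg16ZeroExtend(uint32_t /*OldReg*/, uint16_t Value) {
  return Value;
}

// Expected x86 behaviour for a 16-bit register write: only the low 16 bits
// change and the upper 16 bits of the 32-bit register are preserved.
uint32_t WriteReg16Insert(uint32_t OldReg, uint16_t Value) {
  return (OldReg & 0xFFFF0000u) | Value;
}
```

For a register holding `0xDEADBEEF`, popping `0x1234` with the insert variant yields `0xDEAD1234`, while the zero-extending variant yields `0x00001234`.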
Raw Changes
-
ARMEmitter
-
Add missing atomic aliases (68cb6e6)
-
Add cinc/cinv/csetm aliases (30ab4d3)
-
Add ngc/ngcs aliases (eebcbfd)
-
Add bfc/bfxil aliases (d2bca9b)
-
Add sbfiz/tst/ubfiz aliases (4681061)
-
Finish off remaining SVE Integer Wide Immediate - Unpredicated categories (2fc6542)
-
Implement cmn alias (1ce0ea8)
-
Arm64
-
Switch to using half barriers (94273fb)
-
Fixes LR corruption in 128-bit divides (5821175)
-
Optimize {Load,Store}ContextIndexed address generation (536b2ed)
-
Only clear icache for code (0674dfa)
-
Emitter: Handle LD1{}/LDFF1{} Vector + Immediate encodings (421214e)
-
Emitter
-
Add remaining missing SVE predicate range assertions (64c7243)
-
Deduplicate some more SVE implementations pt. 2 (0a1820d)
-
Deduplicate some more SVE implementations (9c175da)
-
Reorganize some base opcode and assert locations (0be68a5)
-
Simplify SVE immediate shift helper (e689c6f)
-
Collapse encoding cases for indexed dup (d6697fc)
-
Handle SVE FP convert precision group (e633ef7)
-
Handle SVE FP arithmetic with immediate (predicated) group (754bc18)
-
Handle SVE XAR (842b71c)
-
Add helper...
FEX-2307
Read the blog post at FEX-Emu's Site!
This release we had a bit of a slower month as some larger pieces were being worked on, but we still have some good stuff that is worth talking about.
Implement per-instruction RIP reconstruction
This was a fairly curious bug that FEX encountered. When trying to run the game Ultimate Chicken Horse, the game would crash very early in its startup.
While investigating the game we determined that this was one of the first games we tested that uses Unity Engine's AOT scripting reflection(?) mechanism. This codepath seemingly heavily relies on either tagged pointers or some other
mechanism that causes a SIGSEGV when accessing it the first time. After that point the Unity AOT will catch the SIGSEGV and depending on the RIP of
the instruction, it will change behaviour. One of the problems with FEX is that on synchronous faults like SIGSEGV, we don't yet support full state
reconstruction. Since it seems like this only relies on RIP being correct, we can fairly easily wire this up and get Ultimate Chicken Horse running!
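One way to picture per-instruction RIP reconstruction is a table, recorded while a block is compiled, that maps offsets in the generated host code back to guest instruction addresses. The sketch below assumes such a layout; `RIPEntry` and `ReconstructRIP` are hypothetical names and FEX's real data structures may look quite different.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical per-block table entry: the offset in the generated host code
// at which a guest instruction's translation starts, plus that guest RIP.
struct RIPEntry {
  uint32_t HostOffset;
  uint64_t GuestRIP;
};

// Given the host code offset at which a fault happened, find the guest
// instruction whose translation covers it. Table must be sorted by HostOffset.
uint64_t ReconstructRIP(const std::vector<RIPEntry>& Table, uint32_t FaultOffset) {
  assert(!Table.empty() && FaultOffset >= Table.front().HostOffset);
  // Find the first entry starting strictly after the fault; the entry just
  // before it is the instruction that was executing.
  auto It = std::upper_bound(
    Table.begin(), Table.end(), FaultOffset,
    [](uint32_t Offset, const RIPEntry& Entry) { return Offset < Entry.HostOffset; });
  return std::prev(It)->GuestRIP;
}
```

A signal handler could then report the returned guest RIP to the application instead of a block-level approximation.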
AVX work completed!
This last month FEX has done the last remaining work to implement AVX. This month the remaining SSE4.2 instructions were finished,
and the prerequisite XSAVE and XRSTOR instructions were implemented. Although the feature is effectively complete, we aren't enabling the
CPUID bit yet. We want to investigate a potential crash that has cropped up in Java games due to the extension first, and additionally we want
to finish up AVX2 work and enable them both in one step! Next month is looking to be the first version with AVX and AVX2 support in the source.
Fix 32-bit robust futex fetching
This issue has been a thorn in our sides for quite a while now. Usually this only ever manifested as an issue if the user was running Steam using
FEX's official PPA binaries in their setup. Once the user tried running Steam, it would
crash with a really obscure message about "Fatal error: futex robust_list not initialized by pthreads." This was something that would then never
reproduce if the code was rebuilt locally.
With a bit of poking around and using a local pbuilder version of FEX we were finally able to reproduce the error. Turns out FEX was writing a 64-bit
pointer back in to the result when the application tried querying the robust list pointer, overwriting part of the stack and corrupting its data.
This falls under one of the circumstances of "How did this ever work!?" but now with it resolved, theoretically Steam should finally work for our
users that are using the PPA build of FEX. Enjoy~!
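The class of bug can be sketched like this. The struct and function names are hypothetical, and the real fix lives in FEX's 32-bit syscall handling, but the principle is the same: a 32-bit guest only provides four bytes for the pointer, so only four bytes may be written back.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for a 32-bit guest's stack: the pointer slot the guest passed in
// is only four bytes wide and is followed directly by other live data.
struct GuestStack32 {
  uint32_t HeadPtrSlot;
  uint32_t NeighbouringData;
};

// Store the robust list head for a 32-bit guest. Writing the host's full
// 64-bit pointer here would spill four bytes into NeighbouringData.
void StoreRobustListHead32(void* GuestSlot, uint64_t HostHeadPtr) {
  assert(HostHeadPtr <= UINT32_MAX); // a 32-bit guest's list lives below 4 GiB
  const uint32_t Narrowed = static_cast<uint32_t>(HostHeadPtr);
  std::memcpy(GuestSlot, &Narrowed, sizeof(Narrowed));
}
```

Whether the neighbouring bytes hold anything important depends on how the compiler laid out the caller's stack, which is consistent with the crash showing up in one build and not in another.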
Fix application hangs due to mutexes being locked on forks
This has been a very spicy bug that has been haunting FEX for years at this point. Whenever an application in modern day wants to execute a process it
will use a combination of fork and execve. Fork might end up being a vfork, or might end up being a clone syscall that does the same thing. Regardless
fork when executing in a threaded environment has some very strict requirements that it basically can only do an execve afterwards. vfork even adds an
additional restriction that it can't corrupt the stack at all because it's sharing memory space.
The problem with this approach is that even if the application is only ever going to call execve after the fact, FEX needs to do a bunch of
bookkeeping or additional JIT emission and execution. This causes the problem that FEX's mutexes might end up being in an unknown state going in to a
fork, which will cause this new child process to hang indefinitely on the mutex.
To work around this issue, FEX will now globally lock all mutexes that matter, do the fork, and then immediately unlock the mutexes on the parent
side. On the child process FEX needs to be a bit mean to these mutexes, resetting them to zero to ensure no thread is holding the mutex. While this is
fairly heavy-handed this dramatically reduces how frequently FEX hangs when fork is used.
Specifically Steam tended to launch a bunch of background processes which would hang indefinitely, causing Proton games or downloads to never
continue. This should pretty much entirely be fixed!
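The lock, fork, unlock-or-reset pattern described above can be sketched as follows. This is a simplified illustration, not FEX's code: `ForkSafeMutex`, the two global mutexes, and `ForkWithLocksHeld` are hypothetical, and the lock is a bare spinlock so that "unlocked" is literally the value zero.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical minimal lock whose whole state is a single word.
struct ForkSafeMutex {
  std::atomic<uint32_t> State{0};

  void lock() {
    uint32_t Expected = 0;
    while (!State.compare_exchange_weak(Expected, 1, std::memory_order_acquire)) {
      Expected = 0; // compare_exchange overwrote it with the observed value
    }
  }
  void unlock() { State.store(0, std::memory_order_release); }
};

// Hypothetical stand-ins for "all mutexes that matter".
ForkSafeMutex CodeCacheMutex;
ForkSafeMutex AllocatorMutex;

pid_t ForkWithLocksHeld() {
  // Take every lock so no other thread is mid-update when the fork happens.
  CodeCacheMutex.lock();
  AllocatorMutex.lock();

  const pid_t Pid = fork();
  if (Pid == 0) {
    // Child: only the forking thread exists here, so force the state back
    // to zero to guarantee nothing appears to hold the mutex.
    CodeCacheMutex.State.store(0);
    AllocatorMutex.State.store(0);
  } else {
    // Parent (or fork failure): release normally, in reverse order.
    AllocatorMutex.unlock();
    CodeCacheMutex.unlock();
  }
  return Pid;
}
```

After `ForkWithLocksHeld()` returns in the child, taking either mutex again succeeds immediately instead of waiting on a thread that no longer exists.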
Stop using faccessat2 to emulate faccessat
This was an oops on our part. faccessat2 was added in Linux kernel 5.8, so if your device was running an older kernel this syscall would /never/ work.
We didn't notice this since most of our devices are running a new enough kernel that faccessat2 just worked.
Thanks to the user that found this problem!
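A minimal sketch of the safer approach, assuming a Linux host: forward the guest's faccessat to the host's raw faccessat syscall, which has existed far longer than faccessat2. `EmulateFaccessat` is a hypothetical name and FEX's real handler also has to deal with rootfs path redirection, which is omitted here.

```cpp
#include <cassert>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Emulate the guest's faccessat syscall with the host's raw faccessat
// syscall. The raw syscall takes no flags argument (that is what faccessat2
// added), so this also works on kernels older than 5.8.
// Returns 0 on success, or -1 with errno set, as the libc syscall() wrapper does.
long EmulateFaccessat(int DirFd, const char* Path, int Mode) {
  assert(Path != nullptr);
  return ::syscall(SYS_faccessat, DirFd, Path, Mode);
}
```

For example, `EmulateFaccessat(AT_FDCWD, "/", F_OK)` checks that the root directory exists without depending on a newer kernel.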
Handle xattr syscalls with overlayfs rootfs
Turns out that FEX had missed the various syscalls that access files to get xattr information. This was causing weird failures where some applications
would say that a file doesn't exist purely because it was in the rootfs overlay only.
Sadly Linux doesn't support *at variants of these syscalls so they aren't quite as fast as native execution, but that's fine.
Fix conflicting ARM64 register allocation
A couple months ago we added one more register to our register allocation for slightly more optimal register allocation. This broke a game called
Osmos under FEX. This is purely a bug but in resolving it, we likely fixed crashes in various
applications that we didn't notice before. Oops!
RootFS additions
This month we have a couple new rootfs images on our server that have been hotly requested! We now have an ArchLinux rootfs image and a Fedora 38
rootfs image. These haven't been as thoroughly tested as our Ubuntu images so if you find any problems with them, make sure to let us know on our
Discord.
Raw Changes
-
Arm64
-
Fixes paranoidtso option for CPUs that support LRCPC/2 (e5189d6)
-
Fixes GPR pair allocation to get one pair back (16f7002)
-
Fixes register pair conflict. (7c47296)
-
Context
-
Removes dead AddVirtualMemoryMapping function (c3e123d)
-
Emitter
-
Adds support for CSSC (8047007)
-
External
-
Update jemalloc trees (f872199)
-
jemalloc
-
Updates external jemallocs (1a4d5a1)
-
Externals
-
Update fmt to 10.0.0 (7ee6fc0)
-
FEXServerClient
-
Ensure server socket is created with SOCK_CLOEXEC (e86a792)
-
FHU
-
Workaround libstdc++ version 13+ bug (8a4c5bc)
-
IR
-
Move VPCMPESTRX REX handling to OpcodeDispatcher (f39163b)
-
Pad IROp_Header to be 32-bit in width (fe06f1b)
-
JIT
-
Implement support for per-instruction RIP reconstruction (9dcc1de)
-
Linux
-
Fixes hangs due to mutexes locked while fork happens. (e72fa02)
-
Handle xattr syscalls with emulated paths. (5a53931)
-
Stop using faccessat2 for faccessat emulation (f444b03)
-
Remove warning that isn't necessary anymore (cac7985)
-
Optimize CalculateHostKernelVersion (1506a19)
-
OpcodeDispatcher
-
Ensure MXCSR is saved/restored with FXSAVE/FXRSTOR (66d4206)
-
Handle XSAVE/XRSTOR (a082161)
-
Scripts
-
Disable using catchsegv if it doesn't exist (d6c9b54)
-
VectorFallbacks
-
Fix PCMPSTR fallback ZF/SF flag setting (9b5e1c4)
-
Misc
-
Some small fixes for android building (d2032da)
-
Move config layers to the frontend (2997257)
-
unittests
-
Add include search path for asm tests (cb8bf1a)
-
x32
-
Thread
-
Fixes robust futex fetching (e652399)
FEX-2306
Read the blog post at FEX-Emu's Site!
Another interesting month of changes for FEX-Emu! While this release is shorter than last, this also only has a month of work rather than two. We had
some great work done this month, including a bunch of plumbing that most people won't notice. Let's see what changed!
Adds support for hardware TSO memory emulation prctl
Emulating the x86 memory model is the number one thing that slows down FEX emulation today. Apple Silicon supports this memory model in hardware which
is why Rosetta on macOS can get amazing performance. With some recent changes from the
Asahi developers, FEX can ask the hardware to enable the TSO emulation bit. If the kernel reports back that hardware TSO memory is enabled, FEX can
take a more lax approach to its memory emulation, getting an automatic speedup for Asahi systems.
Additionally not only is this a speed-up, it's required for correct emulation. When this feature isn't supported by the hardware, FEX needs to emulate
the memory model using atomics and LRCPC instructions. This absolutely demolishes performance so it is usually recommended to disable the emulation to
get "free" performance. The issue with this is that it can cause instability and crashes in the most awkward and peculiar of ways. We even found out in this last
month that Unity games with their complex buffer management are highly likely to crash due to old cachelines of data hanging around. The only fix is
to emulate the memory model using hardware TSO flags or our atomic/LRCPC path. Sadly, ARM Cortex's hardware LRCPC implementation is barely any
faster than atomics.
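To give a feel for what the software path has to guarantee, here is a small sketch in portable C++ rather than JIT output. It treats every guest store as a release store and every guest load as an acquire load, which is the usual way to get x86-like ordering on a weakly ordered CPU. `GuestStore`, `GuestLoad`, and `RunMessagePassing` are hypothetical names for this example only.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

std::atomic<uint32_t> Payload{0};
std::atomic<uint32_t> Ready{0};

// Guest "mov [mem], reg": under x86-TSO earlier stores may not be reordered
// after this one, which a release store provides on Arm64.
void GuestStore(std::atomic<uint32_t>& Addr, uint32_t Value) {
  Addr.store(Value, std::memory_order_release);
}

// Guest "mov reg, [mem]": later loads may not be reordered before this one,
// which an acquire load provides.
uint32_t GuestLoad(const std::atomic<uint32_t>& Addr) {
  return Addr.load(std::memory_order_acquire);
}

// Classic message passing: the consumer must never see Ready == 1 together
// with a stale Payload. With the orderings above it always reads 42.
uint32_t RunMessagePassing() {
  Payload.store(0);
  Ready.store(0);
  std::thread Producer([] {
    GuestStore(Payload, 42);
    GuestStore(Ready, 1);
  });
  while (GuestLoad(Ready) == 0) {
    std::this_thread::yield();
  }
  const uint32_t Seen = GuestLoad(Payload);
  Producer.join();
  return Seen;
}
```

If both helpers used relaxed ordering instead, a weakly ordered CPU would be allowed to let the consumer observe `Ready == 1` with the old payload, which is the kind of stale data problem described above. With hardware TSO enabled, plain loads and stores already behave this way, so the extra ordering cost goes away.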
Steamwebhelper crash fix
New beta versions of Steam have started relying on AT_EXECFN existing. FEX didn't previously emulate this auxv value which was causing it to crash. With this fixed, steamwebhelper is now working again.
More AVX work!
This has been a long time coming and we're almost there finally. After these changes FEX only needs to fix a few implementation bugs in the string
operations, and implement the XSAVE instructions before allowing AVX emulation. In addition to that, FEX is also almost able to support AVX2, with the
only instructions that still need to be implemented being the gather load instructions!
Implements support for XGETBV
This is a fairly simple instruction as it lets the application query which CPU features are enabled. Necessary for an application to check before
enabling any AVX usage.
Handle PCMPESTRM, PCMPISTRM and AVX variants
These are the remaining string instructions that FEX has implemented. While mostly implemented there are still a couple of edge case behaviours that
aren't quite correct and just need to be fixed.
Implement support for deferring asynchronous signals
This change has been a long time coming to make FEX's JIT faster in the face of handling asynchronous signals. The issue is that FEX needs to enter
code regions that are effectively "uninterruptible" until they are complete. This is basically a reentrancy problem where a piece of code executing could
lock a mutex, or non-atomically update a container's data, and then when a signal occurs it will jump out of the code and potentially come back to this
corrupted state.
As an initial workaround to this problem, FEX would just disable all signals in each of these "signal-deferring regions." This had the overhead that
every region would have a system call going in to it and then another one coming out of it. If these regions were entered infrequently then
it would be a non-issue; but as is commonly the case, FEX's JIT needs to be wrapped in these signal blocks. If a game is executing a bunch of code,
this means we can be doing thousands of additional system calls per second which adds up as direct overhead.
With this new change, FEX marks that it is in a signal-deferring region with some very cheap memory accesses and if a signal doesn't occur the
overhead is negligible. In the case that a signal does occur, it will get stored to a queue, FEX will finish its signal-deferring region, and then
come back to handle the signal.
This mostly works because asynchronous signals don't have guarantees about the timeliness of the signal being delivered. Sadly this can result in
signal queue depths being subtly incorrect but we are monitoring the situation to know if any game is affected. All in all this finally makes it so
FEX can be straced without being overwhelmed and improves stutter problems!
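The deferral scheme can be sketched roughly as below. All names are hypothetical and the "queue" holds a single signal for brevity, which is a simplification of what is described above: entering a region is a plain memory write, and the handler either delivers immediately or records the signal for later.

```cpp
#include <cassert>
#include <csignal>

// State touched from the signal handler must be async-signal-safe.
volatile std::sig_atomic_t DeferDepth = 0;    // >0 while inside a deferring region
volatile std::sig_atomic_t PendingSignal = 0; // hypothetical one-entry "queue"
volatile std::sig_atomic_t HandledCount = 0;

// Stand-in for delivering the signal to the guest.
void HandleGuestSignal(int) { HandledCount = HandledCount + 1; }

// Host handler: deliver immediately, or remember the signal for later.
extern "C" void HostSignalHandler(int Signal) {
  if (DeferDepth > 0) {
    PendingSignal = Signal;
    return;
  }
  HandleGuestSignal(Signal);
}

// Entering a region is a cheap memory write instead of a sigprocmask syscall.
void EnterDeferRegion() { DeferDepth = DeferDepth + 1; }

// Leaving the outermost region delivers anything that arrived in the meantime.
void LeaveDeferRegion() {
  assert(DeferDepth > 0);
  DeferDepth = DeferDepth - 1;
  if (DeferDepth == 0 && PendingSignal != 0) {
    const int Signal = PendingSignal;
    PendingSignal = 0;
    HandleGuestSignal(Signal);
  }
}
```

With `HostSignalHandler` installed for `SIGUSR1`, raising the signal between `EnterDeferRegion()` and `LeaveDeferRegion()` leaves `HandledCount` untouched until the region is left. Because this sketch keeps only one pending slot, a second signal arriving inside the same region would overwrite the first; a real implementation needs a proper queue.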
Grand Theft Auto 5 AVX fix
FEX was accidentally reporting support for BMI1 and BMI2 CPU instructions. These extensions require that AVX is implemented.
This was causing Grand Theft Auto 5 to crash early trying to use AVX. We will now only report these extensions if emulated AVX is supported, which fixes this game.
Make vfork wait for the process to exit
FEX's previous implementation of vfork actually behaved like fork. The difference between these two syscalls is fairly subtle. In the case of vfork, the parent process will end up sleeping until the child either exits or executes execve.
We were instead treating this like a fork, where the parent continues immediately without waiting. While no known issues were encountered, it is good
to ensure this behaviour is correct for future work.
Getdents optimization
This classic syscall is used for querying directory contents. FEX needs to emulate this syscall since 64-bit applications now use a new syscall called
getdents64. FEX's original implementation was fairly slow due to a misunderstanding as to how this syscall worked. It would create a temporary working
buffer and copy data around a couple of times. The new implementation is able to use the buffer provided by the application and do some
minor fixups, making the overhead fairly light. This improves performance when an application is doing heavy folder scanning, which mostly means
it improves Proton startup time.
Minor optimizations
There were a handful of minor optimizations that improve performance so minorly that it falls within noise, but is nice to have.
Optimize ARM64 thunk trampolines
This is a very small optimization that changes an indirect load in to a PC relative load, removing a single data dependency.
Minor x87 FCMOV optimization
FEX was duplicating a mask from a GPR in to a vector register using two instructions and now it only uses one instruction.
Optimize ADC/ADD OF flag calculation
This was a small mistake where a bitwise negate was using two instructions instead of one.
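For reference, one common way to compute the overflow flag for an ADD involves a bitwise negate, shown in the sketch below. The exact instruction sequence FEX emits is not given here, so treat this as an illustration of the formula only; `AddOverflow32` is a hypothetical name. Arm64 does have instructions such as EON and BIC that fold a negate into another logical operation, which is one way such a negate can cost a single instruction.

```cpp
#include <cassert>
#include <cstdint>

// Overflow flag for a 32-bit ADD: set when both sources share a sign and the
// result's sign differs from it.
bool AddOverflow32(uint32_t Src1, uint32_t Src2) {
  const uint32_t Result = Src1 + Src2;
  // ~(Src1 ^ Src2): top bit set when the source signs match.
  // (Src1 ^ Result): top bit set when the result's sign differs from Src1.
  const uint32_t Overflow = ~(Src1 ^ Src2) & (Src1 ^ Result);
  return (Overflow >> 31) != 0;
}
```

For example, `0x7FFFFFFF + 1` overflows in signed arithmetic, while `0xFFFFFFFF + 1` (that is, -1 + 1) does not.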
Optimize EFLAG unpacking
Each time FEX was unpacking the EFLAG register it was using four instructions per bit of the flag. This has now been improved to only two, cutting
flag unpacking to 82% of its original size in some edge cases.
Supported emulated Linux kernel version up to 6.2
FEX used to max out the reported kernel version up to 5.18. Now we can report up to 6.2 with this change. 6.3 is going to be harder since it
introduces a new prctl that FEX needs to work around.
Video game showcase
As said previously, Unity engine games have issues without TSO emulation. Here is a clip of Hollow Knight running under FEX at full speed on a
Lenovo X13s. Even with the overhead of emulating the x86-TSO memory model, this game runs remarkably well. With x86-TSO emulation disabled this game
would have crashed a few seconds in.
Raw Changes
-
AOTIR
-
Stop passing a mutex around. It's already guarded (f47caf4)
-
ARM64
-
Fixes SRA disabled codepath (dc65a5e)
-
CPUID
-
Only enable BMI1 and BMI2 if AVX is supported (cc7a56b)
-
Context
-
Remove debug namespace (5be798e)
-
ELFCodeLoader
-
Fixes missing AT_EXECFN (00dc373)
-
FEXConfig
-
Removes Emulated CPU cores option (1a9b6a8)
-
FEXCore
-
Implements support for xgetbv (737f917)
-
Support Wine syscalls (77e8be1)
-
Convert Core and Telemetry over to fextl::file::File (5674d3a)
-
Adds support for hardware x86-TSO prctl (ed69eb9)
-
FEXLoader
-
Allow simulated kernel version up to 6.2 (ada226b)
-
FEXRootFSFetcher
-
Support rolling release distros (9473025)
-
IRDumper
-
Fixes ssa number in arguments. (0f4a5ed)
-
InstallFEX
-
Updates helper install script for Ubuntu 23.04 (02f15f4)
-
Linux
-
Make vfork act more similar to how it should. (95b7592)
-
OpcodeDispatcher
-
Optimize ADC/ADD OF flag calculation (5b58082)
-
Optimize EFLAG unpacking (69181d4)
-
Handle PCMPESTRM/VPCMPESTRM (182010c)
-
Handle PCMPISTRM/VPCMPISTRM (https://...
FEX-2305
Read the blog post at FEX-Emu's Site!
Welcome back to another release of FEX-Emu! We had cancelled last month's release due to a large amount of code churn happening. In order to ensure
the highest quality of stability we were forced to do so. Now we're back with an even lengthier release this month, so buckle up because there were a
large number of changes that happened.
More AVX Work!
These last two months have been a wild ride towards implementing AVX. @Lioncache has been burning down a ton of
instructions to get everything in place for AVX emulation.
New instructions implemented
- PCMPISTRI/VPCMPISTRI
- VPMASKMOVD/VPMASKMOVQ
- VCVTPD2PS/VCVTPS2PD
- VCVTSD2SS/VCVTSS2SD
- PCMPESTRI/VPCMPESTRI
- VMPSADBW
- VPSLLVD/VPSLLVQ
- VPSRLVD/VPSRLVQ
- VCVTSI2SD/VCVTSI2SS
- VPINSRB/VPINSRD/VPINSRQ/VPINSRW
- VPSADBW
- VTESTPD/VTESTPS
- VPMADDUBSW
- VPMOVMSKB
- VMASKMOVPD/VMASKMOVPS
That's a whole bunch of instructions implemented! We have now nearly implemented all the instructions required for AVX.
The two major instructions remaining before AVX can be exposed are the SSE4.2 instructions PCMPISTRM and PCMPESTRM. This is because these two
instructions also have AVX versions, so they are required in order to support AVX.
We are getting really close and once this feature is done, we can quickly move on to finishing support for AVX2, F16C, and the fused
multiply-accumulate extensions. At that point our CPU emulation will be effectively "feature-complete" for everything that games will care about in
the short-term. Exciting times!
llvm-mingw and WINE support
This is a very big change that has been coming down the pipe for a while now. We have been mostly working behind the scenes to get FEX-Emu wired up so
that it can be compiled as a Windows shared library. This last month is where this work has finally come to a head and most of the work is in place
for this.
How this works is that FEX-Emu compiles a shared library and a static library called FEXCore. This is where all the CPU emulation
happens, and it tries to be mostly OS agnostic, while everything that is Linux specific lives in the frontend called FEXInterpreter. It is FEXCore
that can now be compiled as a Windows AArch64 PE library. While this isn't useful to end users today, it means that WINE can link to this
library for emulating x86/x86-64 on AArch64 platforms. It should be noted that there are still some Linux assumptions strewn about the code, so this
isn't a generic solution for emulation on a true Windows platform. We're writing this support specifically for WINE today.
Converting away from C++ containers that allocate memory
This is the significant change that caused us to cancel last month's release. While @Neobrain was writing code to
support 32-bit library thunking, they had discovered a very big problem. FEX-Emu has long overridden the glibc memory allocation routines in order for
us to ensure that FEX can allocate memory when emulating 32-bit applications. We discovered that this overriding also extends to system libraries that
we load in after the fact. This meant that any time libGL would allocate memory, it would end up being a 64-bit pointer and there was nothing we could
do about it.
The workaround for this problem is to stop overriding the system allocators, which will allow shared libraries to allocate memory that can safely be
used by the 32-bit guest. But this also has the problem that FEX would then run out of memory when executing 32-bit applications. This is due to a
quirk that FEX-Emu needs to allocate all the memory on the system before executing 32-bit applications.
The new workaround is to replace usage of every C++ container that allocates memory with FEX's own container that will use its own allocator. This was
an exceedingly invasive change that touches almost everything in our codebase. With the pain done, FEX now can use its own internal allocators while
system libraries will use the regular glibc allocator as expected. See more about the limitations of this in our
documentation.
Re-enable glibc allocator hooking again
Okay, the previous paragraph was a ruse; FEX-Emu needed to actually override the glibc allocator again. In this case FEX-Emu will actually have three
allocators active at any given moment.
- FEX-Emu uses jemalloc for its internal allocator.
- The system allocator is overridden with another jemalloc allocator.
- The guest application's glibc allocator is untouched.
The problems start occurring when a pointer is shared between thunks and the guest application. If one allocator tries to free a pointer from a
different allocator then fireworks occur. The way around this is to use a jemalloc function to determine if it owns the pointer and then choose which
allocator frees the pointer. This is particularly painful with X11 thunking because pointers are passed between client and server in
a very laissez-faire fashion. This may not stay around in the future but it is a necessary evil for now.
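The ownership check described above can be sketched roughly as follows. This is only an illustration and not FEX's actual code: `ToyArena`, `Owns`, and `FreeFromAnywhere` are hypothetical names, and a fixed bump buffer stands in for jemalloc so that the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical stand-in for an allocator that can answer "do I own this pointer?".
// A real implementation would ask jemalloc; this toy hands out slices of one buffer.
class ToyArena {
public:
  void* Allocate(std::size_t Size) {
    // Round up so every allocation stays 16-byte aligned.
    Size = (Size + 15) & ~std::size_t{15};
    if (Size > sizeof(Buffer) - Offset) {
      return nullptr;
    }
    void* Result = Buffer + Offset;
    Offset += Size;
    ++LiveAllocations;
    return Result;
  }

  // Ownership test: does the address fall inside this arena's buffer?
  bool Owns(const void* Ptr) const {
    const auto Addr = reinterpret_cast<std::uintptr_t>(Ptr);
    const auto Base = reinterpret_cast<std::uintptr_t>(Buffer);
    return Addr >= Base && Addr < Base + sizeof(Buffer);
  }

  void Free(void* Ptr) {
    assert(Owns(Ptr) && "pointer must come from this arena");
    --LiveAllocations; // A bump arena never reuses memory; we only track the count.
  }

  int LiveAllocations = 0;

private:
  alignas(16) unsigned char Buffer[4096];
  std::size_t Offset = 0;
};

enum class FreedBy { Arena, System };

// Route the free to whichever allocator actually owns the pointer.
FreedBy FreeFromAnywhere(ToyArena& Arena, void* Ptr) {
  if (Arena.Owns(Ptr)) {
    Arena.Free(Ptr);
    return FreedBy::Arena;
  }
  std::free(Ptr);
  return FreedBy::System;
}
```

The key design point is that the caller never needs to remember where a pointer came from. The free path asks the allocator and dispatches accordingly, which is what makes loosely shared pointers survivable.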
JIT Optimizations and improvements
Reclaim static assigned registers on 32-bit
This allows us to use 8 more general purpose registers and 8 more floating point registers with 32-bit applications. Depending on the game this can
improve performance by a decent margin. We have seen upwards of 20% performance uplift in various games due to it.
Fix Visual C++ redistributable crashing
This was a really annoying bug, where every.single.time. that Proton would run, it would try to install the C++ runtime at least four times. The user
would be required to kill the processes after they were installed. This was fairly egregious because we had thought it was fixed months ago and didn't
realize that it wasn't actually fixed. Depending on the version of the Visual C++ redistributable and Proton it would still occur.
Root causing this issue, it turns out that the redistributable uses Windows' structured exception handling to catch the case when it passes a null pointer
to strlen, which results in a SIGSEGV on the Linux side. FEX was incorrectly saving and restoring state when this occurred, which caused it to
infinitely loop and crash. Now that this is fixed, these install correctly and Proton doesn't try doing it on every single run.
Implement REP MOVS as a memcpy
This instruction behaves like a fairly fast memory copy on the CPU. We now convert this over to an internal memory copy operation.
Similar to last month where we converted an instruction to a memset, implementing this instruction as an IR operation improves its performance many
times over. In real games this usually translates to an FPS improvement of a few percent, which is a nice uplift.
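To illustrate why the conversion helps, here is a minimal sketch comparing a byte-at-a-time emulation of REP MOVSB with a single bulk copy. The `GuestRegs` struct and both function names are hypothetical, and the sketch assumes a forward copy (direction flag clear) over non-overlapping buffers. FEX's real IR operation has to handle more cases than this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical guest state: only the registers that REP MOVSB touches.
struct GuestRegs {
  std::uint8_t* RSI; // source pointer
  std::uint8_t* RDI; // destination pointer
  std::uint64_t RCX; // remaining iteration count
};

// Naive emulation: one loop iteration per byte, updating all three registers each time.
void RepMovsbLoop(GuestRegs& Regs) {
  while (Regs.RCX != 0) {
    *Regs.RDI++ = *Regs.RSI++;
    --Regs.RCX;
  }
}

// Bulk version: one memcpy, then fix up the registers to their final values.
// Only equivalent to the loop when source and destination do not overlap.
void RepMovsbBulk(GuestRegs& Regs) {
  const std::size_t Size = static_cast<std::size_t>(Regs.RCX);
  assert(Regs.RSI + Size <= Regs.RDI || Regs.RDI + Size <= Regs.RSI);
  std::memcpy(Regs.RDI, Regs.RSI, Size);
  Regs.RSI += Size;
  Regs.RDI += Size;
  Regs.RCX = 0;
}
```

Both versions leave the guest registers in the same final state, but the bulk version lets the host's optimized copy routine do the work instead of a per-byte loop.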
Fix restoring of AVX state
While not actually being utilized today (except due to a bug), @AndreRH found out that we were accidentally failing to
restore AVX register state when a signal handler returned. It's surprising that this wasn't noticed earlier but it could have resulted in some really
bad floating point state.
Remove double syscall overhead on filesystem accesses
When FEX opens a file, it first needs to check whether the file exists in the overlayfs-style rootfs image we provide. If the
file exists there, we redirect the open to the rootfs instead of the host filesystem. We had an issue where, if the file didn't exist, we
would accidentally check for it again before accessing the host file. This meant that one syscall turned into three. With this fix in place
we are now only converting it into two.
If you're running a rootfs image off of a particularly slow drive (or a network share) then this can shave a decent amount of time off of load times.
This was particularly noticeable when running a Proton game under Steam because they will access a ton of files before starting up.
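The syscall counts above can be modelled with a small sketch. Everything here is hypothetical (`ToyFS`, `OpenOld`, `OpenNew`, and the rootfs prefix are illustration only), and a counter stands in for real syscalls. It only shows how the redundant probe on the miss path turns one guest open into three host operations instead of two.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy filesystem: a set of existing paths plus a counter standing in for host syscalls.
struct ToyFS {
  std::set<std::string> Files;
  int SyscallCount = 0;

  bool Exists(const std::string& Path) {
    ++SyscallCount;
    return Files.count(Path) != 0;
  }

  bool Open(const std::string& Path) {
    ++SyscallCount;
    return Files.count(Path) != 0;
  }
};

// Behaviour as described before the fix: a rootfs miss was probed a second time.
bool OpenOld(ToyFS& FS, const std::string& RootFS, const std::string& Path) {
  const std::string Redirected = RootFS + Path;
  if (FS.Exists(Redirected)) {
    return FS.Open(Redirected);
  }
  FS.Exists(Redirected); // Redundant second probe of the rootfs.
  return FS.Open(Path);
}

// Behaviour after the fix: one probe, then open at whichever location was chosen.
bool OpenNew(ToyFS& FS, const std::string& RootFS, const std::string& Path) {
  assert(!Path.empty() && Path[0] == '/'); // This sketch only handles absolute guest paths.
  const std::string Redirected = RootFS + Path;
  if (FS.Exists(Redirected)) {
    return FS.Open(Redirected);
  }
  return FS.Open(Path);
}
```

On a slow or networked drive every extra probe is a round trip, which is why removing even one per file access adds up during startup.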
Adds default DRM ioctl interface
This is a fairly basic change. Instead of breaking when hitting an unknown ioctl, pass it to the kernel and hope for the best. This is mostly so Asahi
and other drivers can test things under FEX without pushing patches to us for downstream support.
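A minimal sketch of this "known handlers first, otherwise pass through" pattern might look like the following. The `IoctlDispatcher` type and its members are hypothetical names for illustration. In a real emulator the passthrough handler would forward the request to the host kernel, whereas here it is just a callback.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <unordered_map>

using IoctlHandler = std::function<int(int FD, std::uint32_t Cmd, void* Arg)>;

struct IoctlDispatcher {
  // Handlers for ioctls the emulator knows how to translate.
  std::unordered_map<std::uint32_t, IoctlHandler> Known;
  // Fallback used for everything else instead of failing outright.
  IoctlHandler Passthrough;

  int Handle(int FD, std::uint32_t Cmd, void* Arg) const {
    const auto It = Known.find(Cmd);
    if (It != Known.end()) {
      return It->second(FD, Cmd, Arg);
    }
    assert(Passthrough && "a passthrough handler must be installed");
    return Passthrough(FD, Cmd, Arg); // Unknown ioctl: forward unchanged and hope for the best.
  }
};
```

The trade-off is that an unknown ioctl whose argument layout differs between guest and host is forwarded untranslated, which is acceptable for driver bring-up but not a substitute for proper support.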
Add support for thunking Wayland
This doesn't affect most users today, but adding support for thunking Wayland means that in the future applications that use it can sanely use this thunk.
SDL applications today might be able to take advantage of it but it is fairly fresh. We're looking forward to the inevitable Wayland and WINE
utilization to let things move away from X11.
Fixed 32-bit clock_nanosleep
There was a fairly nasty implementation detail where a 32-bit application trying to sleep with this syscall would actually keep a CPU core at 100% usage.
While fairly uncommon, this fix allows the game Alwa's Awakening to not burn a CPU core while running.
Add a bunch of functions to FEX's ARMEmitter
Not really a user facing feature but our code emitter has gained a bunch of new instruction support. This will be used in the future for our AVX2
implementation and various things. So it's good to have.