
Releases: FEX-Emu/FEX

FEX-2303

06 Mar 17:24

Read the blog post at FEX-Emu's Site!

Oh jeez, another month already? I guess it's time for another FEX-Emu release. Let's pick a commit, spin the roulette wheel, and hope for the best!
Surely that's how releases work?

Rootfs images are now on a new CDN!

While this is something that doesn't directly impact FEX when running applications, it's a problem that most of our users need to deal with when
installing FEX. Our previous CDN, which was hosting our x86 images, had a fair number of problems that couldn't be solved. The main issue that affected
users was that it was slow to download the images and, depending on where you were in the world, the connection could be unstable. This resulted in
gigabyte sized files taking forever to download or never finishing at all!

This month we have switched our CDN to a service that has worldwide data replication across multiple data servers. This improves the speed at which
users can download our prebuilt images. Going from an average of 20MB/s to over 300MB/s is a significant boost. In addition to that, the connection is
significantly more stable to the far corners of the world. Something that doesn't affect users at all is that this new CDN is also
significantly lower cost than what we were using before. This was unexpected, but it's a nice bonus that this CDN is an improvement in every regard,
including cost.

This month's code changes

With that out of the way, onward to this month's changes.

Optimize REP STOS instruction into an inline memset

This is an instruction that x86 offers that behaves similarly to a memory set operation. It behaves slightly differently since it allows you to set
the memory by element size, and you can also choose the direction in which the memory is set. In particular this instruction tends to get used for
zeroing out memory. The latest x86 CPUs have even optimized this instruction to be as fast as possible. Previously FEX had decomposed this instruction
into a complex series of code blocks that was inefficient for our JIT and everything surrounding it. Now we instead convert this to a single IR
operation called MemSet which exposes the semantics of how the instruction works, allowing our IR to be cleaner and the backend to decompose it in
a more optimal fashion. Currently we emit a fairly trivial loop that handles this memory set operation. ARM has recently announced that future CPUs
are going to support a memory set instruction that is very similar to the 8-bit REP STOS which will make this implementation even faster!
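
To make the element size and direction behaviour concrete, here is a minimal Python model of what REP STOS does architecturally. It is only an illustration of the instruction's semantics, not FEX's MemSet IR operation or the loop its JIT emits; the function name and the bytearray standing in for guest memory are assumptions made for this sketch.

```python
def rep_stos(memory, rdi, rcx, value, elem_size, direction_flag=False):
    """Model of REP STOS{B,W,D,Q}: store `value` RCX times, stepping RDI.

    `memory` is a bytearray standing in for guest memory (an assumption of
    this sketch). Returns the final (rdi, rcx), since the instruction
    updates both registers.
    """
    if elem_size not in (1, 2, 4, 8):
        raise ValueError("element size must be 1, 2, 4 or 8 bytes")
    # x86 is little-endian; truncate the value to the element width.
    pattern = (value & ((1 << (8 * elem_size)) - 1)).to_bytes(elem_size, "little")
    # The direction flag selects whether RDI walks up or down through memory.
    step = -elem_size if direction_flag else elem_size
    while rcx != 0:
        if rdi < 0 or rdi + elem_size > len(memory):
            raise IndexError("store outside of the modelled memory")
        memory[rdi:rdi + elem_size] = pattern
        rdi += step
        rcx -= 1
    return rdi, rcx


if __name__ == "__main__":
    buf = bytearray(b"\xff" * 8)
    rep_stos(buf, 0, 8, 0, 1)  # the common "zero this buffer" case
    print(buf.hex())
```

With the 1-byte element size and the direction flag clear, this is exactly a forward memset, which is the case a dedicated memory set instruction can take over.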

As seen by this graph, FEX is nowhere near a native implementation. It's important to note that even without writing "optimal" codegen, this change
has still given FEX up to an 11% performance improvement on its implementation. This work was primarily focused on improving the IR, which means we can now
optimize the code that the JIT emits significantly more easily! Getting closer to native is likely something to come in the
future.

Add config option to hide hypervisor CPUID bit

We encountered the first game that has anti-virtual machine code and refuses to run if it thinks it is running in a VM. While FEX isn't a virtual
machine, we expose this CPUID bit so software that cares can use it as a hint to query FEX specific CPUID information. Now that this game has stumbled
upon this issue, we added a configuration profile to disable this CPUID bit for the game. If any other games also pick up on this issue then we will
need more profiles.
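
For readers unfamiliar with the bit in question: on x86 the "hypervisor present" flag is bit 31 of ECX in CPUID leaf 1. The sketch below shows the kind of masking such an option implies. The function and the `hide_hypervisor_bit` parameter are hypothetical names chosen for this example and are not FEX's actual configuration key or code.

```python
HYPERVISOR_BIT = 1 << 31  # CPUID leaf 1, ECX bit 31: "hypervisor present"


def cpuid_leaf1_ecx(base_ecx, hide_hypervisor_bit):
    """Return the ECX value reported for CPUID leaf 1.

    `hide_hypervisor_bit` is a hypothetical per-application option.
    """
    if hide_hypervisor_bit:
        # Software probing for a VM sees the bit cleared.
        return base_ecx & ~HYPERVISOR_BIT & 0xFFFFFFFF
    # Default: advertise the bit so interested software can query further.
    return (base_ecx | HYPERVISOR_BIT) & 0xFFFFFFFF
```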

Proton and pressure-vessel startup optimizations

One of this month's efforts has been improving the time it takes for Proton to start up. pressure-vessel is the project that is used to set up the
Proton execution environment, which takes a while overall. One of the hardest things about Proton is that it executes thousands of programs and does an
absolute ton of filesystem accesses. ARM devices typically don't have the highest performance filesystems, which makes one part of this hard, and
FEX's filesystem overlay adds overhead on top of this. Additionally, one of FEX's current shortcomings is that every application execution must JIT fresh
code every time it restarts. Since pressure-vessel starts so many programs, a lot of the time is just spent emitting code to memory. There were a few
optimizations that went towards making this faster this month.

With these optimizations in place we managed to shave a second off of the start-up time, cutting the execution from 9.7 seconds down to 8.7
seconds. In the case of running on an Apple M1, execution is now down to 7 seconds. Almost all of this time improvement comes from faster syscall
wrapping and the remaining CPU time is code JIT and execution. It'll only get faster in the future!

Fix a race condition with syscall emulation

While this is a fairly minor change, we fixed a race condition around system calls which would consistently cause crashes when Steam was starting up.
Every piece of work that improves stability just makes the whole emulation experience so much better and needs to be celebrated!

Signal frame improvements!

A significant problem with using FEX is the debugging experience when something breaks. We spent a good amount of time this month improving how FEX
sets up its signal frames when the guest application hits a fault. Since we weren't following traditional signal frame generation, tooling around
backtracing was broken in most cases. We have now reworked this so that libSegFault can give FEX a backtrace of the application's
state when it crashes.

We will be shipping a new rootfs which includes x86 and x86-64 libraries for libSegFault so that if users want to debug a crashing application, they
can try and get a backtrace.

AVX work continues

Another month, another bunch of AVX work that has been implemented.

Instructions implemented

  • VPHSUBSW
  • VHSUBPD/VHSUBPS
  • VPERMILPD/VPERMILPS
  • VPERMD/VPERMPS
  • VPHADDSW
  • VPTEST
  • VPMOVSD/VPMOVSS
  • VSHUFPD/VSHUFPS
  • VPSHUFD/VPSHUFHW/VPSHUFLW
  • VPSHUFB
  • VPALIGNR
  • VEXTRACTF128/VEXTRACTI128
  • VPBLENDVB/VBLENDVPD/VBLENDVPS
  • VBLENDPD/VPBLENDW

As you can see a lot of new instructions are now implemented. This now leaves us with about thirty more instructions that need to be implemented
before we can start advertising the feature on hardware supporting 256-bit SVE2. This is significant as we keep finding more and more games that
require AVX to run.

ARM emitter cleanups

This is another change that isn't user facing, but it's always nice to point out janitorial tasks that have been done. When we switched over to using our
own code emitter there were some design choices and implementations that weren't quite optimal. This usually manifests as developer pain when using
the emitter but was a necessary evil since we wanted to get rid of VIXL's assembler as fast as possible. @lioncash
spent some time this month cleaning up a lot of the dirty code in the emitter, in some cases making it slightly faster as well. This is always greatly
appreciated as it reduces maintenance burden when working in the JIT.

They also implemented an absolute ton of new instruction emitter functions which previously didn't exist. While we don't use these yet, we will likely
use them at some point which will make our lives easier in the future.

New development machines for our developers

Just recently a new Snapdragon laptop has gotten working OpenGL and Vulkan drivers up and running! We are gifting each of our developers one of these
great machines in order to ensure we have testing platforms for all the OpenGL 4, DXVK, and VKD3D applications we want to be running! Kudos to all the
developers that worked on bringing this hardware up so quickly!

Raw Changes

  • ARMEmitter
    • Tidy up some assertion handling (e7069f9)
    • Remove predicate implicit conversion operators (41731e2)
    • Make second sxtw parameter a WRegister (e71e3ec)
    • Remove implicit conversions from Register/XRegister/WRegister (378e069)
    • Remove predicate uint32_t conversion operators (e869b2f)
    • Remove most implicit conversion operators for vector register types (0f45318)
    • Make VRegister constructor explicit (21fbcef)
    • Handle sequential registers in lists nicer (ef02083)
    • Simplify size handling Advanced SIMD 3 different group (24904f4)
    • Simplify advanced SIMD copy (e65b429)
    • Centralize handling for unsigned offset load-stores (1832cc8)
    • Handle SVE Integer Compare - Scalars group (fe1faf9)
    • Finish off SVE Predicate Misc group (165db37)
    • Handle SVE partition break categories (4d65521)
    • Handle SVE integer compare with wide elements ca...

FEX-2302

04 Feb 02:35

Read the blog post at FEX-Emu's Site!

This month certainly passed in the blink of an eye. A lot of good bug fixes this month as usual! Continue reading to find out more.

Fix incorrect operation for cache line clears

In emulating the CLFLUSH instruction, FEX was using the wrong operation for clearing caches. We were accidentally using the CVAU operation instead of CIVAC.
While this is incorrect, it was hard to find anything that was actually affected by the wrong implementation. With Snapdragon's open source Vulkan driver implementing what is required for VKD3D,
it became evident from Vulkan tests that this was incorrectly implemented. Switching the implementation was easy and will let VKD3D run without hacks
when the required feature is finished.

Bug fixes to 64-bit x87 emulation

A big thanks to CallumDev for finding and fixing these latest bugs in FEX's less accurate x87 emulation. As a
reminder, x87 on original hardware operates using 80-bit float values. This is a feature that ARM doesn't natively support, so FEX needs to emulate
this using a software floating point library. We have a hack in our configuration that allows removing this software implementation and operating
using 64-bit double operations instead. This can significantly improve performance in some 32-bit games but can introduce rendering artifacts.
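
To illustrate where such artifacts can come from: the x87 80-bit format carries a 64-bit significand, while a 64-bit double carries 53 bits, so some values that x87 represents exactly get rounded away. The following Python sketch models only that rounding step using exact rationals. It is not FEX's software float library, and the helper name is an assumption of this example.

```python
from fractions import Fraction


def round_to_significand(x, bits):
    """Round a positive rational to `bits` significand bits (round-half-even)."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    # Find the exponent e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    # Scale x into [2**(bits-1), 2**bits) so rounding to an integer keeps
    # exactly `bits` significant bits.
    scale = Fraction(2) ** (bits - 1 - e)
    return Fraction(round(x * scale)) / scale


if __name__ == "__main__":
    x = 1 + Fraction(1, 2**60)
    print(round_to_significand(x, 64) == x)  # survives with a 64-bit significand
    print(round_to_significand(x, 53) == 1)  # collapses to 1 as a double
```

A long chain of operations can accumulate differences like this one, which is one plausible way a game's math ends up visibly different with the reduced precision mode.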

This month there were many bug fixes:

  • ALU operations that consume integers converted to floats are fixed
  • Float comparison that also consumes 16-bit integers fixed
  • FPREM instruction no longer infinite looping

With these fixes in place, a large number of games now actually render correctly with this hack enabled. It will be interesting to see how much this
improves performance or battery savings in 32-bit games!

More AVX instructions emulated

With one of FEX's developers taking some time away, this was a little less involved than the last couple of months.
There were still a handful of instructions implemented:

  • VPBLENDD, VBLENDPS, and VPSRAVD

Additionally, while these aren't AVX instructions, we also implemented the CLWB and CLFLUSHOPT instructions. These match their ARM equivalents so it was
mostly an easy implementation that applications can use if they want.

Fix copy and paste error in Arm64 JIT

While this is a fairly minor issue, we had a copy and paste error in FEX's register spilling code. This caused Steam to crash in certain situations,
so this fix helps users who want to run it.

A bunch of minor optimizations

This month had a bunch of small optimizations around the entire project. Alone these are all quite minor, but added together they should remove a couple
of percent of CPU time from FEX's JIT.

  • Arm64 Dispatcher is slightly faster
  • CPUID emulation initialization is faster
  • Optimize File loading, improving config loading time
  • Frontend instruction decoder optimizations to be faster
  • Makes IR operations 1 byte smaller, improving memory usage
  • Inline IR constants optimization to reduce IR memory size

Fixing thunk symbol override fetching

FEX's thunks had an issue where if a library was loaded, we would only ever fetch relevant symbols from that library directly. While this worked for
our use case, it breaks when wanting to use MangoHud in OpenGL applications. Resolving this issue fixes most things that will override symbols with
LD_PRELOAD.

Update JEMalloc from 5.2.1 to 5.3.0

While this is a fairly minor change, this release of JEMalloc fixes some bugs and improves performance. It's small, but every performance improvement is
welcome.

Support for execveat with AT_EMPTY_PATH

This is an interesting feature where an application can be executed directly through a file descriptor instead of a filepath on disk. This is a fairly
simple idea but has some edge cases that might be interesting to some people. To see the more technical information about implementing
this, check out the pull request.
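
As a rough illustration of executing through a file descriptor from the application side, the Python sketch below passes an open descriptor to `os.execve`. On Linux, CPython uses `fexecve` for this, which recent glibc versions typically implement with `execveat` and `AT_EMPTY_PATH`; that mapping is an assumption about the host C library rather than something this code controls. The helper name `run_via_fd` is made up for this example.

```python
import os
import sys


def run_via_fd(path, argv):
    """Execute the program at `path` through an open file descriptor.

    Forks, execs in the child, and returns the child's exit status
    (or -1 if it did not exit normally). Linux-oriented sketch.
    """
    if os.execve not in os.supports_fd:
        raise OSError("this platform cannot exec from a file descriptor")
    fd = os.open(path, os.O_RDONLY)
    try:
        pid = os.fork()
        if pid == 0:
            try:
                # An fd is passed instead of a path, so no filename lookup
                # happens at exec time.
                os.execve(fd, argv, dict(os.environ))
            finally:
                os._exit(127)  # only reached if the exec failed
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    finally:
        os.close(fd)


if __name__ == "__main__":
    code = run_via_fd(sys.executable, [sys.executable, "-c", "raise SystemExit(7)"])
    print("child exited with", code)
```

One edge case worth knowing about: when the target is a script with a shebang line and the descriptor is close-on-exec, the interpreter can no longer open the script, so this sketch sticks to executing a binary.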

Raw Changes

  • ARMEmitter

  • Handle integer add/subtract vectors (predicated) instruction class (9d33bba)

  • Handle RMIF, SETF8/SETF16 (a899f9f)

  • Handle SVE floating-point recursive reduction (1cda029)

  • Add a few missing instructions (2c9f99e)

  • Support helper for long address generation (f8d56a8)

  • Removes some warnings that cropped up (5fd8fdb)

  • Arm64

  • Merge two loads in to an LDP (a28039f)

  • Fixes incorrect operation for CacheLineClear (f8d92aa)

  • Use switch statement for op handlers instead of jump table (565ed45)

  • Fix SpillRegister C&P error (9c93c6f)

  • Fixes large offset spill slots (9acb513)

  • VectorOps

  • Clamp shift amount to esize-1 for VSShr (9a318ca)

  • ArmEmitter

  • Adds two more classes of ASIMD instructions (95e544c)

  • Adds three more classes of ASIMD instructions (81e0ac7)

  • CPUID

  • Optimize initialization (f614fc6)

  • Config

  • Fix relative execve applications. (65971ef)

  • ConstProp

  • Pool inline constants (1e90ebb)

  • Core

  • Adjust virtual memory size for 32-bit (7f6a620)

  • Dispatcher

  • Extract 64-bit signal frame save and restore (65b6b6d)

  • Fixes x86-64 SA_SIGINFO generation (8dae785)

  • ELFCodeLoader

  • Don't use std::random_device for RNG (f5e97f3)

  • Emitter

  • Remove unused header (90bcb8c)

  • External

  • Update JEMalloc to disable 16k pages (bbf9198)

  • Externals

  • Update jemalloc to 5.3.0 (9322e55)

  • F64

  • Fix integer immediates for add,mul,div,sub (c2325e1)

  • FEXCore

  • Fixup 32-bit signal handling (fa1193f)

  • FEXLoader

  • Adds support for execveat with AT_EMPTY_PATH (dcce9ad)

  • Build FEXInterpreter and FEXLoader independently (8974509)

  • FEXRootFSFetcher

  • Support option to auto select first distro (a7aeb4a)

  • FEXServer

  • Remove POLLREMOVE usage (d2d5282)

  • FileLoading

  • Optimize FileLoad (28dd946)

  • Frontend

  • Various optimizations (787b689)

  • Github

  • Add ARM emitter tests to CI (da88c68)

  • IR

  • Removes NumArgs member from IR ops (9403c66)

  • Remove HasDest member (f8e762f)

  • JitSymbols

  • Fixes file opening and writing (a486797)

  • Fixes a crash that can occur (34e1ba6)

  • Linux

  • Fixes shebang file execution (477d4b6)

  • MContext

  • Insert a stack cookie with assertions enabled (7664359)

  • OpDispatcher

  • Adds support for CLWB and CLFLUSHOPT (7be2e1a)

  • Fixes a few missing GPR/XMM helper usages (4aa984a)

  • OpcodeDispatcher

  • Handle VPBLENDD/VBLENDPS (62e6ada)

  • Handle VPSRAVD (fe79f61)

  • Scripts

  • Update InstallFEX.py rootfs links (df87042)

  • Syscalls

  • Fix out-of-bounds read when handling single-line shebang files (https:/...


FEX-2301

06 Jan 17:45

Read the blog post at FEX-Emu's Site!

Happy new year! A new month brings a new release of FEX-Emu, bringing in the new year.

A large amount of work in this last month, showing that FEX-Emu isn't slowing down even through the holiday season.

AVX emulation work continues

An absolute ton of work landed this last month towards bringing up AVX emulation. In total there were around 185 new
AVX instructions implemented in FEX-Emu's backend this month. At this point it starts becoming easier to talk about the number of missing instructions
rather than what is implemented.

According to FEX-Emu's instruction decoder tables, we have around 60 more instructions to implement before we can start advertising the feature. Of course
with anything programming related, the last 10% is going to take the longest to implement.

A huge shoutout to @lioncash for smashing out these implementations so quickly. The amount of work going in to this is
extensive.

As a side-note for users looking forward to this feature: the implementation now requires hardware that supports both SVE and SVE2 with a 256-bit register
width, which means that Fujitsu A64FX, Neoverse-V1, and all current consumer class Cortex chips are incapable of taking advantage of AVX once
complete. This is a future-proofing implementation for when hardware that supports what FEX-Emu needs becomes available.

Implement a new AArch64 code emitter

One thing that has been a standout performance bottleneck is how quickly FEX-Emu can emit AArch64 binary code to memory. The project that
FEX-Emu used for this is ARM/Linaro's project called vixl. This project is a suite of tools including assemblers,
simulators, and disassemblers, and many open source projects use it. This is a very nice project that eases the developer's burden when writing a
JIT that targets ARM devices. Sadly, when profiling our code, it turns out that FEX-Emu spends a decent amount of time inside of vixl code due to how
large it is. Even with Link-Time-Optimization enabled in our code, we can't reduce the overhead incurred from vixl.

With this in mind, FEX-Emu decided to create its own AArch64 code emitter tailored to what the project needs, which is high performance and low
overhead.

As seen in the chart above, the difference in how long it takes to emit code between vixl and our new emitter is significant, with the
Cortex-X1 only taking 68.7% of the time, and a smaller Cortex-A55 only taking 60.2% of the time. The Cortex-A55 seeing more of a win shows
that, because of how much code vixl executes in order to emit code, it was effectively saturating the icache and
BTB of the poor little CPU core.

Code emission performance isn't the only story that matters here though. We need to showcase how much of an improvement this makes when including the
rest of the translation from x86 code.

Although code emission is only a percentage of our total time spent when translating x86 code, this new emitter delivers a fairly massive ~8%
reduction in time spent JITing. This will manifest as reduced stutters when users are running games and generally faster application execution for
short-lived applications.

We're not stopping there of course, look forward to the coming months as we spend more time optimizing our JIT so it runs even faster!

Initial 32-bit thunk support

A tricky thing that FEX-Emu does with its emulation is translate 32-bit x86 applications to run inside of a 64-bit process space. This
is a hard problem to resolve, which is why we don't currently support thunking of libraries when running 32-bit applications. This is the initial work
required to start supporting this use case.

While not wired up to any library currently, we are quickly working towards getting Vulkan and OpenGL wired up to this interface so we can accelerate
older 32-bit games.

Various JIT optimizations

There have been various JIT optimizations this month which will improve performance a small amount. These aren't benchmarked since the percentage
improvements are so small that it is likely to fall in to single digit noise.

Optimize inline syscall spilling

When FEX handles a syscall inline with our JIT, we were spilling all of our registers to memory. Now with this optimization correctly working we only
spill exactly what is required, making inline syscalls faster.

Optimize generic spilling and filling

When jumping out of the JIT to C code, we need to spill both general purpose registers and vector registers to the stack. With this optimization in place we now
generate roughly half the instructions necessary when doing so.

Optimize SVE register spilling and filling

While currently not utilized today, this cuts the number of instructions required for spilling SVE registers to a quarter. Should be quite nice for
future hardware.

Zip elements for PHSUB instructions

These horizontal vector instructions behave a little weirdly and our original JIT implementation wasn't quite optimal. Previously we were doing
explicit element inserts to combine the final result. Now we are using the AArch64 Zip instructions which are significantly more optimal.
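
To show why a permute-based approach beats per-element inserts, here is a small Python model of PHSUBW alongside an equivalent formulation that gathers the even and odd lanes first and then does a single lane-wise subtract. Note that this sketch uses a de-interleave (what AArch64 calls UZP1/UZP2) purely as an illustration; it is not necessarily the exact instruction sequence FEX emits, and the function names are assumptions of this example.

```python
def wrap16(v):
    """Wrap an integer to a signed 16-bit lane value."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def phsubw(a, b):
    """Reference semantics: subtract adjacent pairs, first from a, then from b."""
    pairs = a + b
    return [wrap16(pairs[i] - pairs[i + 1]) for i in range(0, len(pairs), 2)]


def phsubw_deinterleaved(a, b):
    """Same result with two lane gathers followed by one vector subtract."""
    joined = a + b
    evens = joined[0::2]  # lanes 0, 2, 4, ... of the concatenation
    odds = joined[1::2]   # lanes 1, 3, 5, ... of the concatenation
    return [wrap16(e - o) for e, o in zip(evens, odds)]
```

The second form maps onto a fixed, small number of whole-vector operations regardless of lane count, whereas inserting each result element individually costs one operation per lane.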

Fix global application configurations

This was a bug where we accidentally broke application configurations shipped with the fex-emu package. In particular this caused the steamwebhelper
to break. With this resolved, Steam will work correctly again.

Fix misspelled library names in Thunks Database

While a fairly minor fix, this can have a profound impact on users that are using our thunking infrastructure. Our XCB thunks were incorrectly named,
which meant that if users were enabling XCB thunks independently of Vulkan/GL, then they wouldn't have actually been enabled.
With this typo fixed, this is no longer a concern.

Note that if Vulkan or GL thunks were enabled, then this wouldn't likely have been an issue since X11 would have loaded xcb independently anyway.

Misc

There was a bunch more this month that was smaller and spread out. We don't want to take up too much of your time so if you want to see more, make
sure to check out the detailed change log!

Raw Changes

  • ARM64

  • Moves RA functions to header (048daa4)

  • Arm64

  • Rename GetSrcPair, GetDst, and GetSrc (bf7d0f7)

  • Enables debug option for disassembling the JIT code (03a0613)

  • Inline Syscall spill optimization (0ebb15c)

  • Optimize SVE register spilling and filling (1ab4471)

  • Optimizing spilling and filling (9a8852f)

  • Reduce dispatcher to 1 page (65e8bf9)

  • VectorOps

  • Simplify FADDP result merging (344ec33)

  • Config

  • Fixes global application configs (dc9737a)

  • Crypto

  • Explicitly clear upper lane with VPCLMULQDQ (4c013c8)

  • Dispatcher

  • Calculate REG_ERR correctly using ARM ESR_EL1 (4f313f5)

  • Frontend

  • Handle 256-bit destination sizes directly (e8aa79b)

  • IR

  • Handle 128-bit VInsElement with SVE (94ae2e3)

  • LookupCache

  • Use a PMR map for our Blocklinks with monotonic allocator (b7358b4)

  • Optimize cache clearing and allocation (2b6a020)

  • OpCodeDispatcher

  • Optimize a case of GOT calculation (b42b4e0)

  • OpcodeDispatcher

  • Handle immediate variants of VPERMILPD/VPERMILPS (3904a52)

  • Handle VMASKMOVDQU (c6297ed)

  • Handle VPHSUBD/VPHSUBW (4786ddc)

  • Zip elements instead of for loop insertion in PHSUB (58ec2b2)

  • Handle VDPPD/VDPPS (9b8c92e)

  • Handle VINSERTPS (6caf764)

  • Handle VMOVMSKPD/VMOVMSKPS (faa81f2)

  • Handle VPUNPCKHBW/VPUNPCKHWD/VPUNPCKHDQ/VPUNPCKHQDQ (64cd377)

  • Handle VUNPCKHPD/VUNPCKHPS (138f1fc)

  • Handle VPUNPCKLBW/VPUNPCKLWD/VPUNPCKLDQ/VPUNPCKLQDQ (6bc1c3f)

  • Handle VUNPCKLPD/VUNPCKLPS (4560c5b)

  • Handle VCVTSS2SI/VCVTTSS2SI/VCVTSD2SI/VCVTTSD2SI (4a884802f86...


FEX-2212

06 Dec 00:12

Read the blog post at FEX-Emu's Site!

A lot of good work this month with the highlight being that we have started working on our AVX implementation and started optimizing our IR to be more efficient.

Disable PCLMUL if not supported on host

This carry-less multiplication instruction is only implemented on ARM SoCs that ship the cryptographic extension.
This extension is unsupported on the Raspberry Pi, which was causing applications that use openssl to crash.
Specifically this fixes Steam running on the Raspberry Pi again.
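
For context on what this instruction computes: a carry-less multiplication is an ordinary shift-and-add multiply where the additions are replaced with XOR, so no carries propagate between bit positions. The Python sketch below models the 64-bit by 64-bit case that PCLMULQDQ performs on the selected quadwords. It is a plain reference model, not FEX's implementation, and the function name is an assumption of this example.

```python
def clmul64(a, b):
    """Carry-less multiply of two 64-bit values, giving up to a 127-bit result."""
    a &= (1 << 64) - 1
    b &= (1 << 64) - 1
    result = 0
    shift = 0
    while b:
        if b & 1:
            # XOR instead of add: partial products never carry into each other.
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


if __name__ == "__main__":
    # (x + 1) * (x + 1) = x^2 + 1 over GF(2), so 0b11 * 0b11 gives 0b101.
    print(bin(clmul64(0b11, 0b11)))
```

Emulating this bit by bit is slow compared to a native instruction, which is why the operation is tied to the host's cryptographic extension.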

Adds 256-bit support to the remaining IR vector ops

A lot of work this month for implementing support for 256-bit operations.
With this work in place our JITs now support 256-bit for all of the IR operations.

Work started on AVX emulation

With the previous work completed for having our JITs support 256-bit operations, work could now be started on implementing AVX.
This AVX work is implemented as native SVE 256-bit operations, so the only hardware that can currently execute this partial implementation is Neoverse-V1 CPUs.
The expectation is that as ARM CPUs become more powerful, they will eventually support SVE with 256-bit sized registers.
It may take a few generations to get hardware that supports this, but if ARM CPUs want to run AVX games then they will need to support the equivalent hardware feature-set.

Current instructions implemented:

  • VZEROUPPER, VZEROALL
  • VMOVAPS, VMOVQ
  • VMOVNTDQ, VMOVNTDQA, VMOVNTPD, VMOVNTPS
  • VMOVDQA, VMOVDQU
  • VMOVAPD, VMOVUPD, VMOVUPS
  • VMOVLPD, VMOVLPS
  • VMOVSHDUP, VMOVSLDUP
  • VMOVHPD, VMOVHPS
  • VMOVDDUP
  • VORPD, VORPS, VPOR
  • VPXOR, VXORPD, VXORPS
  • VANDPD, VANDPS, VPAND, VANDNPD, VANDNPS, VPANDN
  • VADDPD, VADDPS, VPADDB, VPADDW, VPADDD, VPADDQ

This is just the beginning of us implementing support for this, stay tuned as we implement the remaining operations over the next few months.

Generate register access IR operations directly

As an original implementation design detail, FEX implemented GPR and XMM register accesses as a generic emulated CPU state access. Once we added
static register allocation we also added an optimization pass to convert these generic accesses in to register accesses which directly map to our
static register allocator.

This is a redundant pass since we know upfront which registers are being accessed. With this change we are generating register access IR operations
directly and removed the optimization pass. This removes around 12% JIT compilation time, which improves responsiveness and lets FEX spend less time
compiling code.

Systemd fixes

While this is a niche supported operation, some people may be interested in running FEXServer as a systemd client.
A FEXServer is meant to be a user-wide server that the FEX clients talk to for rootfs and eventually other management.
Using a systemd user service, a FEXServer can be started early, letting it mount the rootfs image, and run in the background.
This can be fairly useful as FEX error logs can then be printed to journalctl for inspecting why a process has crashed.

Add support for steamid based configuration files

As an ongoing effort of documenting which applications can run with FEX's OpenGL and Vulkan thunk libraries, it was determined that some applications
use generic executable names. This means that a configuration file that uses the application name would have erroneously enabled thunks for other
untested applications.

In order to work around this issue, our configuration system now supports an optional steamid based naming convention for games that are launched from
Steam. With this in place, we now have a repository that contains application configurations that users can install at their leisure. This repository
can be found on GitHub.

As part of the documentation process, all of these configurations must be documented on our Wiki with
testing results to ensure it works.

Implement SGDT

This is a quirky instruction that is emulated on a native x86 system these days. This instruction is a system instruction that is used by the OS for
getting the configuration of the global descriptor table. Linux captures this instruction and returns a configuration that says the table is living in
kernel memory space. While this is already true, an application usually doesn't need to care about this data.

Curiously enough Denuvo uses this instruction in some of their implementations for some reason. With us implementing this instruction, Denuvo games
now get slightly further before they horribly crash.

auxv fixes

When FEX executes an application, it needs to set up an emulated auxv state since this isn't a cross-architecture state.

  • AT_RANDOM
    • This now correctly passes through the host's AT_RANDOM value rather than fixed values
  • AT_PLATFORM
    • Some tooling uses this to determine if it is running as i686 or x86-64
  • AT_HWCAP/HWCAP2
    • This just returns some CPUID values, most applications use CPUID directly instead of this
  • AT_MINSIGSTKSZ
    • The minimum signal stack size is no longer a hardcoded constant
    • Applications are supposed to use this to calculate a signal stack size
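
For readers who want to look at these entries themselves, the auxiliary vector is a flat array of (key, value) word pairs terminated by an AT_NULL entry, and Linux exposes it at /proc/self/auxv. The Python sketch below parses that layout. The key numbers are the standard Linux values; the function name and the little-endian assumption are choices made for this example, and nothing here reflects how FEX builds its emulated auxv internally.

```python
import struct

# Standard Linux auxv keys referenced in the list above.
AT_NULL, AT_PLATFORM, AT_HWCAP = 0, 15, 16
AT_SECURE, AT_RANDOM, AT_HWCAP2 = 23, 25, 26
AT_MINSIGSTKSZ = 51


def parse_auxv(raw, word_size=8):
    """Parse raw auxv bytes into a {key: value} dict.

    Assumes little-endian words, which holds for x86 and typical AArch64.
    """
    fmt = "<QQ" if word_size == 8 else "<II"
    entries = {}
    for key, value in struct.iter_unpack(fmt, raw):
        if key == AT_NULL:  # terminator entry
            break
        entries[key] = value
    return entries


if __name__ == "__main__":
    with open("/proc/self/auxv", "rb") as f:
        auxv = parse_auxv(f.read(), struct.calcsize("P"))
    # AT_RANDOM and AT_PLATFORM hold pointers; the others are plain values.
    print("AT_MINSIGSTKSZ:", auxv.get(AT_MINSIGSTKSZ))
    print("AT_HWCAP:", hex(auxv.get(AT_HWCAP, 0)))
```

An application sizing an alternate signal stack would read AT_MINSIGSTKSZ from here (usually via the C library) rather than relying on a compile-time constant.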

Support radeon drm driver in ioctl emulation

Most Radeon GPUs these days use the amdgpu kernel driver, but a user found a hole in our ioctl emulation by using an old Radeon GPU on a Phytium ARM
board.

With this in place, older Radeon cards that use the radeon kernel driver can now have accelerated OpenGL.

Misc optimizations

This month we have had a random smattering of optimizations that improve startup, shutdown, and execve performance. While not individually providing a
lot of benefit, small optimizations like these add up to make FEX better over time.

  • Defer cpuinfo file initialization until first access
    • Improves startup time
  • Use tsl::robin_map for some internal maps
    • Improves JIT time, and some minor shutdown performance improvements
  • Disable multiblock by default
    • This causes excessive JIT overhead which makes the experience worse for the user
    • Significantly reduces stutters
  • Improve hot path of file existence checking in syscall wrapping
    • During our overlayfs handling, this can be hit quite hard during file accesses
    • Improves file IO in applications

Raw Changes

  • Arm64

  • Const on unmodified argument (9ca34ca)

  • Minor optimization in AESKEYGENASSIST (c1d118c)

  • Optimize Break IR op codegen (c7dd6ff)

  • VectorOps

  • Simplify VMov IR op on SVE (70e6ab5)

  • CMake

  • Fix typo in clang thunks option. (0030971)

  • Config

  • Disable multiblock by default (df25d4e)

  • Add support for steamid based configurations. (02ca94e)

  • Core

  • Replace a couple maps with tsl robin_map (57c5761)

  • Removes log about migrating to shared memory mode (8b6e9e0)

  • ELFCodeLoader

  • Calculate AT_MINSIGSTKSZ (e0fe916)

  • Fixes AT_PLATFORM null terminator (d7b0e84)

  • Pass through AT_SECURE (8afc3b8)

  • Ensure we set AT_SYSINFO for 32-bit (1d32df9)

  • EmulatedFiles

  • Defer cpuinfo file initialization to first access (8e2b0d1)

  • Externals

  • Update vixl submodule (f066abc)

  • FEXConfig

  • Sort named rootfs vector (71f658b)

  • FEXLoader

  • Make IsInterpreterInstalled check less horrible. (1dd5642)

  • Disables some AOT shutdown overhead when not enabled (f8b2a0b)

  • FEXServer

  • More Systemd fixes (5e5e5a3)

  • FEXServerClient

  • Disable confusing connection log (cc6306a)

  • Add some debug logs for when FEX can't connect to se… (3c8da3e)

  • IR

  • Handle 256-bit VExtr (5a403b7)

  • Removes the only uses of VSLI and VSRI (7d9ed4e)

  • Remove VLoadMemElement and VStoreMemElement (9cee012)

  • Handle 256-bit LoadRegister/StoreRegister (a9c5138)

  • Handle 256-bit VAddV (04d4c5e)

  • IntrusiveIRList

  • Add a utility helper for getting an OrderedNodeWrapper (3c88180)

  • IoctlEmu

  • Support radeon (https://github...


FEX-2211

03 Nov 07:44

Read the blog post at FEX-Emu's Site!

A lot of good changes this month for our users. Both performance and compatibility improvements to be had!

Segment register index optimization

This optimization has been a long time coming, sitting in pull-request limbo since back in April. This is an optimization to cache segment register
addresses so the JIT can more optimally generate memory accesses. While segment registers are mostly gone with x86-64, 32-bit segment registers are
used fairly commonly, with some instructions using them completely implicitly. Without caching, this just adds overhead to fetch the LDT and GDT entries for something that
typically doesn't change very often.

With this optimization in place, we get an average of 4.3% uplift in 32-bit Bytemark. This performance improvement will be directly felt when
running 32-bit applications.
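
The idea of caching a segment base can be illustrated with a small sketch. This is not FEX's implementation; the class name `SegmentBaseCache`, the dictionary standing in for the LDT/GDT, and the lookup counter are all hypothetical, chosen only to show that the descriptor table is consulted when a segment register is written rather than on every memory access.

```python
class SegmentBaseCache:
    """Hypothetical model: resolve a segment base once, reuse it for every access."""

    def __init__(self, descriptor_table):
        # descriptor_table maps a selector to a base address and stands in
        # for the guest LDT/GDT (an assumption made for this sketch).
        self.descriptor_table = descriptor_table
        self.cached_base = {}
        self.table_lookups = 0

    def write_segment_register(self, reg, selector):
        # The slow path: only taken when the guest changes the register.
        self.table_lookups += 1
        self.cached_base[reg] = self.descriptor_table[selector]

    def effective_address(self, reg, offset):
        # The fast path: no table walk, just an add with 32-bit wraparound.
        return (self.cached_base[reg] + offset) & 0xFFFFFFFF


cache = SegmentBaseCache({0x33: 0x1000})
cache.write_segment_register("fs", 0x33)
addresses = [cache.effective_address("fs", off) for off in (0x20, 0x24, 0x28)]
```

Three accesses through `fs` here cost a single table lookup, which is the effect the optimization is after.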

48-bit Proton Experimental fixes

For a while now FEX has worked with Proton 7.0 and older, but we have had issues running Proton Experimental in some cases.
This was a tricky problem to nail down but we had some good leads. If your ARM device was running its kernel with 48-bit Virtual address space (VA) enabled then Proton
Experimental wouldn't work. On the other hand, if your kernel is compiled using a 36-bit VA then it would run fine. After a few days of debugging, it
turns out that Proton/Wine allocates the lowest 32MB of its stack space, and the kernel by default allocates a 128MB space for the application.

When an application is run natively, the stack is allocated at a fixed location in memory. FEX was failing to allocate the stack at the correct
location. By the time Wine's preloader eventually ran, FEX had already allocated JIT code at that fixed location, which Wine would then map over, zeroing the
memory and breaking the FEX JIT. The preloader has done this for a long time and it was by pure chance that we weren't breaking older versions of Wine
and Proton.

With this problem fixed in FEX, we are now able to run triple-A games on AArch64, such as God of War running on a Snapdragon
888.

Even more IR changes preparing for AVX emulation

Once again this month we have an absolute ton of commits from Lioncash working on making our JIT ready for AVX emulation. Around 25 commits went
towards this, leaving only about four more IR vector operations needed to support AVX.

Once the JITs support 256-bit operations, we can start working towards emulating the instructions themselves.

Fix thunk crashing due to insufficient stack space

When FEX starts we potentially need to allocate all memory inside of the 48-bit VA space to match how x86-64 only has 47 bits.
This intersects with our stack space allocation which is supposed to autogrow, but we allocated it instead. Now we give the full 128MB stack space to
FEX so it won't crash anymore.

Implements support for remaining BCD instructions

Thanks to @wannacu for implementing the remaining handful of 32-bit BCD instructions. DAA, DAS, AAA, AAS, AAM, and AAD
were all missing in FEX's implementation. While BCD is fairly uncommon these days, they still managed to find an application that uses
these instructions. With these in place, FEX should finally have all of the BCD instructions implemented.
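
For readers unfamiliar with BCD adjustment, the two simplest of these instructions can be modelled in a few lines. This is a minimal sketch of the architectural behaviour of AAM and AAD on the AX register only; it is not FEX's code, the function names are illustrative, and the flag updates (SF, ZF, PF) are deliberately left out.

```python
def aam(ax: int, base: int = 10) -> int:
    """AAM: split AL into two unpacked digits (AH = AL // base, AL = AL % base)."""
    if base == 0:
        # Real hardware raises a divide error (#DE) for an immediate of zero.
        raise ZeroDivisionError("AAM with an immediate of 0")
    al = ax & 0xFF
    return ((al // base) << 8) | (al % base)


def aad(ax: int, base: int = 10) -> int:
    """AAD: fold two unpacked digits back into AL and clear AH."""
    al = ax & 0xFF
    ah = (ax >> 8) & 0xFF
    return (al + ah * base) & 0xFF


unpacked = aam(42)        # AH = 4, AL = 2
repacked = aad(unpacked)  # back to binary 42
```

The default base of 10 matches the usual encoding of these instructions; other immediates are accepted by the hardware, which is why `base` is a parameter here.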

Implement gpuvis timeline profiler support

While not majorly important for users, this is a very good interface for developers wanting to watch why a game has stuttered and for how long code
took to compile. This lets us take advantage of the same interface that GPU profiling events are using to see why a game missed a vsync.

This isn't enabled by default out of concern for taking too much CPU time, so it needs to be enabled with the ENABLE_FEXCORE_PROFILER cmake
option.

Fix ROR OF flag calculation

This is a fairly minor bug since not many things rely on the OF flag specifically. But in our testing of new Proton games, we found out that Denuvo
Anti-Tamper is relying on this edge case behaviour and we messed it up. While this gets Denuvo running slightly farther, it still doesn't quite work
under FEX.
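
The source does not say which edge case was miscalculated, so the following is only a sketch of how x86 defines the flags for ROR, not a description of the FEX fix. The function name and the use of `None` for "unchanged or undefined" are choices made for this example.

```python
def ror(value: int, count: int, size: int = 32):
    """Rotate right and return (result, CF, OF) following the x86 definition.

    CF and OF are None when the instruction leaves them unchanged (masked
    count of zero); OF is None when it is architecturally undefined (masked
    count other than one).
    """
    mask = (1 << size) - 1
    value &= mask
    # The count is masked to 6 bits for 64-bit operands and 5 bits otherwise.
    count &= 0x3F if size == 64 else 0x1F
    if count == 0:
        return value, None, None
    rot = count % size
    result = ((value >> rot) | (value << (size - rot))) & mask
    cf = (result >> (size - 1)) & 1
    of = None
    if count == 1:
        # OF is the XOR of the two most significant bits of the result.
        of = cf ^ ((result >> (size - 2)) & 1)
    return result, cf, of
```

For example, `ror(0x01, 1, 8)` moves the low bit into bit 7, so the two top bits differ and OF is set, while `ror(0x81, 1, 8)` produces `0xC0`, where both top bits are set and OF is clear.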

Fixes FPREM1 C2 flag calculation

FPREM1 will set the C2 flag if the number was too large to calculate in one step, which is usually not the case. Since we are calculating the full
remainder, we never need to report a partial remainder. This solves an infinite loop in Mono applications that are using SIN/COS math
operations.
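
The interaction between C2 and a guest's reduction loop can be sketched as follows. This is a simplified model under stated assumptions: it uses Python doubles rather than 80-bit x87 values, it ignores the quotient bits that the real instruction places in C0, C1 and C3, and the function names are illustrative.

```python
import math


def fprem1(st0: float, st1: float):
    """Return (remainder, C2) with the full IEEE 754 remainder computed at once.

    math.remainder rounds the quotient to nearest, which is the rounding
    FPREM1 uses. Because the reduction is always complete, C2 is always 0.
    """
    return math.remainder(st0, st1), 0


def reduce_argument(x: float, modulus: float) -> float:
    # Typical guest pattern: repeat the instruction until C2 reports that the
    # reduction is complete. A C2 that is wrongly stuck at 1 would never exit.
    while True:
        x, c2 = fprem1(x, modulus)
        if c2 == 0:
            return x
```

With C2 always clear the loop in `reduce_argument` runs exactly once, which is why reporting a partial remainder that never completes shows up as a hang.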

Claim X87 transcendental ops are in range

X87 will set a flag if a program tries to operate on a value that is out of range for transcendental SIN/COS/TAN operations.
FEX-Emu doesn't actually detect these for performance reasons, so instead claim these are always in range. While not always true, if they are out of
range then we weren't detecting them anyway. Fixes an issue where glibc would do some fixups to try and bring the value in range, resulting in invalid
results.

Add missing thunk library versions

This fixes an issue where FEX thunks would try to dlopen development libraries, which are missing on most users' devices.

Fixes indirect thunks with 8+ arguments

This fixes a quite bad crash with OpenGL and Vulkan thunking where every function with 8 or more arguments would be likely to break.
Fixes thunks for a bunch of games.

Add support for disabling thunks in application configurations

This is useful for narrowing down thunk compatibility issues in certain applications. While it is still not recommended to enable thunks globally,
this allows more flexibility when tinkering with it.

Implements four more auxv values

FEX implements most of these values for applications to pull, but in some cases we didn't have them set up. Specifically, AT_PLATFORM is required
so ldconfig can work correctly. AT_HWCAP/AT_HWCAP2 are used by an application to check for CPU features, and AT_RANDOM is a 128-bit
random number that the kernel provides.
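
To show what an application sees, here is a small sketch that decodes an auxiliary vector from raw bytes. The numeric constants are the values from the Linux `elf.h` headers; the function name `parse_auxv` and the sample data are made up for the example, and little-endian layout is assumed.

```python
import struct

# Auxiliary vector types as defined in the Linux headers.
AT_NULL = 0
AT_PLATFORM = 15
AT_HWCAP = 16
AT_RANDOM = 25
AT_HWCAP2 = 26


def parse_auxv(raw: bytes, word_size: int = 8) -> dict:
    """Decode (type, value) pairs until the AT_NULL terminator."""
    fmt = "<QQ" if word_size == 8 else "<II"
    entry_size = struct.calcsize(fmt)
    entries = {}
    for offset in range(0, len(raw) - entry_size + 1, entry_size):
        key, value = struct.unpack_from(fmt, raw, offset)
        if key == AT_NULL:
            break
        entries[key] = value
    return entries


# Illustrative data only: the values are placeholders, not real capabilities.
sample = struct.pack("<6Q", AT_HWCAP, 0xABCD, AT_RANDOM, 0x7FFF0000, AT_NULL, 0)
decoded = parse_auxv(sample)
```

On a Linux system the same function can be fed the contents of `/proc/self/auxv`. Note that the values stored for AT_PLATFORM and AT_RANDOM are pointers into the process's memory (to a string and to 16 random bytes respectively), not the data itself.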

Misc

Quite a few more things that were changed this month, but this report has been going on long enough.

Raw Changes

  • 32bit

  • Fixes Debug build of VDSO (cf91ab9)

  • Allocator

  • Expand stack space when stealing virtual address space (000677a)

  • Arm64

  • BranchOps

  • Remove unused std::vector (74e18f4)

  • ConversionOps

  • Eliminate use of temporary in Vector_FToF (b1d98f4)

  • MemoryOps

  • Merge if statement into switch in ParanoidLoadMemTSO (199649b)

  • Remove lingering unnecessary ptrue instances (40d820f)

  • VectorOps

  • Make use of MOVPRFX where applicable (639d6e6)

  • Simplify SVE VSXTL/VSXTL2/VUXTL/VUXTL2 implementations (136f1e2)

  • ELFCodeLoader

  • Fixes Proton Experimental on 48-bit VA systems (2fa1a64)

  • Implement four more auxv values (fa5322d)

  • External

  • Update vixl submodule (fc6de5f)

  • FEXCore

  • Adds support for a timeline profiler interface (62a24bd)

  • FEXServer

  • Be robust against invalid packets. (64eb87e)

  • Flags

  • Refine _Bfe's shift (abb44d3)

  • IR

  • Handle 256-bit VInsElement (9e7daf6)

  • Handle 256-bit LoadContextIndexed/StoreContextIndexed (03f0edc)

  • Handle 256-bit StoreMem/StoreMemTSO/ParanoidStoreMemTSO (70a91ee)

  • Handle 256-bit LoadMem/LoadMemTSO/ParanoidLoadMemTSO (8a14f87)

  • Handle 256-bit LoadContext/StoreContext (d475b0b)

  • Handle 256-bit VTBL1 (2332c41)

  • Handle 256-bit VInsGPR (b7d9c00)

  • Handle 256-bit VExtractToGPR (b3ee5db)

  • Handle 256-bit Vector_FToI (7e81023)

  • Check for invalid conversion masks in Float_FromGPR_S (7291b10)

  • Handle 256-bit Vector_FtoF (b8f7e4c)

  • Handle 256-bit Vector_FToZS/Vector_FToS (13003da)

  • Handle 256-bit Vector_SToF (cb17ee9)

  • Handle 256-bit VDupElement (27b022d)

  • Handle 256-bit VUnZip/VUnZip2 (780e3c7)

  • Handle 256-bit VZip/VZip2 (ab45d...

FEX-2210

13 Oct 08:57

Read the blog post at FEX-Emu's Site!

This month's release was a bit delayed due to the fact that most of FEX-Emu's developers were meeting up physically at the X.Org Developer's
Conference this year! Before we talk about this month's changes we need to spend a bit of time talking about some cool things.

FEX-Emu XDC talk

This year FEX-Emu had a talk to discuss some of the weird interactions with Mesa in an emulated environment. You can see the full talk in the embedded
video.
XDC Talk

At the end of the video we showed a quick demo of (mostly!) Proton games running under FEX-Emu on a Snapdragon 888 device. You can see this demo
directly embedded below.
XDC Sizzle Reel

Ubuntu 22.04 Rootfs Mesa update

We have had to update the Ubuntu 22.04 rootfs image with a newer version of Mesa today. Unfortunately our last update with Mesa 22.2 had a bug in the
Raspberry Pi Vulkan driver which completely broke Vulkan on ALL devices, not just Raspberry Pi. We have updated the rootfs today with a Mesa git
version of the library to work around this issue. As a benefit, this version of the FEX rootfs includes the new Venus Vulkan 1.3 driver which can be
useful for testing.

Pick up the latest rootfs with the FEXRootFSFetcher tool.

New Lenovo ThinkPad X13s Gen 1 laptops

Last month Lenovo launched a new Snapdragon laptop that is one of the best development platforms that FEX-Emu devs could ask for. This platform is
shipping the Snapdragon 8cx Gen 3 SoC which is one of Qualcomm's most powerful chips. The only downside with this platform currently is that the GPU
doesn't yet work under Linux. There is an ongoing community effort to get the GPU up and running but these Snapdragon chips typically take a while
before support is fully in-place.

Once the GPU works then this will be a perfect platform for testing Adreno with the Turnip Vulkan driver and Freedreno. At that point we will be
shipping out these laptops to all of our devs so we have a good Vulkan development platform.

Tweet from @FEX_Emu

FEX-2210

Although most of our developers were at XDC, there is no shortage of code that was merged this last month.

IR changes preparing for AVX emulation

This last month had at least 32 commits preparing our JITs for emulating AVX. While AVX isn't yet wired up, this is still a required step before it is supported. We are still requiring ARM SVE hardware that is shipping with 256-bit wide registers. This means the current consumer CPUs and just announced Neoverse-V2 won't work for our emulation here! This is future-proofing work since more games are requiring AVX to run but we'll just need to live with the problem that we will need new CPUs for the latest AAA games to run under FEX.

Support clang for thunks

We added support for building our thunks with clang this release. In particular the Ubuntu PPA is shipping this already. This might give a very minor perf increase but the main thing is removing a hard dependency on GCC.

Add uninstall cmake target

While it is generally advised not to install directly from a source build, users tend to still do this.
It was asked multiple times to have an uninstall target so we finally added this convenience feature.

32-bit VDSO thunking support

This is FEX-Emu's first 32-bit thunk library! This exercises most of the thunking framework to bring this feature to 32-bit, without some of the harder parts that require data repacking. Now that this is proving that our 32-bit thunking is working, it is likely that we will start working towards getting the rest of the thunks supporting 32-bit as well!

IR cleanups

While this isn't directly user facing, this makes the JIT IR a bit easier to handle, making the devs' lives easier. We've removed redundant operations that aren't necessary.

Add support for vixl simulator in CI

While we are waiting for SVE-256bit hardware to get on the market, we need CI to prove that our implementation is correct. We have once again added the vixl simulator to our source tree.
The vixl simulator supports emulating the SVE instructions at whichever register width you want. While stacking emulators isn't good for performance, it is good for ensuring correct behaviour.
Sadly ARM's simulator doesn't emulate 100% of the operations correctly, so we have had to disable a few of our unit tests in this case, but it works well enough that it can pick up major mistakes.

CI functional testing

We have added functional testing of some of our thunks in our CI system. Specifically we are testing our OpenGL and Vulkan thunks to ensure they don't break. Since this is the beginning of functional testing, we currently only run vulkaninfo and glxinfo.
Soon we will be expanding this functional testing to encompass more features which will likely capture even more problems if they come up.

Map ELF files more like the kernel

The kernel has an interesting behaviour around how it maps ELF files in memory. It will always load the dynamic linker at around the highest address
it can. The primary ELF file will be loaded roughly in the middle of the address space with a bit of ASLR bias. We now emulate the same behaviour in
FEX to help with problems when running WINE. While not all the issues are sorted out, this is a good step towards making it more stable.

Fix LLVM ASAN

We had an issue with our ELF loading where LLVM ASAN was breaking due to mixing multiple mmaps in the same space. Simple bug with a simple fix. ASAN
all the things!

SMC deadlock fix

There was a fix to prevent a potential deadlock in our Self-Modifying-Code detection routines. Thanks to the developer that found this!

Lots of misc fixes this month

It would be hard to list all of the misc other fixes that happened this month. Find out more in our raw release notes!

Raw Changes

  • Arm64

  • Fixes SVE VectorImm (ad85268)

  • Centralize location for register defines (169cfbb)

  • VectorOps

  • Make use of static predicate registers (0fee355)

  • CI

  • Fixes struct verifier on Ubuntu 20.04 (f97a4af)

  • Adds support for flakes (96fecfd)

  • CMake

  • Add toolchain file for 32-bit cross-compiler (1ed3ecb)

  • Extend AArch64 check to include arm64 (a583ebe)

  • Docs

  • Update Release docs (c987e1e)

  • ELFCodeLoader

  • Map primary ELF more like the kernel (b44b340)

  • Fixes dynamic non-interpreter ELFs (edca528)

  • Map interpreter first (71f7ff5)

  • ELFCodeloader

  • Map once and then use MAP_FIXED to overwrite (d68b84b)

  • FEXConfig

  • Ensure APP_CONFIG_NAME isn't stored in json (8d69f53)

  • FEXLinuxTests

  • Adds missing pthread_cancel flake status (c262362)

  • Migrate to Catch2 (8f70137)

  • Build 32-bit and 64-bit test variants separately (3448c83)

  • Use the build system instead of setting up compile flags via source-code annotations (d213869)

  • FEXServer

  • Fix waiting on kernel version older than 5.3 (8f9d799)

  • FHU

  • Convert to a interface target (dee85f1)

  • IR

  • Handle 256-bit VSMul/VUMul (23dd056)

  • Handle 256-bit VRev64 (c412d07)

  • Handle 256-bit VShlI/VUShlI/VUShrI (c2b6aef)

  • Handle 256-bit VSShrS/VUShlS/VUShrS (51214d1)

  • Handle 256-bit VFCMPORD/VFCMPUNO (7b4b9a8)

  • Handle 256-bit VFCMPLT/VFCMPGT/VFCMPLE (25a8a00)

  • Handle 256-bit VFCMPEQ/VFCMPNEQ (a67f742)

  • Handle 256-bit VCMPGT/VCMPGTZ/VCMPLTZ (ed8150c)

  • Handle 256-bit VCMPEQ/VCMPEQZ (462a163)

  • Handle 256-bit VBSL (6374175)

  • Handle 256-bit VSMax/VUMax (8d8b029)

  • Handle 256-bit VSMin/VUMin (aa6a499)

  • Handle 256-bit VNot (64c4fdc)

  • Handle 256-bit VFNeg (d715ffb)

  • Handle 256-bit VNeg (1799d4c)

  • Handle 256-bit VFRSqrt (dacd96c)

  • Handle 256-bit VFSqrt (ca4d3bf)

  • Handle 256-bit VFRecp (ea38b04)

  • Handle 256-bit VFMax (a39746d)

  • Handle 256-bit VFMin (2367a8e)

  • Handle 256-bit VAddP (cb121d7)

  • Handle 256-bit VFDiv (50eba40)

  • Handle 256-bit VFMul (4472265)

  • Handle 256-bit VFSub (3f8b872)

  • Handle 256-bit VFAddP (e573ddc)

  • Handle 256-bit VFAdd (eedbde6)

  • Handle 256-bit VPopcount (4e441e5)

  • Handle 256-bit VAbs (3e287a3)

  • Removes Mov IR op (46bde40)

  • Removes VExtractElement (01beac4)

  • Removes unnecessary VBitcast IR op (fcd981e)

  • Removes SplatVector{2,4} (2b9cc96)

  • Removes VInsScalarElement (82eba22)

  • Interpreter

  • Handle 256-bit VSShr/VUShl/VUShr (4d6e15d)

  • Use constant for AVX register size where applicable (808e1c0)

  • Handle 256-bit VMov (412793c)

  • Handle 256-bit VAnd/VBic/VOr/VXor (b5cb429)

  • JITs

  • Handle spilling/filling 256-bit vectors (6742e0c)

  • Expand max spill slot size to 32 bytes (0d0d116)

  • SMC

  • Fix possible deadlock (8da9ebc)

  • Scripts

  • Updates DefinitionExtract (3977e1f)

  • StructVerifier

  • Fixes CI failure (d4b5bf0)

  • ThunkLibs

  • X11/Xext: Removes two functions that don't exist on 32-bit (cc4c705)

  • Thunks

  • Add support for building with clang (2b1ef97)

  • Adds dependency on linker script (eaddf7f)

  • Implement the Thunk IR op for 32-bit mode (1ea00f6)

  • Adds functional thunk testing to CI (a590977)

  • Host

  • Adds bool operator to fex_guest_function_ptr (3237de3)

  • gen

  • Use fmt for writing formatted output (704afed)

  • libvulkan

  • Fixes print for 32-bit (d8c2a82)

  • VDSO

  • Fix vsyscall (5cf5940)

  • VectorOps

  • Handle 256-bit VURAvg (977d6dd)

  • Handle 256-bit VUMinV (0261ed3)

  • Extend VSQAdd/VSQSub/VUQAdd/VUQSub (f34f130)

  • Extend VAdd/VSub (0ad52b7)

  • Misc

  • Add opencl thunk db (b693112)

  • 32-bit VDSO support (6f6f3c9)

  • Update vixl external (832a320)

  • Move thunk generator logic from A...

FEX-2209

05 Sep 20:12

Read the blog post at FEX-Emu's Site!

A lot of miscellaneous work this month that isn't directly user facing. We do still have some interesting topics this month that some people will be
interested in.

Simplify StealMemory functions

A fairly significant change this month is reducing the time it takes FEX to set up its memory upon load. FEX needs to do an initial setup of the memory when an application loads
because between x86-64, x86, and AArch64 the memory layouts are significantly different.

Depending on the architecture of the application, FEX needs to allocate a large amount of memory to emulate the x86/x86-64 memory behaviour.

On 32-bit x86

  • We need to allocate all memory above 32-bit memory space
    • This is because we emulate 32-bit applications as a 64-bit AArch64 application

On 64-bit x86-64

  • We need to allocate all memory in the 48-bit virtual address space
    • This is because AArch64 supports the full 48-bit space for the user
    • x86-64 userspace only receives 47-bit
    • Applications rely on not receiving 48-bit pointers!
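
The two cases above come down to simple address arithmetic, sketched below. The function names are illustrative and the 48-bit host default is taken from the text; this only computes which range has to be made unavailable to the guest, not how FEX actually reserves it.

```python
def reserved_range(guest_bits: int, host_bits: int = 48) -> tuple:
    """Return the half-open [start, end) range the guest must never be handed."""
    if guest_bits > host_bits:
        raise ValueError("guest address space is larger than the host's")
    return (1 << guest_bits, 1 << host_bits)


def is_guest_pointer(ptr: int, guest_bits: int) -> bool:
    # A pointer is usable by the guest only if it fits in its address width.
    return 0 <= ptr < (1 << guest_bits)


x86_64_hole = reserved_range(47)  # upper half of the 48-bit space
x86_hole = reserved_range(32)     # everything above 4 GiB
```

For an x86-64 guest this is a single 128 TiB region starting at `0x8000_0000_0000`; for a 32-bit guest it is nearly the entire host address space, which is why how efficiently the reservation is done matters so much.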

From this graph showing the amount of CPU time spent in each routine, we can see a significant reduction in time to execute.
For 32-bit and 64-bit specific operations this results in a ~70x and ~181x reduction in execution time!

How well does this improve execution time in practice though?

This graph is showing the total time it takes to run applications fully through. The smallest test applications have shaved off around 75% - 85% of their execution time. The biggest improvement
comes from Proton setting up its execution environment. Proton's underlying execution environment is called pressure-vessel which executes hundreds
of background applications while setting up. This is one of the worst cases for FEX since each independent application execution needs to JIT new code
and handle all of its state setup. This case reduces the execution time from around 21 seconds down to around 17 seconds! This can really
be felt when executing back-to-back Proton instances when testing games!

While this is a significant step in the right direction, FEX still has a ways to go to hit the native execution time of pressure-vessel which can take
as little as one second.

More AVX work

A bunch more work has gone in to supporting AVX emulation. This is still preliminary backend work for now.

  • HostRunner

    • Handle upper YMM lanes in sigsegv handler
  • InterpreterOps

    • Extend SSAData size to accommodate 256-bit operations
  • VectorOps

    • Extend VAnd/VBic/VOr/VXor
    • Extend VMov
    • Extend VectorImm
    • Extend VectorZero

Thunks

X11

Some fairly minor changes here that improve usability of thunks with Proton. We added more Xlibint functions to the thunks which fixes X11 thunking
with DXVK. X11 is required for both Vulkan and OpenGL thunking so having this working is necessary when running those games.

Another necessary change for supporting thunks with Wine/Proton is more aggressively supporting X11 functions which require variadic arguments. There
are quite a few of these functions sprinkled around that require this. While we supported these functions with open-coded support up to 7 arguments,
we need to support at least up to 14 arbitrary arguments in some instances. We now have some assembly code in place which can support an arbitrary
number of arguments by packing these in memory the expected way. While this only works for 64-bit integers, it's all that we need for X11.
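
The general shape of packing an arbitrary number of 64-bit integer arguments into memory can be sketched like this. FEX does this in assembly and the exact layout it uses is not described here, so the contiguous little-endian layout and both function names below are assumptions made for illustration.

```python
import struct

U64_MASK = 0xFFFF_FFFF_FFFF_FFFF


def pack_variadic_args(*args: int) -> bytes:
    """Store each argument as one 64-bit little-endian slot, in call order."""
    # Masking gives negative values their two's-complement representation.
    words = [arg & U64_MASK for arg in args]
    return struct.pack(f"<{len(words)}Q", *words)


def unpack_variadic_args(buf: bytes) -> list:
    """Read the slots back on the other side of the boundary."""
    count = len(buf) // 8
    return list(struct.unpack(f"<{count}Q", buf))


packed = pack_variadic_args(*range(14))
roundtrip = unpack_variadic_args(packed)
```

Because every slot has the same size, the receiving side needs only the argument count to walk the buffer, which is what makes the scheme work for any number of arguments rather than a fixed maximum.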

With both of these features implemented both OpenGL and Vulkan thunking works with Proton.

VDSO

While this is implemented as a thunk on the FEX side, it behaves slightly differently than normal thunks. This will always be enabled as long as FEX
can load the VDSO-host.so library installed on the system. Due to the nature of VDSO, all applications always have a VDSO region provided by the
kernel at all times. FEX wants to provide fast emulation of this "library" since applications abuse it heavily for performance. This was noticed when
running Proton games, which abuse clock_gettime very heavily, causing significant CPU overhead. Applications were calling this VDSO
syscall hundreds or thousands of times a second. This now significantly lowers the amount of time spent in the kernel for timing functions.

getdents syscall emulation

AArch64 doesn't support this syscall but in most cases applications don't use it. This is because there is a much more modern syscall called
getdents64 that everything uses now. When running older compiled applications they are likely to use the classic syscall. Since AArch64 doesn't have
the classic version, we now emulate it entirely using getdents64, which fixes running applications from CentOS 7.
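
The core of such an emulation is converting records from the `linux_dirent64` layout into the classic `linux_dirent` layout. The sketch below does this for the 64-bit guest layout only and is not FEX's code; the helper names are illustrative, and a real implementation also has to deal with the caller's buffer size and with the different field widths of a 32-bit guest.

```python
import struct

# linux_dirent64: d_ino (u64), d_off (s64), d_reclen (u16), d_type (u8), d_name[]
DIRENT64_HEADER = struct.Struct("<QqHB")
# classic linux_dirent (64-bit): d_ino, d_off (unsigned long), d_reclen (u16),
# d_name[], padding, and d_type stored in the very last byte of the record.
DIRENT_HEADER = struct.Struct("<QQH")


def align_up(value: int, alignment: int) -> int:
    return (value + alignment - 1) & ~(alignment - 1)


def make_dirent64(ino: int, off: int, d_type: int, name: bytes) -> bytes:
    """Build one linux_dirent64 record (used here only to create sample input)."""
    reclen = align_up(DIRENT64_HEADER.size + len(name) + 1, 8)
    record = bytearray(reclen)
    DIRENT64_HEADER.pack_into(record, 0, ino, off, reclen, d_type)
    record[DIRENT64_HEADER.size:DIRENT64_HEADER.size + len(name)] = name
    return bytes(record)


def dirent64_to_dirent(buf: bytes) -> bytes:
    """Rewrite a buffer of linux_dirent64 records as classic linux_dirent records."""
    out = bytearray()
    offset = 0
    while offset < len(buf):
        d_ino, d_off, d_reclen, d_type = DIRENT64_HEADER.unpack_from(buf, offset)
        if d_reclen == 0:
            raise ValueError("corrupt record with zero length")
        name_start = offset + DIRENT64_HEADER.size
        name = buf[name_start:offset + d_reclen].split(b"\0", 1)[0]
        # Header + name + NUL terminator + trailing d_type byte, long-aligned.
        reclen = align_up(DIRENT_HEADER.size + len(name) + 2, 8)
        record = bytearray(reclen)
        DIRENT_HEADER.pack_into(record, 0, d_ino, d_off & 0xFFFF_FFFF_FFFF_FFFF, reclen)
        record[DIRENT_HEADER.size:DIRENT_HEADER.size + len(name)] = name
        record[-1] = d_type
        out += record
        offset += d_reclen
    return bytes(out)


source = make_dirent64(1, 1, 4, b".") + make_dirent64(42, 2, 8, b"hello")
converted = dirent64_to_dirent(source)
```

The main difference between the two layouts is where `d_type` lives: directly after `d_reclen` in the modern record, and in the final byte of the record in the classic one.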

Misc

  • Fix compiling without jemalloc
    • Thunks are unsupported without jemalloc but we need to keep it compiling
  • Consolidate generated files to one file per platform
    • Nice code cleanup for developers
  • Minor cleanups for signature-based function pointer thunking
  • Support direct thunk config in configuration files
    • This improves the user experience with enabling thunks for application configurations
    • No need for two files to describe one thing now

Raw Changes

  • 64BitAllocator

  • Fixes a significant state tracking perf problem (123b672)

  • Allocator

  • Simplify StealMemory, make it less chatty with kernel space (04678f8)

  • Arm64

  • JIT

  • Rename CanUseSVE to HostSupportsSVE (7d8950d)

  • CI

  • Build Thunks (e544591)

  • FDUtils

  • Don't make unknown get_fdpath fatal (336dedb)

  • FEXRootFSFetcher

  • Fix crash if curl fails to download rootfs definition file (31fefaa)

  • FEXServer

  • Support socket path override (a2f4f49)

  • Github

  • Fix fresh runner rootfs checkout (d619968)

  • HostRunner

  • Handle upper YMM lanes in sigsegv handler (d5c83a2)

  • IRLoader

  • TestHarnessLoader

  • Don't build if not building tests (097184c)

  • InterpreterOps

  • Extend SSAData size to accomodate 256-bit operations (98dbfbe)

  • Linux

  • Emulate classic getdents syscall for x64 and x32 (9de25c2)

  • Syscalls

  • Use underscored shm syscall names (bbcca80)

  • Termux

  • Add android-shmem library (0adbe31)

  • Thunks

  • Consolidate all generated code to one file per library per platform (c17da25)

  • Adds VDSO thunk library (53623ff)

  • Minor cleanups for signature-based function pointer thunking (e6acdcc)

  • Support direct thunk config in configuration files (84a95ad)

  • Fix compile without jemalloc (d5138f5)

  • X11

  • Support Variadic stack packing (fbb008e)

  • Adds missing XLibint functions (998a3d8)

  • VectorOps

  • Extend VAnd/VBic/VOr/VXor (e776f4c)

  • Extend VMov (e7d7dd1)

  • Extend VectorImm (8439cf4)

  • Extend VectorZero (d03b6a9)

  • Misc

  • New domain. (7f9edbf)

  • x86_64/JIT: Resolve lingering fmt deprecation warning (37ccb13)

  • cmake

  • fix incorrect assumption about the value of git's core.abbrev (c03a7fd)

  • unittests

  • Support skipping unit tests based on host feature support (1fe6fc3)

  • ThunkLibs

  • Fix warning about "dangerous" use of tmpnam (12fee91)

FEX-2208

10 Aug 16:45

Read the blog post at FEX-Emu's Site!

Some really exciting changes this month. Thunk stabilization out of the gate is a huge boon to tinkerers and a bunch of other things spread
throughout!

Thunk improvements

The amount of work to reach this point can't be overstated. @neobrain has been putting forth a bunch of infrastructure
work over the past few months which hasn't been super visible to the end user. This month it culminated in fixing a bunch of stability problems
with thunks.

One of the biggest problems ends up being when a pointer is passed between the guest and host through thunks. X11's XFree function is used for both
x86 specific pointers and AArch64 pointers. This ends up being a problem since FEX-Emu uses jemalloc for its allocator while a guest application is
highly likely to use glibc's allocator. Passing a pointer to the allocator that doesn't own it will crash. We now distinguish between a "Guest" pointer and a
"Host" pointer in our thunk of XFree, which significantly improves stability of X11 thunks.

We also now have our libGL thunk implicitly load libX11. Due to how FEX's thunks work, we don't pull in all of the library dependencies of a "real"
library like libGL. Long term FEX-Emu will likely want to link to the same libraries that the real x86-64 library would have linked to. For now, libGL
relies on libX11 directly. This fixes an issue where libGL thunks wouldn't work at all for any game launched from Steam.

Then there is the big pull request that does an absolute wackload of infrastructure work to make things work, Implement signature-based thunking of function pointers. This pull request is a bit complex to explain, but it allows function pointers to be marshalled
across the thunk architecture boundary safely, allowing x86-64 and AArch64 code to call in to each other whenever necessary. I
would recommend reading the pull request itself because the information is quite dense there.

Then there are some very minor changes that fix some edge case behaviour.

  • Make glXGetProcAddr of unknown address non-fatal
  • Make glXGetProcAddr querying itself work
  • Add some missing glX and GL functions.
    • Nearly everything supported now, just some minor things missing

The takeaway from this effort is that OpenGL thunking is now significantly more stable. We don't have a lot of games tested yet, but follow along at
our wiki for documenting which games support thunks under FEX-Emu.

When OpenGL or Vulkan thunks are enabled, games are dramatically sped up. It is the recommended way to play a game but we need more testing coverage
to ensure it is stable enough to use.

Fix edge-case instruction faulting behaviour

x86 has six instructions that explicitly fault in different ways. We weren't handling RIP setting correctly on some of these.

This was uncovered by running Elden Ring inside of FEX. With this bug fixed FEX-Emu can now run the game if Denuvo is disabled by renaming the
executables. Sadly it doesn't seem like Snapdragon can run this game yet.

AVX initial implementation details

@Lioncash has been working away at making this feature a reality. Newer games coming out are starting to require AVX
to run and FEX needs to support this. This is a necessary feature since we need to claim compatibility for any CPU feature that
games use. So far the list of games we have found that require the feature is short, but it is going to become more common. The latest generation of game
consoles will drive this feature over the next decade of releases.

This sets up the initial groundwork inside of FEX-Emu to support the feature, with instruction implementations coming in the next months.

Don't expect this to run on any of your ARM devices any time soon though. We are going to require SVE with 256-bit register width to expose it. All
current hardware either doesn't support SVE at all, or only supports a 128-bit register width. Look forward to future hardware that ships with this
feature.

FEXConfig quality of life improvements

This is cleaning up some of the rough edges inside of FEXConfig, making it easier to modify your config. Significantly fewer keypresses to open a
config! Perfect!

Fix SOMA again

Due to how FEX changed some of its threading logic, we broke SOMA abusing the SETXID signal. This is now resolved and the game runs under FEX-Emu
again.

Fix Static Register Allocation in signal handlers

Applications relying on signals would very likely have crashed for a while due to this regression. With this resolved, these games are stable again.

FEXRootFSFetcher automation options

FEXRootFSFetcher now has command line options for automatically choosing a distro image and setting it up in config. This is useful for containers
embedding FEX and wanting a fresh rootfs instead of shipping it. It also does a runtime check for any image tooling programs before executing, fixing
a spurious error message.

Fix hang in Proton from close_range syscall

Proton would sometimes hang due to a close_range syscall trying to close all file descriptors. FEX is now a bit smarter about closing FDs with this
syscall, which fixes the bug.

FEXBash change PS1 description

When running FEXBash it can be confusing for new users how it is behaving. Now we show the current operating path, user, and FEXBash as a prefix.
Hopefully this makes it clearer to new users that FEXBash isn't a VM or a Docker-style chroot.

Support Linux 5.19 passthrough

Not much changed from FEX's perspective here. Some minor DRM changes that naturally work themselves out.

Only initialize perf map file if profiling is enabled

Sorry about filling up your /tmp folder with empty files. This has been resolved.

Add pidfd_open syscall wrapper for compiling on older Linux distributions

Thanks to @wannacu for this fix. Our testing of older Linux distributions is quite spotty since FEX-Emu only officially
supports back to glibc 2.31 distros. This adds a wrapper utility for this syscall so older versions of Linux can compile FEX. Your mileage may vary
since FEX-Emu doesn't officially support it, though.

Developer specific improvements

Remove static-pie

Static pie will never work due to glibc limitations around dlopen that FEX needs for thunks. Remove the option entirely to ensure no one tries to use
it.

FMT updated to 9.0.0

Newest release is best release

Bitness of syscall handler improvements

We were setting up both 32-bit and 64-bit syscall handlers upon initialization. Now we only initialize one or the other. This saves a bit of memory and
shaves some microseconds off startup time.

Vulkan-Headers included in External

It's no longer required as an install dependency on the host. We need to handle all Vulkan signatures, not just what is available on the host.
Makes compiling Vulkan thunks slightly less painful.

Misc

  • Cortex-X1C supported in CPUID
  • Support a globally installed Config file
  • Support installing many json files from FEX Data folder
  • Disable UnitTestGenerator since it is unused
  • Assume optimizing LogManager assertion functions
  • Support executable names being picked up for wineserver.
    • Useful to see if a Windows game is doing something that our telemetry picks up
  • Removing last usage of raw IR arguments in emulation backends
    • x86_64: Migrate args over to named IR arguments
    • json_ir_generator: Remove Args() functions from IR structs
    • These make it a lot easier to see what arguments are being used and why

Raw Changes

  • AppConfig

  • Fix bug with filename (ac23bce)

  • Arm64

  • JIT

  • Remove unnecessary [[maybe_unused]] attributes (89aa590)

  • Arm64Dispatcher

  • Amend memcpy in SpillSRA (f5e18cc)

  • Arm64Emitter

  • Re-add use of stp/ldp with hosts that don't support SVE2 (8589119)

  • CMake

  • Support multiple json files in the root of Data/ (bd296d7)

  • CPUID

  • Detect Cortex-X1C (c9f0ecb)

  • Config

  • Support a global configuration file (3a64ea1)

  • Dispatcher

  • Fix SRA enabled check in signal delegator handlers (0dfe617)

  • Externals

  • Update fmt to 9.0.0 (54f62b6)

  • FEXBash

  • Changes PS1 to hopefully help users (a72ebfd)

  • FEXConfig

  • Some quality of life improvements (164299c)

  • FEXCore

  • Fix-up edge case behaviour on faulting instructions (8037231)

  • Adds assume optimizing LogManager function (3d347ed)

  • Support synchronizing RIP on block entry through config (80909ea)

  • FEXRootFSFetcher

  • Adds runtime checks for image mounting tools (cfd59db)

  • Actually wire up -a -x (3aabe077...

FEX-2207

07 Jul 18:26

Read the blog post at FEX-Emu's Site!

This is going to be a very interesting release this month for users. Quite a large number of features landed for this release!

Automatic TSO mode migration

When FEX is running a single threaded application, we can be optimistic and disable heavy TSO-emulation related features. This significantly speeds up some single threaded applications. Once the program creates a thread, FEX will disable this optimization and clear its code cache to be safe.

EroFS rootfs image support

While FEX has supported SquashFS for a long time, we are now adding support for EroFS as well. The big advantage of EroFS is that it doesn't serialize accesses to a single thread. When you have dozens of threads accessing the filesystem, this is a real bottleneck. Low end devices would end up having a single CPU core maxed out inside of the squashfuse application while multiple threads are trying to request data.

erofsfuse solves this by allowing multi-threaded decompression that scales quite well depending on the number of file requests in flight. We can see how this scales in the following benchmark graphs.

As one can see, erofsfuse scales quite well with multiple threads, while squashfuse stays pretty much flat the entire time. The downside to EroFS is that the compression ratio of its LZ4HC compression isn't quite as good as ZSTD, causing the rootfs to be larger. But the reduction in memory usage and read amplification, plus the higher bandwidth, is worth it. It seriously improves the performance of using a rootfs over a network mapped share like some people do.

An additional problem is that the erofsfuse application requires erofs-utils version 1.5, which came out on 2022-06-13. This is really bleeding edge currently.

Nevertheless, FEXRootFSFetcher will now allow you to download a FEX Rootfs image with this compression format. Just ensure you have erofsfuse installed.

FEXServer

This is a fairly significant change to how FEX-Emu operates in the background. Similar to how wine has a wineserver, FEX is now requiring a FEXServer
to always be running.

For now the FEXServer is taking over duty for rootfs image mounting and acting as a logging server. In the future this will be expanded to also handle code
caching services and more. FEXServer will automatically start on invocation of FEX and keep running in the background until all instances of FEX close.

Pressure-vessel and Proton Fixes

FEX-Emu now officially works inside of pressure-vessel. This is the tool that Steam uses for running Proton games. Thunking doesn't yet work in this case but it is coming.

If you're wanting to test proton games, make sure to sign up to the latest SteamLinuxRuntime_soldier beta in the settings and give it a go. It's not currently the speediest, but it should work.

Disable FEXServer rootfs when running under pressure-vessel

Pressure-vessel sets up an x86-64 rootfs. FEX shouldn't be using the FEXServer provided rootfs in this case.
We now detect when running inside of pressure-vessel, and disable the FEXServer RootFS.

Enable Hypervisor bit

This change allows pressure-vessel to detect FEX-Emu and do FEX specific setup for games.
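
For context, the "hypervisor present" flag is bit 31 of ECX in CPUID leaf 1. The following is only a minimal sketch of setting and testing that bit. The helper names are hypothetical, and how pressure-vessel actually performs its detection is not described here.

```cpp
#include <cstdint>

// CPUID leaf 1 reports the "hypervisor present" flag in bit 31 of ECX.
constexpr uint32_t kHypervisorBit = 1u << 31;

// Hypothetical emulator-side helper: return the ECX value with the flag set.
constexpr uint32_t WithHypervisorBit(uint32_t cpuid1_ecx) {
  return cpuid1_ecx | kHypervisorBit;
}

// Hypothetical guest-side check: true when the flag is set in the ECX value.
constexpr bool HypervisorPresent(uint32_t cpuid1_ecx) {
  return (cpuid1_ecx & kHypervisorBit) != 0;
}
```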

Fix open syscall path emulation

The open syscall is fairly rarely used so this has gone unnoticed for a while. We weren't wrapping this syscall in our filesystem emulation, which was preventing applications from running. With this fixed, the latest Proton Experimental branch from Steam now works!

Support thunks in pressure-vessel

Pressure-vessel uses a bunch of environment variable overriding to replace where libraries are inside of its chroot. Support this inside of FEX. While this is a step to getting Thunks working inside of pressure vessel, it is not yet supported.

Thunks

Lots of improvements to thunks, it's hard to capture them all. There is a heavy amount of infrastructure work going on in here to make thunks more robust and stable, starting with Vulkan and GL.

  • Work around lack of generic callback support in VK_EXT_debug_report (4771a34)
    • Disable debug report callback (751b66d)
  • Allow building thunks on a wider range of platforms (ad6fd5a)
  • Add fex:is_lib_loaded (88b94be)
  • Support returning host function pointers to the guest (04a1ac9)

Fix clone3 syscall's stack pointer again

In an edge case of how FEX-Emu handles clone3, it wasn't handling the stack pointer size correctly again.
Resolving this edge case once again gets Steam's web helper working with glibc 2.34.
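
As background, clone3 describes the child's stack as a base address plus a size, whereas the older clone syscall takes a pointer to the top of the stack. The sketch below shows how the initial stack pointer follows from those two fields on a downward-growing stack. The struct and function names are hypothetical stand-ins, and this is not FEX's actual handling of the edge case.

```cpp
#include <cstdint>

// Hypothetical stand-in for the two stack-related fields of the kernel's
// struct clone_args: the lowest address of the stack area and its size.
struct CloneStackArgs {
  uint64_t stack;
  uint64_t stack_size;
};

// On a downward-growing stack, the child starts at the top of the area.
// Returns 0 when no new stack was requested (stack == 0).
uint64_t InitialStackPointer(const CloneStackArgs& args) {
  if (args.stack == 0) {
    return 0;
  }
  return args.stack + args.stack_size;
}
```

Forgetting to add the size would start the child at the bottom of its stack area, so the first push would write below it.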

Fix 32-bit memory allocation range scanning

When scanning for free chunks of memory in the 32-bit range, FEX-Emu needs to use a custom allocator to ensure everything returned ends up in the lower 32-bit memory space. This fixes a bug where large allocations would never find an empty space. Fixes X-Plane 11!
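
To illustrate what scanning for a free chunk involves, here is a minimal first-fit sketch over a page map. The `used` vector and the function name are assumptions made for the example and do not reflect FEX's actual allocator data structures.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical page map: used[i] is true when page i of the 32-bit
// address space is already mapped.
// Returns the index of the first run of `pages_needed` free pages, if any.
std::optional<size_t> FindFreeRun(const std::vector<bool>& used, size_t pages_needed) {
  if (pages_needed == 0 || pages_needed > used.size()) {
    return std::nullopt;
  }
  size_t run_start = 0;
  size_t run_len = 0;
  for (size_t i = 0; i < used.size(); ++i) {
    if (used[i]) {
      run_len = 0;  // A mapped page breaks the current free run.
      continue;
    }
    if (run_len == 0) {
      run_start = i;  // A new free run begins here.
    }
    if (++run_len == pages_needed) {
      return run_start;
    }
  }
  return std::nullopt;
}
```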

Optimize file descriptor to filename mapping

It is a common occurrence that FEX needs to map an open file descriptor back to a file path. This used to take 14 system calls.
Since each system call was querying filesystem metadata these could take some time. With this optimized approach it now takes only one system call instead, significantly lowering file IO overhead!
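
One way to resolve a file descriptor with a single system call on Linux is to read the symlink at /proc/self/fd/<fd>. The sketch below assumes that approach. The text above does not say which call FEX uses, and `PathFromFD` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <limits.h>  // PATH_MAX
#include <string>
#include <unistd.h>  // readlink

// Hypothetical helper: resolve an open file descriptor to the path the
// kernel reports for it, using one readlink call.
// Returns an empty string on failure.
std::string PathFromFD(int fd) {
  assert(fd >= 0);
  const std::string link = "/proc/self/fd/" + std::to_string(fd);
  char buffer[PATH_MAX];
  // readlink does not null-terminate, so rely on the returned length.
  const ssize_t length = readlink(link.c_str(), buffer, sizeof(buffer));
  if (length < 0) {
    return {};
  }
  return std::string(buffer, static_cast<size_t>(length));
}
```

Descriptors that are not regular files resolve to kernel-generated names such as `pipe:[...]`, so callers need to treat the result as a hint rather than a guaranteed filesystem path.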

Enable Wine application profiles

Wine applications when they are executing typically only showed up as wine or wine-preloader to FEX-Emu.
Now we work around this issue by scanning the arguments to find the executable name, which allows application profiles to function.
Now we can easily support SonicMania.exe.json!
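
The argument scan can be sketched roughly as follows. This is a simplified illustration with a hypothetical function name and a case-sensitive ".exe" match, not FEX's actual matching rules.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: pick a profile name from a command line.
// Returns the file name of the first argument after args[0] that ends in
// ".exe", falling back to the file name of args[0] itself.
std::string ProfileNameFromArgs(const std::vector<std::string>& args) {
  auto file_name = [](std::string_view path) {
    // Accept both Unix and Windows path separators.
    const size_t slash = path.find_last_of("/\\");
    if (slash != std::string_view::npos) {
      path.remove_prefix(slash + 1);
    }
    return std::string(path);
  };

  constexpr std::string_view kSuffix = ".exe";
  for (size_t i = 1; i < args.size(); ++i) {
    const std::string_view arg = args[i];
    if (arg.size() > kSuffix.size() &&
        arg.substr(arg.size() - kSuffix.size()) == kSuffix) {
      return file_name(arg);
    }
  }
  return args.empty() ? std::string{} : file_name(args[0]);
}
```

Appending ".json" to the returned name gives a profile file name of the form shown above, such as SonicMania.exe.json.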

FEXRootFSFetcher fix to file hashing

It was discovered that this tool was hashing files incorrectly. The new version now hashes correctly and image files have been updated to use the new hash. Nothing to see here.

Fix 32-bit DRM ioctl DRM_IOCTL_WAIT_VBLANK

This ioctl does exactly what it says on the tin. Due to a copy and paste error, this wasn't actually waiting on vblank.

Fix 32-bit ioctl structure copying

A feature of the DRM subsystem allows you to extend ioctl struct definitions safely. The kernel knows the size of the ioctl structure and if it
differs from what the userspace application passes in, then it will only copy the smaller amount of data and zero out the rest.
This allows older userspaces to safely work with newer kernels. FEX wasn't reproducing this with its ioctl emulation in some V3D ioctls, resulting in unsafe execution of ioctls. This has been resolved.
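
The copy rule described above (copy the smaller size, zero the remainder) can be sketched as below. The function name and signature are hypothetical and this is not the kernel's or FEX's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copy an ioctl argument between two versions of a
// struct that may differ in size. Only the common prefix is copied and any
// remaining destination bytes are zeroed.
void CopyIoctlStruct(void* dst, size_t dst_size, const void* src, size_t src_size) {
  assert(dst != nullptr && src != nullptr);
  const size_t common = std::min(dst_size, src_size);
  std::memcpy(dst, src, common);
  if (dst_size > common) {
    // Fields the source does not know about read as zero.
    std::memset(static_cast<unsigned char*>(dst) + common, 0, dst_size - common);
  }
}
```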

Support CLMUL Extension

This instruction is heavily used to accelerate CRC and other hashing algorithms. This perfectly matches the AArch64 instruction as well. So
implementing this was very straightforward!
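
For readers unfamiliar with the operation, a carry-less multiply is a polynomial multiplication over GF(2): partial products are combined with XOR instead of addition. The following portable reference sketch shows the 64x64 to 128-bit case. It is purely illustrative (the function name is made up) and is not how FEX implements the instruction.

```cpp
#include <cstdint>
#include <utility>

// Reference carry-less multiply of two 64-bit values.
// Returns the 128-bit product as {low 64 bits, high 64 bits}.
std::pair<uint64_t, uint64_t> CarrylessMul64(uint64_t a, uint64_t b) {
  uint64_t low = 0;
  uint64_t high = 0;
  for (int i = 0; i < 64; ++i) {
    if ((b >> i) & 1) {
      low ^= a << i;
      if (i != 0) {
        // Bits shifted out of the low half land in the high half.
        high ^= a >> (64 - i);
      }
    }
  }
  return {low, high};
}
```

For example, multiplying 3 by 3 gives 5 rather than 9, because (x + 1)(x + 1) = x^2 + 1 once the two middle terms cancel under XOR.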

Self-modifying-code frontend improvements

Allows FEX to track code pages inside our frontend decoding. This fixes some issues where code can be changed while we are decoding things in the frontend. Now FEX can detect this and throw away what it compiled.

Developer specific improvements

Check for binfmt_misc conflict before installing

To ensure building from source doesn't result in a broken configuration, cmake will now check for conflicting binfmt_misc files before installing.
How to uninstall the conflicting binfmt_misc files is specific to how the user has installed them, so it is left up to them to find out how.

Auto CI fetching

If the CI systems need an updated rootfs, the config can now be updated and they will fetch the latest.

unittests no longer recompile forever

ASM unittests would always reglob on building, which took time. This is now fixed.

Fix ASAN bug in how register allocation data was allocated

This was hard to track, finally this annoying bug that has gone back and forth a bit has been resolved!

ARM64 CPU feature detection for ASM unit tests

Automatically disables some incompatible unit tests on ARM64 devices that don't support some features. No more confusing failures.

GDB integration

This allows a plugin to be loaded in GDB to show more information than we would otherwise have, giving us both backtraces and source inside of GDB
even through the JIT. This should make debugging the JIT that much easier.

Raw Changes

  • AOTIR

  • Fix IRList delete (fb41ba1)

  • Fix RAData free (9242e59)

  • Arm64

  • EncryptionOps

  • Fix register specifiers in PCLMUL movs (63b70ff)

  • JIT

  • Use IR names in opcode implementations (19b0a9c)

  • Backends

  • Unified dispatch, interface rework, cleanups (072690a)

  • CI

  • Auto rootfs fetching (c027ace)

  • CMAKE

  • Create directories during configuration, fixes endless generation of unittests (e62bc24)

  • CMake

  • Check for binfmt_misc conflicts before install (6d2f98a)

  • CPUID

  • Enable the hypervisor bit (da8dbf1)

  • Common

  • Support application profiles for games launched through wine (3913dd6)

  • Config

  • Fixes AppConfig for wine-preloader (ae6a57e)

  • Context

  • Fix CreateThread partial initialization issue (eac579f)

  • Decouple from CodeLoader, introduce generic CustomIREntrypoints (https://git...

FEX-2206

04 Jun 22:46

Read the blog post at FEX-Emu's Site!

Quite a large number of changes this month since we cancelled last month's release.

Steam's webhelper working again

Steam started enabling the chromium sandbox. Seccomp isn't supported in FEX-Emu so it was crashing early on.
We now forcibly stop it from using the sandbox through an application profile.
This lets the game library be visible again, although it can take a while to appear.

Fix LRCPC and add support for LRCPC2

There was a bug in our CMPXCHG implementation that accidentally wasn't using ARM's acquire-release semantics.
Fixing this bug allowed us to reenable our TSO emulation using LRCPC.
Additionally we have added support for LRCPC2 which gives us some immediate encoded instructions to further reduce overhead.

On hardware that supports LRCPC these can result in a reasonable performance uplift.

SHA-1 and SHA-256 instructions implemented

These SHA instructions have been implemented and the CPUID bit is now exposed.
This is a GPR based implementation, an implementation using AArch64's equivalent SHA instructions will be implemented at a later time.

Self-modifying code support improvements

Many things have changed with supporting self-modifying code in a more extensive fashion.
FEX-Emu now tracks guest allocations of executable memory and clears the JIT caches when the code has been modified.
This happens for both true self-modifying code and also libraries being loaded.
Fault handling is employed to know when code is modified in memory to ensure we can track changes.
This is a new setting in FEXConfig called mtrack. The older syscall only tracking path is deprecated but still available for testing.

Option to emulate x87 with 64-bit float operations

Big shout out to CallumDev for implementing this long awaited feature.

A major performance problem of emulating x86 is that any older game will be compiled to use the x87 extension. This is especially true for 32-bit games.
The problem with this extension is that by default it uses 80-bit floats, which AArch64 doesn't support.
We end up emulating this entire extension using a soft-float implementation, which while being quite accurate, is obscenely slow.

This performance hack is now available to remove a significant amount of the overhead by operating x87 instructions using 64-bit float scalar
operations instead.
This is known to be inaccurate, but most Windows games will actually be configuring the x87 unit to be lower precision than 80-bit.
Additionally most games don't actually need the extra precision that 80-bit provides, so it is usually safe to emulate it more inaccurately.

This may still have some bugs, we know at least one game that has issues that aren't explained by pure precision problems. The feature can be enabled
in FEXConfig under the Hacks tab, look for "X87 Reduced Precision"

Clone3 syscall fixed

Starting with its 2.34 release, glibc uses the clone3 syscall for creating threads.
FEX's implementation was mostly untested, which resulted in all applications breaking.
Stack pointer behaviour was broken. With this fixed, glibc 2.34 works out of the box.

FEXRootFSFetcher don't try to continue download

FEX-Emu's CDN doesn't support continuing file downloads. Resuming is now disabled to avoid issues.

FEXCore: Reclaimable thread pool allocator

FEXCore now uses an intrusive pooling allocator to allow sleeping threads to give back memory to the pool.
This allows multiple threads to share a memory resource, reducing memory usage by a significant amount if an application has a bunch of sleeping
threads.

FEXBash: Set PS1 environment variable to show running under emulation

Once running FEXBash it can be hard to tell if you're running your bash terminal under emulation.
Setting PS1 to FEXBash> makes it easier to tell that the terminal is running under emulation.

FEXBash> uname -a

Linux ryanh-TR2 5.17.5 #FEX-2206 SMP Jun 4 2022 15:11:07 x86_64 x86_64 x86_64 GNU/Linux

OpcodeDispatcher: Fixes PEXTRB

Newer Unreal engine releases were generating a PEXTRB instruction that our frontend decoder was decoding incorrectly.
Typically this would result in a crash.
This fixes both Dirt 4 and Psychonauts 2.

Misc

  • CMake
    • Add support for mold
    • Add flag for defined signed overflow handling
  • Arm64: Optimize constant generation with ADRP+ADR
  • EmulatedFiles: Fixes temporary file generation flags
  • Struct Verifier: Fixes some bugs with DRM headers not getting picked up
  • Linux v5.17 and v5.18 support
  • JIT: Code relocation support
  • OpcodeDispatcher:
    • Adds support for non-temporal loadstores
    • Implements support for PAUSE instruction
  • Syscalls:
    • 32-bit mmap syscalls fixes
      • Has been broken since the start, most applications use mmap2 instead
      • Fixes Kega Fusion
  • CompileService: Removed since it is no longer required
    • We no longer try to compile in a reentrant safe fashion
  • JITSymbols: Cleaner printing of RIP relative to a file
  • Standard TODO markers for code searching
  • Some 32-bit FS/GS writing fixes
    • Not really used so didn't affect anything

Raw Changes

FEX Release FEX-2206

  • AOTIR

  • copy RAData and IRList, make sure data is accessible (da2e44d)

  • AppConfig

  • Inject --no-sandbox in to steamwebhelper (c14c0c2)

  • ArchHelpers

  • Adds relocation struct defines (b5ae9e4)

  • Arm64

  • Fix LDAPUR/STLUR DMB backpatch (27f2e0b)

  • Adds support for RCPC2 extension (f8ba373)

  • Fixes AtomicSwap (70988cc)

  • Arm64Emitter

  • Optimize constants with ADRP and ADR (912dbfe)

  • CMake

  • C/C++ flags for defined singed overflow warping (2e05349)

  • Add option to use the mold linker (5884114)

  • CompileService

  • Removes no longer necessary service thread (b1033ed)

  • Config

  • Adds code cache config option (278ca52)

  • Core

  • Adds Code Object Cache service (13f3c6e)

  • context-wide guest code invalidations (d810988)

  • EmulatedFiles

  • Fixes temporary file flags (4fbc266)

  • F64

  • Implement FCW using host rounding mode (db3854e)

  • Fix FILD and FIST for Size < 8 (89d6752)

  • FEXBash

  • Set PS1 to make it more obvious when running under FEX (ec38d58)

  • FEXCore

  • Adds refcount_shared_mutex class (1e597bf)

  • Reclaimable thread pool allocator (8a7f395)

  • FEXLoader

  • Fix create_directories check for aotir .path file writting (90f338d)

  • FEXLogServer

  • Stop improper use of std::erase_if (d523b7a)

  • FEXRootFSFetcher

  • Don't continue download (fa87c73)

  • JitSymbols

  • Print file+offset if possible (a715627)

  • Linux

  • Fixes 32-bit mmap (3fd136b)

  • MemAllocator32Bit

  • Add missing lock to shmdt, fix error returns (b2b4c2b)

  • OpcodeDispatcher

  • Implement SHA256 instructions (3bbff8a)

  • Handle SHA-1 instructions (8dd9a5b)

  • Implements support for PAUSE (da48020)

  • Fixes pextrb with high registers (fe11bd2)

  • Remove debugging dump statement (c8dc663)

  • Adds support for non-temporal loadstores (ba78dff)

  • ScopedSignalMask

  • Add shared mutex support, move constructors (8e36f53)

  • Syscalls

  • Fixes clone3 stack pointer (0ed9654)

  • Linux

  • Add guest[Mmap/Munmap] (b9d878b)

  • Refactor guest mman tracking (ce0f5db)

  • TestHarnessRunner

  • Use guest mapper for test harness files (b78af2f...
