Using userfaultfd in page guard manager #1161

Merged
merged 7 commits into from
Mar 24, 2024

panos-lunarg
Contributor

No description provided.

@ci-tester-lunarg
CI gfxreconstruct build queued with queue ID 22585.

@ci-tester-lunarg
CI gfxreconstruct build # 2879 running.

@ci-tester-lunarg
CI gfxreconstruct build queued with queue ID 22599.

@ci-tester-lunarg
CI gfxreconstruct build queued with queue ID 22605.

@ci-tester-lunarg
CI gfxreconstruct build # 2880 running.

@ci-tester-lunarg
CI gfxreconstruct build queued with queue ID 22625.

@ci-tester-lunarg
CI gfxreconstruct build # 2881 running.

@per-mathisen-arm
Contributor

Exciting! Any data on the performance of this solution compared with mprotect? Why is it not enabled for Android?

@ci-tester-lunarg
CI gfxreconstruct build queued with queue ID 22646.

@ci-tester-lunarg
CI gfxreconstruct build # 2882 running.

@panos-lunarg
Contributor Author

panos-lunarg commented Jun 23, 2023

@per-mathisen-arm
This is still experimental. It's not enabled on Android because I haven't managed to get it to work properly there yet.

I suspect that this simply can't work on Android, because ART already uses userfaultfd for garbage collection and the kernel may report all events to a single fd. If so, all reports go to ART instead of GFXR's handler.

All I see, when running in an Android app, is that the faulting thread never gets halted and the fd GFXR creates never gets any notifications.

I partly verified this theory by creating multiple fds in a test app on Linux. Only the first one gets the notifications, while poll() on the rest just times out.

@ci-tester-lunarg
CI gfxreconstruct build # 2882 failed.

@ci-tester-lunarg
CI gfxreconstruct build queued with queue ID 8.

@ci-tester-lunarg
CI gfxreconstruct build # 2884 running.

@ci-tester-lunarg
CI gfxreconstruct-extended build queued with queue ID 29.

@ci-tester-lunarg
CI gfxreconstruct-extended build # 211 running.

@ci-tester-lunarg
CI gfxreconstruct build # 2884 failed.

@ci-tester-lunarg
CI gfxreconstruct-extended build # 211 failed.

@ci-tester-lunarg
CI gfxreconstruct build queued with queue ID 777.

@ci-tester-lunarg
CI gfxreconstruct build # 2887 running.

@ci-tester-lunarg
CI gfxreconstruct build # 2887 failed.

UffdUnregisterMemory(memory_info->shadow_memory, memory_info->shadow_range);

void* shadow_memory = memory_info->shadow_memory;
if (munmap(memory_info->shadow_memory, memory_info->shadow_range))
Contributor

I don't think you are supposed to unmap the memory first here. This introduces a race condition between the unmap and the new mmap.

Contributor Author

I wasn't aware of that. I am dealing with random segfaults that this might explain.
Any ideas?

Contributor

I think you can just remove the unmap call. Redoing the mmap call with MAP_FIXED should work and discard the previously mapped pages.
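
As a rough illustration of this suggestion (a sketch, not the code from this PR): calling mmap with MAP_FIXED on a range that is already mapped replaces the old mapping in a single call, so the range is never left unmapped for another thread's mmap to claim. `RemapInPlace` is a hypothetical helper name, and the sketch assumes a private anonymous mapping. The fresh mapping would presumably have to be registered with userfaultfd again afterwards.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the PR): replace an existing anonymous
// mapping at `addr` with fresh zero-filled pages, without unmapping it first.
// Returns true when the kernel placed the new mapping at exactly `addr`.
bool RemapInPlace(void* addr, size_t length)
{
    // MAP_FIXED requires a page-aligned address.
    const long page_size = sysconf(_SC_PAGESIZE);
    assert(reinterpret_cast<uintptr_t>(addr) % static_cast<uintptr_t>(page_size) == 0);
    (void)page_size;

    // Any pages previously mapped in [addr, addr + length) are discarded.
    void* result = mmap(addr, length, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return result == addr;
}
```

Applied to the snippet under review, this would mean passing `memory_info->shadow_memory` and `memory_info->shadow_range` straight to the new mmap call instead of calling munmap first.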

Contributor Author

panos-lunarg commented Jul 4, 2023

Seems you're right, the random segfaults are gone!

However, I am seeing random pages that trigger a pagefault event but, when I populate them, ioctl/UFFDIO_COPY fails with a "File exists" error. Indeed, checking these pages beforehand with mincore shows that they are already allocated.

I am not sure if these are responsible for the random visual artifacts (randomly corrupted textures and meshes) I am observing.
Also note that I see neither the already allocated pages nor the visual artifacts when trying this on a desktop game (recapturing a trace of a commercial title).
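
For reference, the mincore check mentioned above could look roughly like the sketch below. `PageIsResident` is a hypothetical helper, not code from this PR. "File exists" is the message for errno EEXIST, so a handler could, under these assumptions, use such a check (or the EEXIST result itself) to recognise a page that is already populated instead of treating the failure as fatal.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): report whether the page starting at
// `page_start` is currently resident in memory, according to mincore().
// `page_start` must be page-aligned and lie inside a valid mapping.
bool PageIsResident(void* page_start)
{
    const size_t page_size = static_cast<size_t>(sysconf(_SC_PAGESIZE));

    unsigned char vec = 0; // One status byte per page in the queried range.
    const int rc = mincore(page_start, page_size, &vec);
    assert(rc == 0 && "mincore failed");
    (void)rc;

    // Only the least significant bit of each status byte is defined.
    return (vec & 1u) != 0;
}
```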

Contributor Author

panos-lunarg commented Jul 4, 2023

Also, on desktop the munmap + mmap combination seems to work.
Perhaps the two OSes use different strategies to allocate pages (or I'm just comparing apples to oranges).

{
if (msg[i].event == UFFD_EVENT_PAGEFAULT)
{
UffdHandleFault(uffd_fd_, msg[i].arg.pagefault.address, msg[i].arg.pagefault.flags);
Contributor

I think if you handle multiple events here, you need to use the DONTWAKE flags in the mode field for all but the last event. Otherwise you will get more race conditions, as the calling code wakes up and can do reads/writes that we presumably cannot catch while we are still processing more events.

Contributor Author

Ah, good catch.
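
A minimal sketch of the DONTWAKE idea discussed in this thread, assuming pages are populated with UFFDIO_COPY. All helper names (`CopyModeForEvent`, `ResolveFault`, `WakeRange`) are hypothetical and this is not the code that was merged. `UFFDIO_COPY_MODE_DONTWAKE` and `UFFDIO_WAKE` come from `<linux/userfaultfd.h>`. As far as I understand, a thread blocked on a page that was resolved with DONTWAKE stays blocked until it is woken explicitly, which is what `WakeRange` is for.

```cpp
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>

#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: suppress the wake-up for every event in a batch
// except the last one.
uint64_t CopyModeForEvent(size_t index, size_t count)
{
    assert(count > 0 && index < count);
    return (index + 1 < count) ? UFFDIO_COPY_MODE_DONTWAKE : 0;
}

// Hypothetical helper: resolve one fault by copying `page_size` bytes from
// `src` into the faulting page. Returns false if the ioctl fails.
bool ResolveFault(int uffd, uint64_t fault_page, const void* src, size_t page_size, uint64_t mode)
{
    uffdio_copy copy = {};
    copy.dst         = fault_page;
    copy.src         = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(src));
    copy.len         = page_size;
    copy.mode        = mode;
    return ioctl(uffd, UFFDIO_COPY, &copy) == 0;
}

// Hypothetical helper: explicitly wake threads blocked on faults in
// [start, start + len), for pages that were resolved with DONTWAKE.
bool WakeRange(int uffd, uint64_t start, uint64_t len)
{
    uffdio_range range = {};
    range.start        = start;
    range.len          = len;
    return ioctl(uffd, UFFDIO_WAKE, &range) == 0;
}
```

In the event loop shown above, the mode for `msg[i]` would be `CopyModeForEvent(i, n)`, where `n` is the number of events read in that batch. Once the batch is processed, `WakeRange` could be called for each page that was copied with DONTWAKE.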

@ci-tester-lunarg
CI gfxreconstruct build # 3862 passed.

@ci-tester-lunarg
CI gfxreconstruct-extended build queued with queue ID 150384.

@ci-tester-lunarg
CI gfxreconstruct-extended build # 330 running.

@ci-tester-lunarg
CI gfxreconstruct-extended build queued with queue ID 150989.

@ci-tester-lunarg
CI gfxreconstruct-extended build # 331 running.

@ci-tester-lunarg
CI gfxreconstruct-extended build # 331 failed.

@ci-tester-lunarg
CI gfxreconstruct build # 3859 failed.

@ci-tester-lunarg
CI gfxreconstruct build queued with queue ID 154672.

@ci-tester-lunarg
CI gfxreconstruct build # 3897 running.

@ci-tester-lunarg
CI gfxreconstruct build # 3897 passed.

@ci-tester-lunarg
CI gfxreconstruct-extended build queued with queue ID 155485.

@ci-tester-lunarg
CI gfxreconstruct-extended build # 336 running.

@ci-tester-lunarg
CI gfxreconstruct-extended build # 336 passed.

panos-lunarg and others added 7 commits March 23, 2024 17:56
Introducing an alternative memory tracking mode which utilizes the
userfaultfd mechanism provided by the Linux kernel. This new mode should
provide an alternative for Linux and Android applications that cannot
be captured with the page_guard mode because they interfere
with the SIGSEGV handler. The new mode is enabled by setting
GFXRECON_MEMORY_TRACKING_MODE/debug.gfxrecon.memory_tracking_mode to
uffd.

Converted the constructor with the default values into a delegating
constructor

Also use uint64_t for thread ids instead of pid_t

@ci-tester-lunarg
CI gfxreconstruct build queued with queue ID 155807.

@ci-tester-lunarg
CI gfxreconstruct build # 3908 running.

@ci-tester-lunarg
CI gfxreconstruct build # 3908 passed.

@ci-tester-lunarg
CI gfxreconstruct-extended build queued with queue ID 155851.

@ci-tester-lunarg
CI gfxreconstruct-extended build # 337 running.

@ci-tester-lunarg
CI gfxreconstruct-extended build # 337 passed.

@ziga-lunarg ziga-lunarg merged commit 2d37240 into LunarG:dev Mar 24, 2024
8 checks passed
Labels
P1 Prevents an important capture from being replayed
5 participants