Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES true by default #1228

Closed
bradgrantham-lunarg opened this issue Aug 16, 2023 · 10 comments
Closed
Assignees
Labels
P1 Prevents an important capture from being replayed

Comments

@bradgrantham-lunarg
Copy link
Contributor

If the option GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES is not set, is it really possible to get corruption? Or has PageGuard tracking improved so that we don't have missing updates for overlapping allocations? I don't know enough about the implementation to say. If so, shouldn't we make GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES the default?

@panos-lunarg
Copy link
Contributor

is it really possible to get corruption?

Yes.
Imagine you have two buffers, let's say one is an mvp matrix that you update each frame from the host.
The other one is a single u32 value that the get updated by the gpu and the hosts reads it every frame.
Lets say the application binds those two buffer next to each other in the same device memory. One is 64 bytes and the other one is 4 bytes. Since GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES is false, depending on the driver, the buffers can actually be bound right next to each other, without any space between them. First you update the mvp and then you read the u32. That read will actually memcpy a whole page from the mapped memory overwriting the mvp.

Setting GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES to true will "fool" the application that each buffer needs a whole page of memory so the two buffers will have enough space between them.

Or has PageGuard tracking improved so that we don't have missing updates for overlapping allocations?

Page guard sees mappings and is not aware what's inside those mappings. So I don't think it's going to be some straightforward change to make page guard aware of that.

I guess the logic behind having GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES false by default is so that the application's commands are recorded with as little modification as possible.

I understand the default false is a source of problems with the CI which leads to a lot of wasted time. I agree with setting this to true by default.

@davidlunarg
Copy link
Contributor

If an app creates a lot of small buffers, having a default value of true for GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES would result in more buffer memory usage. If an app creates lots of small buffers, each one would be a full page in size, whereas if it were set to false the buffers would all be compacted together in adjacent memory with many on the same page. Setting it to false results in less buffer memory usage. But setting it to true results in correct captures. I think correct is more important than efficient.

Is it the case that a more descriptive name for this variable might be GFXRECON_PAGE_GUARD_ONE_BUFFER_PER PAGE? (Not that I'm proposing a name change, just trying to clarify in my mind how the variable affects behavior.)

@panos-lunarg
Copy link
Contributor

No I think GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES is the right name. When true the adjusting is done when a vkGet(Device)[Buffer|Image]MemoryRequirements(2) is called and it adjusts the size and alignment reported by these functions by rounding them up to the next multiple of the system's page size.

For example:

void PostProcess_vkGetBufferMemoryRequirements(VkDevice              device,
                                                   VkBuffer              buffer,
                                                   VkMemoryRequirements* pMemoryRequirements)
    {
        GFXRECON_UNREFERENCED_PARAMETER(device);
        GFXRECON_UNREFERENCED_PARAMETER(buffer);

        if ((GetMemoryTrackingMode() == CaptureSettings::MemoryTrackingMode::kPageGuard) &&
            GetPageGuardAlignBufferSizes() && (pMemoryRequirements != nullptr))
        {
            util::PageGuardManager* manager = util::PageGuardManager::Get();
            assert(manager != nullptr);

            GFXRECON_CHECK_CONVERSION_DATA_LOSS(size_t, pMemoryRequirements->size);
            GFXRECON_CHECK_CONVERSION_DATA_LOSS(size_t, pMemoryRequirements->alignment);

            pMemoryRequirements->size = manager->GetAlignedSize(static_cast<size_t>(pMemoryRequirements->size));
            pMemoryRequirements->alignment =
                manager->GetAlignedSize(static_cast<size_t>(pMemoryRequirements->alignment));
        }
    }

@per-mathisen-arm
Copy link
Contributor

FWIW, I support having this on by default. I would propose naming it GFXRECON_PAGE_GUARD_OBJECT_PAGE_ALIGNMENT though, since it isn't just about buffers and the point is the page alignment, not the padding in itself. And do you really need to modify the size parameter as well as the alignment parameter?

@bradgrantham-lunarg
Copy link
Contributor Author

After thinking about it, I'm a little concerned that making ALIGN_BUFFER_SIZES the default will hide the application's original intent, which was to pack buffers tightly. Being able to capture that in the file encodes the application's intent, and so if the application's behavior precipitates a driver or hardware bug then the capture file won't be able to reproduce it and someone analyzing the capture file won't see it.

That read will actually memcpy a whole page from the mapped memory overwriting the mvp.

@panos-lunarg Do you think we could work around that somehow? Like mark buffers that share pages so that we repair the other buffers within the page after reads or some such?

@panos-lunarg
Copy link
Contributor

The logic behind page_guard_manager (whether it uses mprotect or uffd) does not know about what is bound to a memory. It only cares about which portion of that memory is mapped.

I think it can be modified to handle that but it's not going to be a straightforward/simple change.

@bradgrantham-lunarg
Copy link
Contributor Author

If we don't set it to be the default, we should potentially set it in CI more frequently than we do.

@bradgrantham-lunarg bradgrantham-lunarg added the P1 Prevents an important capture from being replayed label Oct 31, 2023
@andrew-lunarg
Copy link
Contributor

I would say we should make it the default and perhaps add a little documentation section on tweaking for fidelity for advanced users. I keep coming back to the first experience of a new user. As developers of the tool its hard to appreciate the user experience of trying us out and having a bunch of arcane environment variables and flags to twiddle to get going. For that person, they have no idea whether we are just completely broken: they won't have the confidence that one of these tweaks is going to work. Let's get them up and running and then they can dial in settings for specific needs like ultimate fidelity later.

That docs section:

Guide to Maximum Fidelity in Capture and Replay

Warning, this section covers features for advanced users with very specific needs. These features may make it much harder to achieve a working capture and may require considerable tweaking and you may ultimately find that a correct capture is impossible with them due to the usage of the target API by the application to be captured. ... explanation ensues

@bradgrantham-lunarg bradgrantham-lunarg changed the title Should GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES be the default? Make GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES true by default Feb 29, 2024
@bradgrantham-lunarg
Copy link
Contributor Author

Fixed by #1431

@bradgrantham-lunarg
Copy link
Contributor Author

A note. When an application uses VkMemoryDedicatedAllocateInfo and the developer captures with GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES, replay with --validate will trigger VUID-VkMemoryDedicatedAllocateInfo-image-02964.

During capture with GFXRECON_PAGE_GUARD_ALIGN_BUFFER_SIZES , we pad the size returned from vkGetMemoryRequirements to a page, so the application allocates this altered size. If VVL is run in capture "upchain" of GFXR (before GFXR in the layer list), then as far as VVL can tell, the implementation returned a size and the application allocated that size, so it honored 02964.

But then in replay with --validate, VVL will see the implementation's original size and the captured altered allocation size and trigger 02964.

However, because the application did not allocate the implementation's stated size the app technically used Vulkan in an invalid way in capture but only because GFXR capture told it to.

If the application could be captured successfully with no rendering errors or crash, then this VVL message is probably harmless.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 Prevents an important capture from being replayed
Projects
None yet
Development

No branches or pull requests

6 participants