
GPU Deppart Tiling API #386

Draft
rohanchanani wants to merge 33 commits into StanfordLegion:main from rohanchanani:review-tiling

Conversation

@rohanchanani

@rohany @lightsighter

Here's an MVP of the new API for just image with no more realm-internal dynamic allocations and a two-pass estimate -> call (the new function is in indexspace.h). The "M" is pulling a lot of weight in MVP, but now it should be easier to concretely think about wiring it up for consumers.

  1. While writing the code for the estimate min/optimal sizes (src/realm/deppart/image.cc:42), I found that it's quite difficult to reason about the correspondence between buffer size and tiling performance as a developer, and I'm struggling to find an estimate that doesn't pass that difficulty on to the user in the form of a drastic difference between the minimum and optimal sizes. @lightsighter - from the code as written, are you able to develop an idea of how you'd have Legion choose a buffer size, and what information you'd want from the realm deppart operation to help that decision? Right now, I think the minimum size is close to right, and the optimal size is a placeholder 5 * volume as I haven't yet done the math to get an upper bound for 1 tile as a function of the volume. If you have any questions on what's going on in any of the code, I'm also available via Slack.

  2. Do you have any general questions/concerns/changes for the general framework? Once we (mostly) lock down the new API, next steps will be to finish implementing the other operations, start wiring it all up to Legion, and make some more precise performance measurements. As @rohany may have mentioned to you, we're planning to submit to SC26 at the end of March, and the experiments we have in mind are:

  • Microbenchmarks of CPU v. GPU partitioning ops
  • Initialization cost reduction in full applications
  • Cupynumeric gather/scatter

@lightsighter
Contributor

I found that it's quite difficult to reason about the correspondence between buffer size and tiling performance as a developer, and I'm struggling to find an estimate that doesn't pass that difficulty on to the user in the form of a drastic difference between the minimum and optimal sizes

I think that is to be expected. This is a highly non-linear optimization space. Providing a sound lower bound to ensure that the code works is important. I think it's ok for the upper bound to be overly conservative and not necessarily tight in order to ensure peak performance. If users want to pay for the performance in terms of memory they can.

from the code as written, are you able to develop an idea of how you'd have Legion choose a buffer size

The answer to this is easy: I don't have to. As with all performance decisions in Legion I just need to expose this in the mapper interface. 😇 This will show up in the Legion map_partition mapper call.

what information you'd want from the realm deppart operation to help that decision?

I think just the bounds are sufficient for now. The default mapper will probably do something along the lines of first try to allocate the upper bound. If that works then we're done. If not, binary search for the largest allocation that will work and then run with that. More likely the kinds of information I would actually want to write a mapper would be profiling responses that tell me more about how my choices impacted the performance of the deppart operation when it actually ran. Then I can use that feedback to potentially alter my decisions in the future. Fortunately we can add new profiling requests on-demand as we need to.

Do you have any general questions/concerns/changes for the general framework?

Pasting my comment from Slack here: I'm going to propose an alternative API function call which shouldn't be too hard to implement. There shouldn't be special calls in Realm for "GPU" versions of functions; Realm can decide which implementation to use based on the arguments to the function call (e.g., is there a temporary buffer, and are the instances visible to the GPU?). The function call interface should be an overload of the existing method name and look like this (I'm going to do the field-descriptor based version and not the transform based one since presumably that is the one you've actually implemented and is the one Legion will need immediately):

template<int N2, typename T2>
Event IndexSpace<N, T>::subspaces_by_image(
        const std::vector<FieldDataDescriptor<IndexSpace<N2, T2>, Point<N, T>>> &field_data,
        const std::vector<IndexSpace<N2, T2>> &sources,
        std::vector<IndexSpace<N, T>> &images, const ProfilingRequestSet &reqs,
        RegionInstance buffer = RegionInstance::NO_INST,
        Event wait_on = Event::NO_EVENT,
        std::pair<size_t,size_t>* buffer_bounds = nullptr);

The current interface then becomes a special case version of this method and can be implemented by just turning around and calling this version of the method with default arguments to maintain backwards compatibility. The implementation of this method should then have the following logic:

  1. If buffer_bounds is not nullptr, do the size computation, fill in the buffer bounds, and return
  2. Check whether all the field data is in a location capable of running the GPU-accelerated pathway; if not, dispatch to the CPU pathway
  3. Check whether a buffer is provided and has sufficient space for running the GPU-accelerated pathway; if not, dispatch to the CPU pathway
  4. If you make it here, then you can run the GPU-accelerated pathway

As @rohany may have mentioned to you, we're planning to submit to SC26 at the end of March

Yes, I am onboard with this plan.

@lightsighter
Contributor

I had two additional questions about the temporary instance:

  1. What kind of layout are you expecting? Are there requirements on it, or can I make an opaque instance?
  2. Is it safe to deferred delete the temporary instance contingent upon the completion event of the dependent partitioning operation? (The answer to this should be yes, but just want to make sure.)

@lightsighter
Contributor

More questions: do you take into account the FieldDataDescriptor objects as part of computing the lower/upper bounds on the temporary buffer size? Even if you don't do it now, can you imagine using that information in the future? What data about the instances in the FieldDataDescriptor objects might you want to introspect?

@rohanchanani
Author

rohanchanani commented Jan 14, 2026

  1. Opaque instance works.

a contiguous range of bytes with a specified alignment

This is exactly what it needs. Right now, I'm creating a 1D instance with 1 field (character) and buffer_size bytes, grabbing a pointer to its base, and defining a custom allocator on that pointer and buffer_size. The allocator is templated on the type of the requested pointer, and aligns its provided pointers accordingly.
2. Yes. The results go into host instance(s) stored in the resulting sparsity map(s), are used as a span, and the sparsity map is responsible for destroying this. Nothing else needs to persist.
3. I've been thinking about this. The only case I can immediately think of is on image range: getting the max volume of the output requires reading all the rectangles (because it's 1 -> n instead of 1 -> 1). The problem with any other use of the instance data is that, if it requires any auxiliary device allocations (e.g. to construct and query a bvh), the estimate itself becomes multi-pass, which I don't think is worth the improvement you'd get in the quality of estimate. So any use of the instance itself would be limited to reading the data, doing some check, and incrementing a counter, although this could still be useful (e.g. counting how many pointers stay in the bounds of an index space).

@lightsighter
Contributor

The problem with any other use of the instance data is that, if it requires any auxiliary device allocations (e.g. to construct and query a bvh), the estimate itself becomes multi-pass, which I don't think is worth the improvement you'd get in the quality of estimate.

There are actually other problems too. For example, inspecting the index spaces and instances passed into the call might not be safe until the precondition event triggers, so you would have to be sure to wait for that as well which would add additional latency.

To be honest, I have mixed feelings about the interface, but I think it is the right one. While a bit ambiguous, I do like the flexibility that this interface provides us to alter the implementation later. I think it is fine if the first-pass implementation does not depend on the input data and is purely a constant bound or a property of the machine (as we previously discussed). In the future, as you point out, we may want implementations that can benefit from introspecting the actual index spaces and instances that are parameters to the call to make a more informed judgement. You could even imagine a scenario in the future where we've trained a small neural network to estimate the bounds for us based on such parameters.

On the Legion side this will be a bit problematic because we'll want the bounds to present to the mapper during the map_partition call, which is also what determines where to place the instances that ultimately show up in the FieldDataDescriptors. It would be nice if the semantics of the interface were such that Realm provides rough bounds if no input data is given to the call, and perhaps tighter bounds if the FieldDataDescriptors are provided. That way clients can get some information out of Realm without having to decide in advance where to place data, and can then maybe get better bounds if they provide more information (in the future, as we refine the implementation, of course).

@rohanchanani
Author

rohanchanani commented Jan 21, 2026

  1. Check to see if a buffer is provided and it has sufficient space for running the GPU accelerated pathway, if not dispatch to the CPU pathway

If they have instances on the device, but not a sufficiently large buffer, the only way to do the operation is to copy the instances to the host and then do the operation there. Is that ok?

template<int N2, typename T2>
Event IndexSpace<N, T>::subspaces_by_image(
        const std::vector<FieldDataDescriptor<IndexSpace<N2, T2>, Point<N, T>>> &field_data,
        const std::vector<IndexSpace<N2, T2>> &sources,
        std::vector<IndexSpace<N, T>> &images, const ProfilingRequestSet &reqs,
        Event wait_on = Event::NO_EVENT);

The existing API looks like this. If I add a new overload with buffer/buffer bounds default arguments, the compiler can't tell which overload to dispatch to on a call with no event, buffer, or region instance. Instead of adding a new overload, I updated the existing one to this:

template<int N2, typename T2>
Event IndexSpace<N, T>::subspaces_by_image(
        const std::vector<FieldDataDescriptor<IndexSpace<N2, T2>, Point<N, T>>> &field_data,
        const std::vector<IndexSpace<N2, T2>> &sources,
        std::vector<IndexSpace<N, T>> &images, const ProfilingRequestSet &reqs,
        Event wait_on = Event::NO_EVENT, RegionInstance buffer = RegionInstance::NO_INST,
        std::pair<size_t, size_t>* buffer_bounds = nullptr);

I then updated the function definition to follow the control flow you described. As written, any existing calls to the function (with or without an event) will have the same behavior, so it should be backwards compatible.

@lightsighter
Contributor

I will have some time during the first week of February to do some of the Legion work needed to use this interface so I'd like to have the interface design at least finished at that point so I can write code to it. It would also be good to have a version of Realm that I can compile and possibly run against, even if it only dispatches back to the CPU path for now just to confirm that things are mostly working, and then you'll be able to turn on functionality when you're ready.

@rohanchanani
Author

Are you happy with the API design as is currently pushed to review-tiling for image? Once that's locked down, it should be pretty quick to finish out the rest of the operations with interface first then functionality.

@lightsighter
Contributor

Ok, I'm waffling again on the API some more. I realized that Realm probably needs to tell us which memory or memories to place the intermediate buffer in. Additionally, we probably want to future-proof this API a little bit, so let's create a struct for the suggested output configuration for temporary buffers. We can add new fields to the struct in the future without breaking existing user code. Therefore, the API should look something like this:

struct SuggestedInstanceConfiguration {
   // Suggested locations for intermediate instances, sorted from best to worst, can be empty
  std::vector<Memory> suggested_memories;
  // Lower bound: the smallest intermediate instance in the suggested memories that can still be used
  size_t lower_bound_size;
  // Upper bound on the largest intermediate instance Realm thinks is necessary to achieve maximum performance on the suggested memory/memories
  size_t upper_bound_size;
  // Minimum alignment for the intermediate instance
  size_t minimum_alignment;
};

Event IndexSpace<N, T>::subspaces_by_image(
        const std::vector<FieldDataDescriptor<IndexSpace<N2, T2>, Point<N, T>>> &field_data,
        const std::vector<IndexSpace<N2, T2>> &sources,
        std::vector<IndexSpace<N, T>> &images, const ProfilingRequestSet &reqs,
        Event wait_on = Event::NO_EVENT, RegionInstance buffer = RegionInstance::NO_INST,
        SuggestedInstanceConfiguration* configuration = nullptr
);

You should be able to reuse this same structure across all deppart API calls that need an intermediate buffer. Thoughts?

@rohanchanani
Author

This looks fine to me. Rohan also suggested having something like a std::map<Memory, ...>, where the ... is metadata describing the buffer needed in that memory (from the runtime) plus the buffer itself (from the user). This assumes the partitioning call has instances in multiple different device memories (which Rohan said is the Legion preimage behavior), and we want to dispatch a GPUMicroOp for each device in which you get an instance, each with its own buffer in the right memory. I like the idea of using a struct - it seems increasingly likely that we'll have further updates.

@lightsighter
Contributor

I think we might need to have a short meeting to discuss preimage/by-field, since they both have this behavior where it is actually better to give Realm multiple nodes' worth of data in a single API call and then have Realm build acceleration data structures that are used for all the micro-ops it issues. I'd like to understand a bit more about how you're going to handle those cases, and then we can discuss the API design. Will talk with @rohany about when we can set up a meeting tomorrow.
