
GPU Deppart Tiling API #386

Draft
rohanchanani wants to merge 33 commits into StanfordLegion:main from rohanchanani:review-tiling

Conversation

@rohanchanani

@rohany @lightsighter

Here's an MVP of the new API for just image with no more realm-internal dynamic allocations and a two-pass estimate -> call (the new function is in indexspace.h). The "M" is pulling a lot of weight in MVP, but now it should be easier to concretely think about wiring it up for consumers.

  1. While writing the code for the estimate min/optimal sizes (src/realm/deppart/image.cc:42), I found that it's quite difficult to reason about the correspondence between buffer size and tiling performance as a developer, and I'm struggling to find an estimate that doesn't pass that difficulty on to the user in the form of a drastic difference between the minimum and optimal sizes. @lightsighter - from the code as written, are you able to develop an idea of how you'd have Legion choose a buffer size, and what information you'd want from the realm deppart operation to help that decision? Right now, I think the minimum size is close to right, and the optimal size is a placeholder 5 * volume as I haven't yet done the math to get an upper bound for 1 tile as a function of the volume. If you have any questions on what's going on in any of the code, I'm also available via Slack.

  2. Do you have any general questions/concerns/changes for the general framework? Once we (mostly) lock down the new API, next steps will be to finish implementing the other operations, start wiring it all up to Legion, and make some more precise performance measurements. As @rohany may have mentioned to you, we're planning to submit to SC26 at the end of March, and the experiments we have in mind are:

  • Microbenchmarks of CPU v. GPU partitioning ops
  • Initialization cost reduction in full applications
  • Cupynumeric gather/scatter

@lightsighter
Contributor

I found that it's quite difficult to reason about the correspondence between buffer size and tiling performance as a developer, and I'm struggling to find an estimate that doesn't pass that difficulty on to the user in the form of a drastic difference between the minimum and optimal sizes

I think that is to be expected. This is a highly non-linear optimization space. Providing a sound lower bound to ensure that the code works is important. I think it's ok for the upper bound to be overly conservative and not necessarily tight in order to ensure peak performance. If users want to pay for the performance in terms of memory they can.

from the code as written, are you able to develop an idea of how you'd have Legion choose a buffer size

The answer to this is easy: I don't have to. As with all performance decisions in Legion I just need to expose this in the mapper interface. 😇 This will show up in the Legion map_partition mapper call.

what information you'd want from the realm deppart operation to help that decision?

I think just the bounds are sufficient for now. The default mapper will probably do something along the lines of first try to allocate the upper bound. If that works then we're done. If not, binary search for the largest allocation that will work and then run with that. More likely the kinds of information I would actually want to write a mapper would be profiling responses that tell me more about how my choices impacted the performance of the deppart operation when it actually ran. Then I can use that feedback to potentially alter my decisions in the future. Fortunately we can add new profiling requests on-demand as we need to.

Do you have any general questions/concerns/changes for the general framework?

Pasting my comment from Slack here: I'm going to propose an alternative API function call which shouldn't be too hard to implement. There shouldn't be special calls in Realm for "GPU" versions of functions; Realm can decide which implementation to use based on the arguments to the function call (e.g., is there a temporary buffer, and are the instances visible to the GPU?). The function call interface should be an overload of the existing method name and look like this (I'm going to do the field-descriptor based version and not the transform based one since presumably that is the one you've actually implemented and is the one Legion will need immediately):

template<int N2, typename T2>
Event IndexSpace<N, T>::subspaces_by_image(
        const std::vector<FieldDataDescriptor<IndexSpace<N2, T2>, Point<N, T>>> &field_data,
        const std::vector<IndexSpace<N2, T2>> &sources,
        std::vector<IndexSpace<N, T>> &images, const ProfilingRequestSet &reqs,
        RegionInstance buffer = RegionInstance::NO_INST,
        Event wait_on = Event::NO_EVENT,
        std::pair<size_t,size_t>* buffer_bounds = nullptr);

The current interface then becomes a special case version of this method and can be implemented by just turning around and calling this version of the method with default arguments to maintain backwards compatibility. The implementation of this method should then have the following logic:

  1. If buffer_bounds is not nullptr, do the size computation, fill in the buffer bounds, and return
  2. Check whether all the field data is in a location capable of running the GPU-accelerated pathway; if not, dispatch to the CPU pathway
  3. Check whether a buffer is provided and has sufficient space for running the GPU-accelerated pathway; if not, dispatch to the CPU pathway
  4. If you make it here, then you can run the GPU-accelerated pathway

As @rohany may have mentioned to you, we're planning to submit to SC26 at the end of March

Yes, I am onboard with this plan.

@lightsighter
Contributor

I had two additional questions about the temporary instance:

  1. What kind of layout are you expecting? Are there requirements on it, or can I make an opaque instance?
  2. Is it safe to deferred delete the temporary instance contingent upon the completion event of the dependent partitioning operation? (The answer to this should be yes, but just want to make sure.)

@lightsighter
Contributor

More questions: do you take into account the FieldDataDescriptor objects as part of computing the lower/upper bounds on the temporary buffer size? Even if you don't do it now, can you imagine using that information in the future? What data about the instances in the FieldDataDescriptor objects might you want to introspect?

@rohanchanani
Author

rohanchanani commented Jan 14, 2026

  1. Opaque instance works.

a contiguous range of bytes with a specified alignment

This is exactly what it needs. Right now, I'm creating a 1D instance with 1 field (character) and buffer_size bytes, grabbing a pointer to its base, and defining a custom allocator on that pointer and buffer_size. The allocator is templated on the type of the requested pointer, and aligns its provided pointers accordingly.
2. Yes. The results go into host instance(s) stored in the resulting sparsity map(s), are used as a span, and the sparsity map is responsible for destroying this. Nothing else needs to persist.
3. I've been thinking about this. The only case I can immediately think of is on image range: getting the max volume of the output requires reading all the rectangles (because it's 1 -> n instead of 1 -> 1). The problem with any other use of the instance data is that, if it requires any auxiliary device allocations (e.g. to construct and query a bvh), the estimate itself becomes multi-pass, which I don't think is worth the improvement you'd get in the quality of estimate. So any use of the instance itself would be limited to reading the data, doing some check, and incrementing a counter, although this could still be useful (e.g. counting how many pointers stay in the bounds of an index space).

@lightsighter
Contributor

The problem with any other use of the instance data is that, if it requires any auxiliary device allocations (e.g. to construct and query a bvh), the estimate itself becomes multi-pass, which I don't think is worth the improvement you'd get in the quality of estimate.

There are actually other problems too. For example, inspecting the index spaces and instances passed into the call might not be safe until the precondition event triggers, so you would have to be sure to wait for that as well which would add additional latency.

To be honest, I have mixed feelings about the interface, but I think it is the right one. While a bit ambiguous, I do like the flexibility that this interface provides us to alter the implementation later. I think it is fine if the first-pass implementation does not depend on the input data and is purely a constant bound or a property of the machine (as we previously discussed). In the future, as you point out, we may want implementations that can benefit from introspecting the actual index spaces and instances that are parameters to the call to make a more informed judgement. You could even imagine a scenario in the future where we've trained a small neural network to estimate the bounds for us based on such parameters.

On the Legion side this will be a bit problematic because we'll want the bounds to present to the mapper during the map_partition call, which is also what determines where to place the instances that ultimately show up in the FieldDataDescriptors. It would be nice if the semantics of the interface were such that Realm provides rough bounds if no input data is given to the call, and perhaps tighter bounds if the FieldDataDescriptors are provided. That way clients can get some information out of Realm without having to decide in advance where to place data, and can then maybe get better bounds if they provide more information (in the future, as we refine the implementation, of course).

@rohanchanani
Author

rohanchanani commented Jan 21, 2026

  1. Check to see if a buffer is provided and it has sufficient space for running the GPU accelerated pathway, if not dispatch to the CPU pathway

If they have instances on the device, but not a sufficiently large buffer, the only way to do the operation is to copy the instances to the host and then do the operation there. Is that ok?

template<int N2, typename T2>
Event IndexSpace<N, T>::subspaces_by_image(
        const std::vector<FieldDataDescriptor<IndexSpace<N2, T2>, Point<N, T>>> &field_data,
        const std::vector<IndexSpace<N2, T2>> &sources,
        std::vector<IndexSpace<N, T>> &images, const ProfilingRequestSet &reqs,
        Event wait_on = Event::NO_EVENT);

The existing API looks like this. If I add a new overload with buffer/buffer bounds default arguments, the compiler can't tell which overload to dispatch to on a call with no event, buffer, or region instance. Instead of adding a new overload, I updated the existing one to this:

template<int N2, typename T2>
Event IndexSpace<N, T>::subspaces_by_image(
        const std::vector<FieldDataDescriptor<IndexSpace<N2, T2>, Point<N, T>>> &field_data,
        const std::vector<IndexSpace<N2, T2>> &sources,
        std::vector<IndexSpace<N, T>> &images, const ProfilingRequestSet &reqs,
        Event wait_on = Event::NO_EVENT, RegionInstance buffer = RegionInstance::NO_INST,
        std::pair<size_t, size_t>* buffer_bounds = nullptr);

I then updated the function definition to follow the control flow you described. As written, any existing calls to the function (with or without an event) will have the same behavior, so it should be backwards compatible.

@lightsighter
Contributor

I will have some time during the first week of February to do some of the Legion work needed to use this interface so I'd like to have the interface design at least finished at that point so I can write code to it. It would also be good to have a version of Realm that I can compile and possibly run against, even if it only dispatches back to the CPU path for now just to confirm that things are mostly working, and then you'll be able to turn on functionality when you're ready.

@rohanchanani
Author

Are you happy with the API design as is currently pushed to review-tiling for image? Once that's locked down, it should be pretty quick to finish out the rest of the operations with interface first then functionality.

@lightsighter
Contributor

Ok, I'm waffling again on the API some more. I realized that Realm probably needs to tell us which memory or memories to place the intermediate buffer in. Additionally, we probably want to future-proof this API a little bit, so let's create a struct for the suggested output configuration for temporary buffers. We can add new fields to the struct in the future without breaking existing user code. Therefore, the API should look something like this:

struct SuggestedInstanceConfiguration {
   // Suggested locations for intermediate instances, sorted from best to worst, can be empty
  std::vector<Memory> suggested_memories;
  // Lower bound: the smallest intermediate instance in the suggested memories that can still be used
  size_t lower_bound_size;
  // Upper bound on the largest intermediate instance Realm thinks is necessary to achieve maximum performance on the suggested memory/memories
  size_t upper_bound_size;
  // Minimum alignment for the intermediate instance
  size_t minimum_alignment;
};

Event IndexSpace<N, T>::subspaces_by_image(
        const std::vector<FieldDataDescriptor<IndexSpace<N2, T2>, Point<N, T>>> &field_data,
        const std::vector<IndexSpace<N2, T2>> &sources,
        std::vector<IndexSpace<N, T>> &images, const ProfilingRequestSet &reqs,
        Event wait_on = Event::NO_EVENT, RegionInstance buffer = RegionInstance::NO_INST,
        SuggestedInstanceConfiguration* configuration = nullptr
);

You should be able to reuse this same structure across all deppart API calls that need an intermediate buffer. Thoughts?

@rohanchanani
Author

This looks fine to me. Rohan also suggested having something like a std::map<Memory, ...>, where the ... is metadata describing the buffer needed in that memory (from the runtime) plus the buffer itself (from the user). This assumes the partitioning call has instances in multiple different device memories (which Rohan said is the Legion preimage behavior), and we want to dispatch a GPUMicroOp for each device in which you get an instance, each with its own buffer in the right memory. I like the idea of using a struct - it seems increasingly likely that we'll have further updates.

@lightsighter
Contributor

I think we might need to have a short meeting to discuss preimage/by-field, since they both have this behavior where it is actually better to give Realm multiple nodes' worth of data in a single API call and then have Realm build acceleration data structures that are used for all the micro-ops it issues. I'd like to understand a bit more about how you're going to handle those cases, and then we can discuss the API design. Will talk with @rohany about when we can set up a meeting tomorrow.
