Conversation
I think that is to be expected. This is a highly non-linear optimization space. Providing a sound lower bound to ensure that the code works is important. I think it's ok for the upper bound to be overly conservative and not necessarily tight in order to ensure peak performance. If users want to pay for the performance in terms of memory, they can.
The answer to this is easy: I don't have to. As with all performance decisions in Legion, I just need to expose this in the mapper interface. 😇 This will show up in the Legion mapper interface.
I think just the bounds are sufficient for now. The default mapper will probably do something along the lines of first try to allocate the upper bound. If that works then we're done. If not, binary search for the largest allocation that will work and then run with that. More likely the kinds of information I would actually want to write a mapper would be profiling responses that tell me more about how my choices impacted the performance of the deppart operation when it actually ran. Then I can use that feedback to potentially alter my decisions in the future. Fortunately we can add new profiling requests on-demand as we need to.
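The default-mapper strategy described above can be sketched as a small search routine. This is a hypothetical illustration, not Legion's actual mapper code: `try_alloc` stands in for an attempted allocation in the target memory, and we assume the lower bound is sound (an allocation of `lower` always succeeds).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Sketch of the strategy above: first try the upper (optimal) bound; if that
// allocation fails, binary search for the largest size that does succeed.
// Assumes try_alloc(lower) is true, i.e. the lower bound is sound.
std::size_t largest_feasible_size(std::size_t lower, std::size_t upper,
                                  const std::function<bool(std::size_t)> &try_alloc) {
  if (try_alloc(upper))
    return upper;                 // optimal bound fits: done
  std::size_t lo = lower, hi = upper;
  while (lo < hi) {
    // Bias the midpoint up so the loop always makes progress.
    std::size_t mid = lo + (hi - lo + 1) / 2;
    if (try_alloc(mid))
      lo = mid;                   // mid works; look for something larger
    else
      hi = mid - 1;               // mid fails; shrink the search window
  }
  return lo;                      // largest size that allocated successfully
}
```

In a real mapper, each probe would be an actual instance-creation attempt, so the profiling feedback mentioned above could eventually replace the search with a learned first guess.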
Pasting my comment from Slack here: I'm going to propose an alternative API function call which shouldn't be too hard to implement. There shouldn't be special calls in Realm for "GPU" versions of functions, Realm can decide which implementation to use based on the arguments to the function call (e.g. is there a temporary buffer and are the instances visible to the GPU). The function call interface should be an overload of the existing method name and look like this (I'm going to do the field-descriptor based version and not the transform based one since presumably that is the one you've actually implemented and is the one Legion will need immediately): The current interface then becomes a special case version of this method and can be implemented by just turning around and calling this version of the method with default arguments to maintain backwards compatibility. The implementation of this method should then have the following logic:
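The dispatch rule proposed above (no special "GPU" entry points; Realm picks the implementation from the arguments) can be sketched roughly as follows. The types and the function name here are toy stand-ins, not the actual Realm signatures:

```cpp
#include <cassert>

// Toy stand-in for the argument bundle a deppart call receives; the real
// call takes field descriptors, instances, a temporary buffer, and an event.
struct ImageArgs {
  bool has_temp_buffer;        // did the caller supply an intermediate buffer?
  bool instances_gpu_visible;  // are the relevant instances visible to the GPU?
};

enum class Path { GPU, CPU };

// Single overload, no "GPU" variant: the implementation inspects its
// arguments and chooses a path. Calls without a buffer fall back to the
// existing CPU path, which keeps the old interface backwards compatible.
Path choose_impl(const ImageArgs &args) {
  if (args.has_temp_buffer && args.instances_gpu_visible)
    return Path::GPU;   // accelerated path
  return Path::CPU;     // legacy / default-argument path
}
```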
Yes, I am onboard with this plan.
I had two additional questions about the temporary instance:
More questions: do you take into account the …
This is exactly what it needs. Right now, I'm creating a 1D instance with 1 field (character) and buffer_size bytes, grabbing a pointer to its base, and defining a custom allocator on that pointer and buffer_size. The allocator is templated on the type of the requested pointer, and aligns its provided pointers accordingly.
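The allocator scheme described here can be sketched as a simple aligned bump allocator. This is an illustrative reconstruction, not the actual code: `BumpArena` is a hypothetical name, and the real version sits on top of the base pointer of a 1-D Realm instance rather than a plain array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal bump allocator over a caller-provided buffer. Each handed-out
// pointer is aligned to alignof(T), matching the "templated on the type of
// the requested pointer" behavior described above.
class BumpArena {
  std::uintptr_t cur_, end_;
public:
  BumpArena(void *base, std::size_t size)
    : cur_(reinterpret_cast<std::uintptr_t>(base)), end_(cur_ + size) {}

  template <typename T>
  T *allocate(std::size_t count) {
    // Round the cursor up to the next multiple of alignof(T).
    std::uintptr_t p =
        (cur_ + alignof(T) - 1) & ~(std::uintptr_t(alignof(T)) - 1);
    if (p + count * sizeof(T) > end_)
      return nullptr;            // out of space in the temporary buffer
    cur_ = p + count * sizeof(T);
    return reinterpret_cast<T *>(p);
  }
};
```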
There are actually other problems too. For example, inspecting the index spaces and instances passed into the call might not be safe until the precondition event triggers, so you would have to be sure to wait for that as well, which would add additional latency. To be honest, I have mixed feelings about the interface, but I think it is the right one. While a bit ambiguous, I do like the flexibility that this interface provides us to alter the implementation later. I think it is fine if the first-pass implementation does not depend on the input data and is purely a constant bound or a property of the machine (as we previously discussed). In the future, as you point out, we may want implementations that can benefit from introspecting the actual index spaces and instances that are parameters to the call to make a more informed judgement. You could even imagine a scenario in the future where we've trained a small neural network to estimate the bounds for us based on such parameters. On the Legion side this will be a bit problematic because we'll want the bounds to present to the mapper during the …
If they have instances on the device but not a sufficiently large buffer, the only way to do the operation is to copy the instances to the host and then do the operation there. Is that ok? The existing API looks like this. If I add a new overload with buffer/buffer-bounds default arguments, the compiler can't tell which overload to dispatch to on a call with no event, buffer, or region instance. Instead of adding a new overload, I updated the existing one to this: I then updated the function definition to follow the control sequence you described. As written, any existing calls to the function (with or without an event) will have the same behavior, so it should be backwards compatible.
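The overload-ambiguity problem mentioned here, and the fix of extending the existing signature with trailing defaulted parameters, can be shown with a toy example. `Event`, `RegionInstance`, and `create_image` are hypothetical stand-ins; the real Realm signatures differ.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-ins for the Realm handle types.
struct Event { int id = 0; };
struct RegionInstance {
  int id = -1;
  bool exists() const { return id >= 0; }
};

// A second overload whose parameters all have defaults would make a bare
// `create_image()` call ambiguous. Extending the one existing signature
// with trailing defaulted parameters avoids that: old call sites (with or
// without an event) compile unchanged and take the legacy path.
int create_image(Event precondition = Event(),
                 RegionInstance temp_buffer = RegionInstance(),
                 std::size_t buffer_bound = 0) {
  (void)precondition;
  (void)buffer_bound;
  if (!temp_buffer.exists())
    return 0;   // legacy path: no buffer supplied
  return 1;     // buffer supplied: accelerated path is eligible
}
```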
I will have some time during the first week of February to do some of the Legion work needed to use this interface, so I'd like to have the interface design at least finished at that point so I can write code to it. It would also be good to have a version of Realm that I can compile and possibly run against, even if it only dispatches back to the CPU path for now, just to confirm that things are mostly working, and then you'll be able to turn on functionality when you're ready.
Are you happy with the API design as is currently pushed to review-tiling for image? Once that's locked down, it should be pretty quick to finish out the rest of the operations, interface first, then functionality.
Ok, I'm waffling on the API some more. I realized that Realm probably needs to tell us which memory or memories to place the intermediate buffer in. Additionally, we probably want to future-proof this API a little bit, so let's create a struct for the suggested output configuration for temporary buffers. We can add new fields to the struct in the future without breaking existing user code. Therefore, the API should look something like this: You should be able to reuse this same structure across all deppart API calls that need an intermediate buffer. Thoughts?
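A minimal sketch of what such a suggested-configuration struct might look like, under the assumptions discussed so far (a placement memory plus the min/optimal size bounds). The names `TempBufferConfig`, `suggested_memory`, `min_size`, and `optimal_size` are hypothetical, and `Memory` here is a toy stand-in for the Realm handle:

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for Realm's Memory handle.
struct Memory { unsigned id = 0; };

// Realm fills this in to tell the caller where the intermediate buffer
// should live and how big it needs to be. Because callers receive it as a
// struct, new fields can be appended later without breaking existing user
// code, which is the future-proofing goal described above.
struct TempBufferConfig {
  Memory suggested_memory;        // where Realm wants the buffer placed
  std::size_t min_size = 0;       // sound lower bound: operation is correct
  std::size_t optimal_size = 0;   // conservative upper bound: peak performance
  // future fields appended here
};
```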
This looks fine to me. Rohan also suggested having something like a std::map&lt;Memory, ...&gt;, where the ... is metadata describing the buffer needed in that memory (from the runtime) plus the buffer itself (from the user). This assumes the partitioning call has instances in multiple different device memories (which Rohan said is the Legion preimage behavior), and we want to dispatch a GPUMicroOp for each device in which you get an instance, each with its own buffer in the right memory. I like the idea of using a struct; it seems increasingly likely that we'll have further updates.
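The per-memory map suggested here could look something like the following sketch. `PerMemoryBuffer` is a hypothetical name, and `Memory`/`RegionInstance` are toy stand-ins for the Realm handles:

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Toy stand-ins for the Realm handle types.
struct Memory {
  unsigned id = 0;
  bool operator<(Memory o) const { return id < o.id; }
};
struct RegionInstance { int id = -1; };

// One entry per device memory that holds instances for the partitioning
// call, pairing the runtime's size requirement with the user's buffer.
struct PerMemoryBuffer {
  std::size_t required_size = 0;  // filled in by the runtime's estimate
  RegionInstance buffer;          // supplied by the user in that memory
};

// One GPUMicroOp would then be dispatched per entry, each using the
// buffer resident in its own memory.
using BufferMap = std::map<Memory, PerMemoryBuffer>;
```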
I think we might need to have a short meeting to discuss preimage/by-field, since they both have this behavior where it is actually better to give Realm multiple nodes' worth of data in a single API call and then have Realm build acceleration data structures that are used for all the micro-ops that it issues. I'd like to understand a bit more about how you're going to handle those cases, and then we can discuss the API design. Will talk with @rohany about when we can set up a meeting tomorrow.
@rohany @lightsighter
Here's an MVP of the new API for just image with no more realm-internal dynamic allocations and a two-pass estimate -> call (the new function is in indexspace.h). The "M" is pulling a lot of weight in MVP, but now it should be easier to concretely think about wiring it up for consumers.
While writing the code for the estimate min/optimal sizes (src/realm/deppart/image.cc:42), I found that it's quite difficult to reason about the correspondence between buffer size and tiling performance as a developer, and I'm struggling to find an estimate that doesn't pass that difficulty on to the user in the form of a drastic difference between the minimum and optimal sizes. @lightsighter - from the code as written, are you able to develop an idea of how you'd have Legion choose a buffer size, and what information you'd want from the realm deppart operation to help that decision? Right now, I think the minimum size is close to right, and the optimal size is a placeholder 5 * volume as I haven't yet done the math to get an upper bound for 1 tile as a function of the volume. If you have any questions on what's going on in any of the code, I'm also available via Slack.
Do you have any questions/concerns/changes for the general framework? Once we (mostly) lock down the new API, next steps will be to finish implementing the other operations, start wiring it all up to Legion, and make some more precise performance measurements. As @rohany may have mentioned to you, we're planning to submit to SC26 at the end of March, and the experiments we have in mind are: