We never created a calling convention to differentiate between function results that may alias a global allocation and those that can be assumed uniquely transferred to the caller. This matters because, when receiving a result of unknown provenance, the caller needs to decide whether to do stream-ordered deallocation scheduling or to rely on global refcount-based destruction. Making the wrong decision can have very adverse outcomes (see the sketch after the list):
* Taking ownership and scheduling async deallocation on a buffer that is actually globally held can result in prematurely freed resources (a use-after-free).
* Failing to take ownership and falling back on global deallocation carries a very large performance penalty on most backends, since they must adopt very conservative, stop-the-world deallocation behavior.
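To make the trade-off concrete, here is a minimal sketch of the caller-side decision. All names in it (`release_result`, `schedule_dealloc`, `release_ref`, the semaphore argument) are hypothetical stand-ins, not the actual runtime API:

```python
def release_result(queue, buffer, last_use_semaphore, *, assume_no_alias: bool):
    # Sketch only: `queue.schedule_dealloc` and `buffer.release_ref` stand in
    # for the runtime's real entry points.
    if assume_no_alias:
        # Aggressive path: the free is scheduled on the device queue, ordered
        # after the buffer's last use. If the result secretly aliases a
        # globally held allocation, that owner may still be using the memory
        # when the queued free executes: a premature free.
        queue.schedule_dealloc(buffer, wait_for=last_use_semaphore)
    else:
        # Conservative path: drop this reference and let global refcounting
        # destroy the buffer. Always safe, but on most backends it forces
        # stop-the-world deallocation behavior.
        buffer.release_ref()
```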
While we would ideally like to be conservatively correct, the performance implications are too great to ignore, especially considering that we have no examples of real programs that violate the aliasing rule. While I can imagine them, it is not even straightforward to construct such scenarios with current frontends.
There is an override that can be used by the programmer when they know aliasing is possible. For the initial patch landing, the `assume_no_alias` flag will be on, forcing the aggressive behavior. It can be enabled or disabled at the process or invocation level:
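Both knobs are sketched below using the attribute names from the change summary that follows; the surrounding objects and any import are hypothetical.

Per invocation:

```python
# Hypothetical usage: `invocation` is a ProgramInvocation instance for one
# call. Turning the flag off marks this call's results as possibly aliasing
# a global allocation, opting them out of aggressive async deallocation.
invocation.assume_no_alias = False
```

Globally:

```python
# Hypothetical usage: flips the process-wide default for all invocations.
ProgramInvocation.global_no_alias = False
```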
* Device allocations are now async, queue-ordered alloc/dealloc.
* Program invocations asynchronously deallocate function call results
  when they can. When they cannot, a small tracy zone
  `SyncImportTimelineResource` is emitted per result that cannot be
  async deallocated.
* Adds the `ProgramInvocation.assume_no_alias` instance boolean to
  disable the assumption that allows async deallocation to work.
* Adds a global `ProgramInvocation.global_no_alias` property to control
  this process-wide.
This is a very fiddly optimization which requires (especially in multi-device cases) a number of things to line up. Tested on amdgpu and CPU with a number of sample workloads (with logging enabled and visually confirmed).
See #980 for detailed analysis and further work required.
Keeping this issue open as documentation and to track complete resolution.