-
Notifications
You must be signed in to change notification settings - Fork 2
Description
Hi @zfergus ,
I was recently learning a bit more about parallelizing contact simulation on the GPU, specifically focused on broad phase collision candidate determination. Your related article Time of Impact Dataset for Continuous Collision Detection and a Scalable Conservative Algorithm, as well as this public repository were both quite helpful to me.
I wanted to inquire about the reasoning behind STQ mapping better to the GPU than SAP. I'm a bit confused about this. In both approaches, the preprocessing and sorting phases are the same, i.e. we
- Compute bounding boxes
- Compute their statistical mean and variance
- Sort them along largest variance axis
The difference between STQ and SAP is the sweep phase. In both approaches, algorithmically, we associate 1 thread per bounding box i, which incrementally does look-aheads, i.e. starting from its immediate next bounding box neighbor in sorted order, it keeps looking at the following neighbor as a potential overlap candidate j until j does not overlap with i along the sort axis. For every potential overlap candidate j, if there is no topological incidence between i and j and if they both overlap along all other axis', then they should be enqueued as collision candidates in a device-wide atomic fashion. However, STQ (as opposed to SAP) has additional synchronization steps, i.e. warp-wide synchronization, on top of the baseline device-wide synchronization. Work does not seem to be shared between threads of a common warp, i.e. there is no work stealing, so each thread in STQ has the same amount of work as it would have in SAP.
Do you mind elaborating on why STQ is a more favorable approach on the GPU versus SAP? I'm sure I'm missing something. There must be a reason that using a warp-wide shared queue is preferable.