PR #119 leverages our CUDA.jl tasking system.
However, launch(...) will call Legate.default_alignment(...) on all inputs and outputs.
This is fine for standard elementwise operations like:
a .+ b
However, it is inefficient for stencil computations where inputs are shifted views of the same array.
For example, consider a 6×6 grid split across two tasks, where task 0 owns the top half and task 1 owns the bottom half:
grid = zeros(6, 6)
center = grid[2:5, 2:5] # 4×4 interior
south = grid[3:6, 2:5] # shifted down by 1 row
With default_alignment, center and south get identical tile boundaries:
# task 0
center_t0 = grid[2:3, 2:5]
south_t0 = grid[3:4, 2:5]
# task 1
center_t1 = grid[4:5, 2:5]
south_t1 = grid[5:6, 2:5]
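The tile bounds above can be reproduced in plain Julia, without Legate. This is only a sketch: `tile_rows` and `to_grid_rows` are hypothetical helpers written for illustration, and the even row split is an assumption about how two tasks would divide a 4-row view.

```julia
# Hypothetical helper: split `nrows` view rows evenly across `ntasks` tasks.
# Assumes nrows is divisible by ntasks; `task` is 0-based as in the example above.
function tile_rows(nrows::Int, ntasks::Int, task::Int)
    chunk = nrows ÷ ntasks
    return (task * chunk + 1):((task + 1) * chunk)
end

# Hypothetical helper: map rows of a view back to rows of the parent grid,
# given the grid row where the view starts.
to_grid_rows(view_rows, first_grid_row) = view_rows .+ (first_grid_row - 1)

center_first = 2   # center = grid[2:5, 2:5]
south_first  = 3   # south  = grid[3:6, 2:5]

for task in 0:1
    rows = tile_rows(4, 2, task)   # alignment: the same tile for both views
    println("task $task: center = grid[", to_grid_rows(rows, center_first),
            ", 2:5], south = grid[", to_grid_rows(rows, south_first), ", 2:5]")
end

# Grid rows that task 0 reads through south but task 1 covers through center
shared = intersect(to_grid_rows(tile_rows(4, 2, 0), south_first),
                   to_grid_rows(tile_rows(4, 2, 1), center_first))
println("shared rows: ", shared)
```

The tiles are identical in each view's own index space but map to different grid rows, so the two tasks touch overlapping parts of the parent array.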
This is a problem for stencil computations because the shifted views require halo data from neighboring partitions: here, south_t0 reads grid row 4, which task 1 also covers through center_t1. For a Jacobi stencil using all four neighbors, this creates halo copies for north, south, east, and west every iteration.
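For reference, a four-neighbor Jacobi sweep over shifted views looks roughly like this in plain Julia. It is a single-process sketch that uses `@view` as a stand-in for the shifted array views; it does not go through launch(...) or Legate, and the boundary values are arbitrary.

```julia
# One Jacobi sweep on a 6×6 grid, written with shifted views of the same array.
grid = zeros(6, 6)
grid[1, :] .= 1.0                  # arbitrary boundary condition, for illustration only

center = @view grid[2:5, 2:5]      # 4×4 interior
north  = @view grid[1:4, 2:5]      # shifted up by 1 row
south  = @view grid[3:6, 2:5]      # shifted down by 1 row
west   = @view grid[2:5, 1:4]      # shifted left by 1 column
east   = @view grid[2:5, 3:6]      # shifted right by 1 column

# Materialize the average first so the update never reads partially updated values.
new_center = 0.25 .* (north .+ south .+ west .+ east)
center .= new_center
```

All five inputs alias the same parent array, which is why aligning their tiles independently forces data movement between tasks.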
Instead, this should be represented with a bloat constraint:
bloat(source=center, bloat=south, low=0, high=1)
This gives each task the necessary overlap in its physical instance:
# task 0
center_t0 = grid[2:3, 2:5]
south_t0 = grid[3:5, 2:5]
# task 1
center_t1 = grid[4:5, 2:5]
south_t1 = grid[4:6, 2:5]
Now the required halo rows are included by construction. The overlap is handled once during partitioning rather than copied every iteration.
For example, with 1000 Jacobi iterations over a 1000×1000 grid and four neighbors:
- default_alignment: pays halo-copy cost 4000 times
- bloat: pays overlap cost once at partitioning time
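The 4000 figure is just iterations times neighbors, as this short sketch shows. It assumes one halo exchange per neighbor view per iteration and does not model how much data each exchange moves.

```julia
iterations = 1000
neighbors  = 4                          # north, south, east, west

# Assumption: one halo exchange per shifted view per iteration
halo_copies_aligned = iterations * neighbors
overlap_setups_bloat = 1                # overlap established once at partitioning time

println("default_alignment: ", halo_copies_aligned, " halo copies")
println("bloat: ", overlap_setups_bloat, " overlap setup")
```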
So we likely need a way for CUDA.jl tasks to specify Legate partitioning constraints other than default_alignment.
See more details here about the various constraints in Legate.