The project implements a 2D diffusion equation solver in Julia using ImplicitGlobalGrid.jl and ParallelStencil.jl. The solver is designed for multi-xPU parallelized execution to achieve high-performance, scalable simulation of a pure diffusion process by solving the time evolution of an initial Gaussian concentration distribution. The project was implemented and tested on Piz Daint multi-xPU nodes provided by the Swiss National Supercomputing Centre (CSCS), as part of the course "Solving Partial Differential Equations in Parallel on Graphic Processing Units" at ETH Zürich.
The governing partial differential equation (PDE) for the simulation is:

$$\frac{\partial C}{\partial t} = \nabla \cdot (D \nabla C)$$
where:

- $C(x, y, t)$ is the concentration field.
- $D$ is the diffusion coefficient.
- $\nabla \cdot (D \nabla C)$ is the flux-divergence term governing diffusive transport (equal to $D \nabla^2 C$ for constant $D$).
- Discretization and Numerical Scheme
- Parallelization and Communication Hiding
- Reproducing the Diffusion Simulation Using 4 GPU Nodes
- Validation: Comparing Single-xPU Outputs
- Validation: Comparing Single-GPU and Multi-GPU Outputs
- Strong Scaling
- Weak Scaling
- Hiding Communications
The implementation employs finite differences for spatial discretization:

- **Flux Calculation:** the fluxes in the $x$- and $y$-directions are discretized as:

  $$q_x = -D \frac{\partial C}{\partial x}, \qquad q_y = -D \frac{\partial C}{\partial y}$$

- **Divergence of Flux (FVM Approximation):** using the conservation form, the update step for $C$ is:

  $$C^{n+1} = C^{n} - \Delta t \left( \frac{\partial q_x}{\partial x} + \frac{\partial q_y}{\partial y} \right)$$

  where $\Delta t$ is the time step and the derivatives are approximated by central differences on the staggered flux grid.
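For reference, the discretization above can be sketched as a minimal serial Julia loop. This is an illustrative single-process sketch, not the project's actual (ParallelStencil-based) implementation; grid sizes and parameters are placeholders.

```julia
# Minimal serial sketch of the conservative finite-difference update.
# Illustrative only; the project implements this as a ParallelStencil kernel.
function diffusion_step!(C2, C, D, dt, dx, dy)
    nx, ny = size(C)
    for iy in 2:ny-1, ix in 2:nx-1
        # Fluxes on the four cell faces: qx = -D ∂C/∂x, qy = -D ∂C/∂y
        qxE = -D * (C[ix+1, iy] - C[ix,   iy]) / dx
        qxW = -D * (C[ix,   iy] - C[ix-1, iy]) / dx
        qyN = -D * (C[ix, iy+1] - C[ix, iy  ]) / dy
        qyS = -D * (C[ix, iy  ] - C[ix, iy-1]) / dy
        # Conservation form: C^{n+1} = C^n - Δt (∂qx/∂x + ∂qy/∂y)
        C2[ix, iy] = C[ix, iy] - dt * ((qxE - qxW) / dx + (qyN - qyS) / dy)
    end
    return C2
end
```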
The parallelization design combines ImplicitGlobalGrid.jl for domain decomposition with ParallelStencil.jl for high-performance kernel execution.
We initialize the solver dynamically to target multi-GPU execution or multi-threaded CPUs, depending on the user configuration. For example, GPU execution is set up as follows:
```julia
@init_parallel_stencil(CUDA, Float64, 2)
```
The 2D computational domain is globally partitioned across the available processors (xPUs) using ImplicitGlobalGrid.jl. This allows each process to work on its local subdomain while synchronizing boundary data through halo exchanges. The grid updates rely on parallel kernel execution with nested parallelism, enabling simultaneous computation across dimensions. A representative update step is shown here:
```julia
@parallel_indices (ix, iy) function compute!(C2, C, ...)
    C2[ix+1, iy+1] = C[ix+1, iy+1] - dt * ((@qx(...) - @qx(...)) * _dx + ...)
end
```
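A hedged sketch of how the elided flux macros and kernel body might look, following the ParallelStencil `FiniteDifferences2D` style used in the course material (the macro names `@qx`/`@qy` and the exact argument list are assumptions; the project's script may differ):

```julia
# Assumed flux macros (the actual definitions in the script may differ):
# D_dx = D/dx and D_dy = D/dy are precomputed scalars.
macro qx(ix, iy) esc(:(-D_dx * (C[$ix+1, $iy+1] - C[$ix, $iy+1]))) end
macro qy(ix, iy) esc(:(-D_dy * (C[$ix+1, $iy+1] - C[$ix+1, $iy]))) end

@parallel_indices (ix, iy) function compute!(C2, C, D_dx, D_dy, dt,
                                             _dx, _dy, size_C1_2, size_C2_2)
    # Update interior points only; halo cells are filled by update_halo!.
    if ix <= size_C1_2 && iy <= size_C2_2
        C2[ix+1, iy+1] = C[ix+1, iy+1] - dt * (
            (@qx(ix + 1, iy) - @qx(ix, iy)) * _dx +
            (@qy(ix, iy + 1) - @qy(ix, iy)) * _dy)
    end
    return
end
```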
To further enhance performance, halo updates are overlapped with computation using the `@hide_communication` construct, reducing idle time during inter-subdomain data exchanges. Performance metrics such as the effective memory bandwidth (`T_eff`) are tracked to evaluate efficiency and to optimize the implementation as needed.
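`T_eff` is conventionally defined as the effective memory accessed per iteration divided by the time per iteration. A sketch under the assumption that each iteration reads and writes the concentration array once (the helper name is illustrative):

```julia
# Sketch: effective memory throughput in GB/s, assuming C is read once and
# written once per iteration (A_eff = 2 arrays of nx*ny Float64 values).
function effective_throughput(nx, ny, niter, t_tot; wordsize = sizeof(Float64))
    A_eff = 2 * nx * ny * wordsize / 1e9   # GB moved per iteration
    t_it  = t_tot / niter                  # seconds per iteration
    return A_eff / t_it                    # GB/s
end
```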
For a comprehensive discussion of the implementation details and parallelization design, please refer to Parallelization.md.
The animation above shows the diffusion process over a 2D grid using distributed computing across multiple GPUs. To reproduce the animation, we ran the multi-xPU script using 4 compute nodes on Piz Daint:
```
srun -n 4 julia --project=. diffusion_2D_perf_multixpu.jl
```
Sample Output:

```
Global grid: 250x250x1 (nprocs: 4, dims: 2x2x1)
Animation directory: ./viz2D_mxpu_out/
Time = 37.974 sec, T_eff = 0.02 GB/s (niter = 2553)
Time = 37.996 sec, T_eff = 0.02 GB/s (niter = 2553)
Time = 38.005 sec, T_eff = 0.02 GB/s (niter = 2553)
Time = 38.028 sec, T_eff = 0.02 GB/s (niter = 2553)
[ Info: Saved animation to /scratch/snx3000/class203/diffusion_2D_mxpu.gif
```
We use the script l8_diffusion_2D_perf_xpu.jl, which is designed for single-xPU execution (no domain decomposition or inter-xPU communication), to validate that CPU and GPU runs with the same parameters produce consistent outputs. This confirmed that the parallel finite-difference updates behave reliably on both multi-threaded CPUs and GPUs before extending the implementation to a multi-xPU setup.
| ![]() | ![]() |
|---|---|
| Fig. 1: Simulation executed on GPU (`USE_GPU = true`). | Fig. 2: Simulation executed on CPU (`USE_GPU = false`). |
We used the script compare_outputs.jl to numerically measure and visualize the maximum absolute difference between the two outputs. Running the script produced the following output:

```
Maximum difference between CPU and GPU outputs: 2.7755575615628914e-17
```
The field values for each run were saved in the files C_output_gpu_true.jld2 and C_output_gpu_false.jld2, respectively. The absolute difference between the CPU and GPU runs is visualized in the following heatmap (note the colorbar values, which show negligible differences):
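A comparison along these lines can be sketched as follows (the file names are taken from the text above; the JLD2 key `"C"` is an assumption, and the actual compare_outputs.jl may differ):

```julia
using JLD2  # the .jld2 file names suggest the outputs were saved with JLD2

# Load both runs and report the maximum absolute difference.
# The dataset key "C" is an assumption about how the fields were saved.
C_gpu = load("C_output_gpu_true.jld2", "C")
C_cpu = load("C_output_gpu_false.jld2", "C")
println("Maximum difference between CPU and GPU outputs: ",
        maximum(abs.(C_gpu .- C_cpu)))
```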
To demonstrate the performance and scalability differences between single-GPU (l8_diffusion_2D_perf_xpu.jl) and multi-GPU (diffusion_2D_perf_multixpu.jl on 4 GPU nodes) execution, we compare runs using the same problem size and parameters. The multi-GPU setup partitions the domain across multiple GPUs, enabling parallel computation with inter-xPU communication, while the single-GPU run processes the entire domain locally.
| ![]() | ![]() |
|---|---|
| Fig. 3: Animation generated by the current `diffusion_2D_perf_multixpu.jl` script. | Fig. 4: Animation generated by the `l8_diffusion_2D_perf_xpu.jl` script, executed on GPU (`USE_GPU = true`). |
We used the script compare_multixpu_output.jl to numerically compare the differences:

```
Normalized difference between Multi-xPU and Single xPU outputs: 0.009627067026212391
Multi-xPU data range: 9.538779518364063e-7 to 0.20249236031530343
Single xPU data range (inner): 1.048443464878423e-6 to 0.19975776702637837
Sample Multi-xPU: [9.538779518364063e-7 1.9248901673983545e-6 2.930264433408078e-6 3.987413094031169e-6 5.114019114712718e-6; 1.9248901673983545e-6 3.884354834904642e-6 5.913157193460889e-6 8.046432794307663e-6 1.0319863188812344e-5; 2.930264433408078e-6 5.913157193460889e-6 9.001593763648288e-6 1.2249056298186256e-5 1.5709857869935528e-5; 3.987413094031169e-6 8.046432794307663e-6 1.2249056298186256e-5 1.6668051586943227e-5 2.1377316849179705e-5; 5.114019114712719e-6 1.0319863188812344e-5 1.5709857869935528e-5 2.1377316849179705e-5 2.7417002585183226e-5]
Sample Single xPU: [1.0484434648784232e-6 2.1151006657843403e-6 3.2182763586032985e-6 4.376454752446499e-6 5.608382779576605e-6; 2.1151006657843403e-6 4.266943365440956e-6 6.4924537195998065e-6 8.828919304939611e-6 1.1314157607796204e-5; 3.218276358603299e-6 6.492453719599806e-6 9.878712340398283e-6 1.3433783050808211e-5 1.7215202724471308e-5; 4.376454752446498e-6 8.828919304939611e-6 1.3433783050808211e-5 1.826818377783561e-5 2.34103549250958e-5; 5.608382779576605e-6 1.1314157607796204e-5 1.7215202724471308e-5 2.34103549250958e-5 2.9999854323200893e-5]
```
The small discrepancy between the multi-xPU and single-xPU outputs (normalized difference: 0.0096) can be attributed to differences in floating-point accumulation order and in the handling of boundary/halo cells introduced by the multi-GPU domain decomposition.
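One way to compute such a normalized difference is to scale the maximum absolute difference by the range of the reference data; a sketch (the exact metric used in compare_multixpu_output.jl may be defined differently):

```julia
# Sketch of a normalized difference metric: maximum absolute difference
# scaled by the reference data range. The script's actual definition may differ.
function normalized_difference(A, B)
    return maximum(abs.(A .- B)) / (maximum(B) - minimum(B))
end
```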
We developed a script for the benchmarking experiments (diffusion_2D_perf_multixpu_benchmarking.jl), along with run_strong_scaling.jl to execute the benchmarking process.
The strong-scaling scheme measures the effective memory throughput (`T_eff`) as the problem size is held fixed while the computational resources (xPUs) increase. Ideally, `T_eff` increases roughly linearly with the number of processes.
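A driver along the lines of run_strong_scaling.jl can be sketched as a loop launching the benchmark with an increasing process count while the global problem size stays fixed (the process counts shown are illustrative):

```julia
# Sketch of a strong-scaling driver: fixed global problem size, increasing
# process count. The actual run_strong_scaling.jl may be structured differently.
for np in (1, 4, 16, 64)
    run(`srun -n $np julia --project=. diffusion_2D_perf_multixpu_benchmarking.jl`)
end
```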
We ran the script as follows:
```
srun -n 1 julia --project=. run_strong_scaling.jl
```
Running the script saves the results into a text file and creates a corresponding visualization:

| Grid Size | `T_eff` (GB/s) |
|---|---|

The effective memory throughput `T_eff` is reported for each grid size in the saved results and the generated plot.
We developed the script diffusion_2D_perf_multixpu_weak_scaling.jl for the benchmarking experiments and created run_weak_scaling.jl to perform the benchmarking process.
The weak-scaling scheme measures the normalized execution time as the problem size per process is held constant while the number of processes (`np`) increases. Ideally, the normalized execution time remains constant as `np` grows.
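In weak scaling the local grid per process is fixed, so the global grid grows with the process grid; the relationship can be sketched as follows (values are illustrative, and the sketch ignores the halo cells that ImplicitGlobalGrid shares between neighboring subdomains):

```julia
# Weak scaling sketch: the local subdomain size stays fixed, so the global
# grid grows with the process grid. Values are illustrative; halo-overlap
# cells (which ImplicitGlobalGrid accounts for) are ignored here.
nx_local, ny_local = 128, 128
px, py = 2, 2                  # process grid, np = px * py
nx_global = px * nx_local      # global grid grows with px
ny_global = py * ny_local      # global grid grows with py
```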
We ran the script using the following command:
```
srun -n 1 julia --project=. run_weak_scaling.jl
```
Running the script saves the results into a text file and creates a corresponding visualization:

| Number of Processes (`np`) | Normalized Execution Time |
|---|---|
The normalized execution time increases with the number of processes (`np`), deviating from the ideal flat profile, which we attribute primarily to growing halo-exchange communication overhead.
Additional Note: for weak scaling, we adjusted the local grid size so that the workload per process remained constant. At larger process counts (e.g., `np = 64`), job steps could be delayed while the requested nodes were busy:

```
Running for np = 64...
srun: Job 58273220 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Job 58273220 step creation still disabled, retrying (Requested nodes are busy)
srun: Job 58273220 step creation still disabled, retrying (Requested nodes are busy)
```
Hiding communication overlaps halo updates with computation, aiming to reduce idle time and improve parallel performance by minimizing synchronization overhead. We tested configurations using `@hide_communication` with varying overlap widths (e.g., `(2,2)`, `(8,2)`) against a baseline without hiding. Ideally, increased overlap should lead to lower normalized execution times, reflecting improved efficiency.
```julia
for it = 1:nt
    @hide_communication (tuple_val[1], tuple_val[2]) begin
        @parallel compute!(C2, C, 1.0 / dx, 1.0 / dy, dt, 1.0 / dx, 1.0 / dy,
                           size(C, 1) - 2, size(C, 2) - 2)
        C, C2 = C2, C
        update_halo!(C)
    end
end
```
We used the scripts diffusion_2D_perf_multixpu_hide_comm.jl and run_hide_comm_benchmarking.jl for this task.
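The comparison can be sketched as timing each overlap configuration against the no-hiding baseline and normalizing (the `run_diffusion` helper below is hypothetical, standing in for a full solver run; the actual benchmarking script may be organized differently):

```julia
# Sketch: normalize each @hide_communication configuration against a
# baseline without hiding. run_diffusion is a hypothetical helper that
# executes the full time loop with the given overlap setting.
t_baseline = @elapsed run_diffusion(; hide = nothing)
for tuple_val in ((2, 2), (8, 2))
    t = @elapsed run_diffusion(; hide = tuple_val)
    println("overlap $(tuple_val): normalized time = ", t / t_baseline)
end
```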
| Configuration | Normalized Execution Time |
|---|---|
The results show minimal improvement: the normalized execution times are only slightly reduced relative to the baseline without communication hiding. These results were obtained with the benchmarking scripts and configurations described above; at this problem size, halo exchange occupies only a small fraction of each iteration, which limits the benefit of overlapping communication with computation.