Commit 2f34d0f

Merge pull request #6 from FireDynamics/release-v0.2
Release v0.2
2 parents a6c4dba + aeea376 commit 2f34d0f

File tree

16 files changed: +222 -3 lines changed


book/_toc.yml

Lines changed: 3 additions & 2 deletions
@@ -52,6 +52,7 @@
 - file: content/tools/02_hpc/00_overview
   sections:
   - file: content/tools/02_hpc/01_linux
-  # - file: content/tools/02_hpc/02_hpc
-  - file: content/tools/02_hpc/03_parallel_fds
+  - file: content/tools/02_hpc/02_hpc
+  # - file: content/tools/02_hpc/03_parallel_fds
+

book/content/references.bib

Lines changed: 13 additions & 0 deletions
@@ -1,3 +1,16 @@
+@techreport{Haarhoff.2014,
+  author = {Haarhoff, Daniel and Arnold, Lukas},
+  title = {{P}erformance {A}nalysis and {S}hared {M}emory
+           {P}arallelisation of {FDS}},
+  pages = {13},
+  year = {2014},
+  month = {Sep},
+  institution = {Fire and Evacuation Modelling
+                 Technical Conference 2014, Gaithersburg
+                 (USA), 8 Sep 2014 - 10 Sep 2014},
+  url = {https://juser.fz-juelich.de/record/156014},
+}
+
 @techreport{CFAST7-TR.2021,
   year = {2021},
   author = {Peacock, Richard D. and McGrattan, Kevin B. and Forney, Glenn P. and Reneke, Paul A.},

Lines changed: 154 additions & 1 deletion
@@ -1 +1,154 @@
# High Performance Computing

## Overview

Many scientific and engineering problems require a large amount of computing time and memory, which is generally only available on supercomputers, i.e. high performance computing (HPC) systems.

Typical applications are:

* science: particle physics, climate research, molecular dynamics
* engineering: CFD, structural mechanics, computer science

With a computing power thousands of times larger than that of a personal computer, HPC systems significantly reduce computing times and allow new modelling challenges to be addressed. This is also becoming true for the fire safety science and engineering communities.

*Example:* A CFD application that runs for a week on a thousand processors would need about 20 years on a personal computer (assuming it provides enough memory).

The evolution of computer technology is one of the fastest developments of all technologies. Taking the number of transistors as a measure of a chip's computing performance, this number doubles roughly every two years. This observation is called [Moore's law](https://en.wikipedia.org/wiki/Moore%27s_law), see {numref}`fig-hpc-moore`.

:::{figure-md} fig-hpc-moore
<img src="https://upload.wikimedia.org/wikipedia/commons/0/00/Moore%27s_Law_Transistor_Count_1970-2020.png" width="80%">

The development of the transistor count over the last five decades, as represented by Moore's law. Source: [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Moore%27s_Law_Transistor_Count_1970-2020.png).
:::

## Parallel Execution

### General Principle

An application is, in general, a sequence of tasks. Here, a task can be understood as a group of commands or computations that cannot be separated any further, e.g. due to data dependencies.

Yet, there is not necessarily a dependence among the tasks. Thus, they can eventually be executed at the same time, see {numref}`fig-hpc-fork`. The process of distributing the tasks on multiple execution units (threads, cores, nodes) is called a fork, while gathering the results back on the main execution unit is called a join. During the fork, data and instructions have to be exchanged, which introduces an overhead that is not present in a serial execution. The same is true for the join process, with an additional contribution to the overhead caused by load balancing: depending on the execution time of the individual tasks, especially the longest one, the join process is prolonged.

:::{figure-md} fig-hpc-fork
<img src="figs/fork.png" width="80%">

Serial and parallel execution of a set of tasks.
:::

### Speedup

The maximal achievable speedup $\mf s$, w.r.t. a serial execution, on $\mf n$ computing units is limited by the fraction $\mf p$ of the execution that cannot be parallelised. This is reflected by Amdahl's law, see equation {eq}`eq-amdahls-law` and {numref}`fig-hpc-amdahl`.

$$
s(n, p) = \frac{1}{p+\frac{1}{n}(1-p)}
$$(eq-amdahls-law)
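
As an illustration with hypothetical numbers: if 5 % of the execution cannot be parallelised, i.e. $p = 0.05$, then even on $n = 1000$ computing units the speedup is limited to

$$
s(1000, 0.05) = \frac{1}{0.05 + \frac{1 - 0.05}{1000}} \approx 19.6 ,
$$

and for $n \to \infty$ it approaches $1/p = 20$.
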
:::{figure-md} fig-hpc-amdahl
<img src="figs/amdahl.png" width="80%">

Visualisation of Amdahl's law.
:::

The speedup may even decrease, as some operations become more expensive with an increasing number of participating processes, e.g. synchronisation points, global data exchange or broadcasts.

In cases where a large number of execution units is involved and the tasks cannot be decomposed well, an unequal work load per process may become the main problem. The execution time of one iteration is in general determined by the longest task execution time. For example, if three tasks take one second each and a fourth task takes four seconds, four execution units still need four seconds per iteration, i.e. a speedup of only 7 s / 4 s = 1.75 instead of 4.

:::{figure-md} fig-hpc-load-balance
<img src="figs/load_balance.png" width="80%">

Example of load balancing.
:::

### Parallelisation

In general, there is a zoo of possibilities to parallelise an application. In FDS, two of them are utilised:

* [Message Passing Interface (MPI)](https://en.wikipedia.org/wiki/Message_Passing_Interface), and
* [Open Multi-Processing (OpenMP or OMP)](https://en.wikipedia.org/wiki/OpenMP).

MPI is a language-independent communication protocol used to develop parallel computer programs and has its main application on distributed memory machines. The interface allows processes to explicitly exchange messages. It is therefore up to the programmer to design appropriate data structures and algorithms to decompose the application.

In CFD, typically the computational domain is explicitly decomposed and distributed over all MPI processes. For stencil operations only neighbouring cells are needed, thus only a domain's halo needs to be shared.

Typically, the data exchange (compose, transfer, decompose) produces most of the overhead. This overhead is purely due to the explicit problem decomposition.

:::{figure-md} fig-hpc-domain-mpi
<img src="figs/decomp_mpi.png" width="40%">

Domain decomposition of a simple computational grid, like in FDS. The data communication (arrows) is handled with MPI.
:::
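
To make the halo exchange concrete, the following minimal C sketch (an illustration of the general pattern only, not code from FDS) decomposes a 1D array over the MPI processes and exchanges one halo cell with each neighbour using `MPI_Sendrecv`:

```c
/* Minimal 1D halo exchange sketch (illustrative only, not FDS code).
   Compile e.g. with: mpicc halo.c -o halo                            */
#include <mpi.h>
#include <stdio.h>

#define N_LOCAL 8   /* number of grid cells owned by each process */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local array with one halo cell on each side: [halo | owned cells | halo] */
    double u[N_LOCAL + 2];
    for (int i = 1; i <= N_LOCAL; i++)
        u[i] = rank;               /* fill the owned cells with the rank number */
    u[0] = u[N_LOCAL + 1] = -1.0;  /* halo cells, to be filled by the neighbours */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send the first owned cell to the left, receive the right halo ... */
    MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                 &u[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... and the last owned cell to the right, receive the left halo */
    MPI_Sendrecv(&u[N_LOCAL], 1, MPI_DOUBLE, right, 1,
                 &u[0],       1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left halo = %g, right halo = %g\n",
           rank, u[0], u[N_LOCAL + 1]);

    MPI_Finalize();
    return 0;
}
```

After the exchange, each process can apply a stencil to its own cells using the neighbour values stored in the halo; conceptually, the mesh-to-mesh communication sketched in {numref}`fig-hpc-domain-mpi` works the same way in three dimensions.
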
As MPI is only a standard definition, there is not a single implementation of MPI. The most common MPI implementations are:

* Open MPI, open source and free
* MPICH, open source and free
* Intel MPI Library (an MPICH derivative), commercial

The MPI implementations include C/C++/FORTRAN headers, libraries, compiler wrappers and a runtime environment.

In contrast to MPI, OpenMP (or short OMP) is a standard to implicitly distribute the execution on a shared memory system.

OMP is implemented as pragmas for C/C++/FORTRAN and is therefore non-intrusive. The typical application is to decompose loops and advise the compiler or the runtime to distribute the loop iterations. The data is not distributed and stays in place, see {numref}`fig-hpc-domain-omp`.

:::{figure-md} fig-hpc-domain-omp
<img src="figs/decomp_openmp.png" width="40%">

Domain decomposition with OpenMP.
:::

However, the compiler or runtime must detect independent tasks. Loop-carried dependencies may prevent parallelisation and data races may produce unpredictable results.
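
As a generic sketch (not code from FDS), the loops below have independent iterations and can therefore be distributed over the available threads with a single pragma; the second loop uses a reduction to combine the per-thread partial sums safely:

```c
/* Generic OpenMP loop sketch (illustrative only, not FDS code).
   Compile e.g. with: gcc -fopenmp loops.c -o loops                */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];

    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    /* each iteration only touches its own index i, so the iterations are
       independent and can safely be distributed over the threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = y[i] + 3.0 * x[i];

    /* the reduction clause tells the runtime how to combine the
       per-thread partial sums into a single result */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += y[i];

    printf("max threads: %d, sum = %g\n", omp_get_max_threads(), sum);
    return 0;
}
```

Running the same binary with different values of `OMP_NUM_THREADS` (see the environment variables below) changes the number of threads without recompiling.
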
To run an OMP application, there are a few handy environment variables:

* `OMP_NUM_THREADS`, the number of OMP threads
* `OMP_STACKSIZE`, the size of the stack for each thread

The execution command is basically the same as for a serial execution. A combination with MPI is possible; in this case the OMP environment variables should be passed to the MPI processes. The hybrid approach (MPI+OMP) is used to reduce the number of MPI processes and therefore the overhead of non-scaling operations. Details and benchmarks of the OMP and MPI+OMP implementations are given in {cite}`Haarhoff.2014`.
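
A minimal hybrid sketch (again only an illustration, not FDS code) initialises MPI with thread support and opens an OMP parallel region inside each MPI process:

```c
/* Minimal hybrid MPI+OMP sketch (illustrative only, not FDS code).
   Compile e.g. with: mpicc -fopenmp hybrid.c -o hybrid             */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* request a thread support level where only the main thread calls MPI */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* the number of threads per MPI process is taken from OMP_NUM_THREADS */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("MPI rank %d of %d, OMP thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

With, for example, four MPI processes and `OMP_NUM_THREADS=4`, sixteen execution units are used, while only the four processes take part in the MPI communication.
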
## Parallel Computer

### Components

Parallel computers go by various names, e.g. supercomputer, high performance computer or compute cluster. A common modern parallel computer contains the following elements, see also {numref}`fig-hpc-cluster`.

* **Compute nodes** are the nodes of the communication network. They consist of (multiple) CPUs, accelerators (e.g. GPUs), memory and network interfaces.
* **Central Processing Units (CPU)** do the actual computations. Modern CPUs have multiple physical cores, e.g. the AMD EPYC 7452 has 32 cores. Additionally, a node can host multiple processors, each in its own socket.
* **Accelerators** are special hardware capable of solving selected algorithms faster than a CPU. Common representatives are Graphics Processing Units (GPUs).
* An **Interconnect** enables fast communication between the nodes. In general, there are multiple networks, e.g. an Ethernet (e.g. 10 Gbit/s) control network and a fast InfiniBand (e.g. 200 Gbit/s) network for data exchange during computation.
* **Parallel file systems** store the computed data on RAID systems, which are accessible by all nodes.
* **Login nodes** provide a terminal for users to issue computing jobs and to transfer data.
* **Batch systems** control the distribution of user-issued jobs on the whole system. Typically, users define which resources (e.g. number of nodes, execution time) are needed for a job; the batch system then executes the job once the required resources become available.

:::{figure-md} fig-hpc-cluster
<img src="figs/cluster.png" width="40%">

A very general representation of an element / node of a cluster.
:::

### Performance

One way to express the computing power of a computer system is to estimate the number of floating point operations (FLOP) per second it can perform. This theoretical value represents the peak performance. Multiple benchmarks exist to establish an application-oriented value; one of them is based on the [LINPACK library](https://en.wikipedia.org/wiki/LINPACK) for linear algebra. Although this benchmark may in many cases not represent the performance of individual applications, it is commonly used for a global comparison of computer systems, like in the [TOP500 list](https://www.top500.org).
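
As a rough, purely illustrative calculation: a hypothetical CPU with 32 cores, a clock rate of 2.5 GHz and 16 double precision FLOP per core and cycle has a theoretical peak performance of

$$
32~\mathrm{cores} \times 2.5\cdot 10^{9}~\frac{\mathrm{cycles}}{\mathrm{s}} \times 16~\frac{\mathrm{FLOP}}{\mathrm{cycle}} \approx 1.3~\mathrm{TFLOP/s} .
$$

The performance actually achieved by a real application is typically well below this peak value.
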
An interesting representation of the computer evolution is given by the [performance development of the TOP500 list](https://www.top500.org/statistics/perfdevel/). It further demonstrates the fast technological development: a notebook with a theoretical peak performance of about 200 GFLOP/s would have headed the TOP500 list in 1996 and would have stayed on the list until 2002.

### JURECA

The system used for this lecture is [JURECA](https://fz-juelich.de/ias/jsc/EN/Expertise/Supercomputers/JURECA/JURECA_node.html) at the Forschungszentrum Jülich.

:::{figure-md} fig-hpc-jureca-dc
<img src="https://fz-juelich.de/SharedDocs/Bilder/IAS/JSC/EN/galeries/JURECA/JURECA-DC.jpg?__blob=poster" width="80%">

JURECA-DC cluster at the Forschungszentrum Jülich. Source: Forschungszentrum Jülich.
:::

### CoBra Cluster

Within the BMBF-funded project CoBra, the CCE chair will set up a computing cluster with a theoretical peak performance of about 100 TFLOP/s in 2021.

:::{figure-md} fig-hpc-cobra-logo
<img src="./figs/cobra_logo_full.svg" width="80%">

CoBra project logo – an image of the cluster is yet to be included.
:::

book/content/tools/02_hpc/figs/cobra_logo_full.svg

Lines changed: 52 additions & 0 deletions
