PFIO: a High Performance Client-Server I/O Layer
GEOS-5 related applications (such as GEOSgcm, GEOSctm, GEOSldas, GCHP, etc.) produce many output files, organized in several file collections that are created at different time frequencies.
As the model resolution increases, the amount of data produced grows significantly, straining the file system, especially if a single processor is in charge of writing out all the files.
PFIO, a subcomponent of the MAPL package, was designed to facilitate the production of model output files (organized in collections) in a distributed computing environment. PFIO creates output files asynchronously, therefore allowing the model to proceed with calculations without waiting for the I/O to be completed. This leads to a decrease in the overall model integration time.
In the context of GEOS-5, the available nodes (cores) are split into two groups:
- The computing nodes that are reserved for model calculations. These nodes contain the cores that are called `Clients` here.
- The I/O nodes that are grouped to form the PFIO Server. For reading files, we use the name `Iserver`, and when we create outputs, we use instead `Oserver`. In this document, we will focus only on the `Oserver`.
All the file collections to be generated by the MAPL HISTORY (`MAPL_History`) subcomponent are routed through the PFIO Server, which distributes the output files to the I/O nodes (based on the user's configuration set at run time).
One of the features of PFIO is that it can be set to run in the standard Message Passing Interface (MPI) root-processor configuration. This can be appropriate if the model is integrated at low resolution and/or generates only a few file collections.
In this document, we explain how and when to configure the PFIO Server to run on separate resources. We also provide general recommendations on how to properly configure the PFIO Server in order to get the best possible performance. It is important to note that it is up to users to run their application multiple times to determine the optimal PFIO Server configuration.
If users are not aware of the PFIO features and capabilities, they will run their application using only the default MPI processes. The PFIO Server then runs on the same MPI resources as the application. Each time HISTORY is executed, it does not return until the process of writing the data into files (at that particular HISTORY execution) is completed. All the data aggregation and writing is done on the same MPI tasks as the rest of the application, and the model calculations cannot proceed until all output procedures for that step are finished. There is no asynchronicity or overlap between computations and outputs in this case.
Internally, here are the different PFIO Server steps:
- The `Clients` send the data to the `Oserver`.
- All processors in the `Oserver` coordinate to create different shared-memory windows for different collections.
- The processors use one-sided `MPI_PUT` to send the data to the shared memory.
- Different collections are written by different processors. The writing processors are distributed among the nodes as evenly as possible.
- All the other processors have to wait for the writing processors to finish their jobs before responding to the `Clients`' next round of requests.
This configuration of PFIO is suitable when the model runs at low resolution or if there are only a few file collections to produce. If you are, for instance, running the GEOS AGCM at c24/c48/c90 resolution for development purposes with a modest HISTORY output on 2 or 3 nodes, there is no need to dedicate any extra resources to the PFIO Server.
If `executable_file` is the executable file, we can issue the regular `mpirun` (same for `mpiexec`) command:

```
mpirun -np npes executable_file
```

where `npes` is the number of processors.
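As a purely illustrative case (the numbers and executable name are hypothetical), a small run on a single 40-core node could look like this:

```
# Default mode: the MpiServer shares the same 40 MPI tasks as the model,
# so HISTORY output is written synchronously on the compute processes.
mpirun -np 40 ./GEOSgcm.x
```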
In this case, the `MpiServer` is used as the `Oserver`. The `Client` processes overlap with the `Oserver` processes, and the `Client` and `Oserver` work together sequentially. When a `Client` sends data, it actually makes a copy; then the `Oserver` takes over the work, i.e., shuffling the data and writing it to disk. Only after the `MpiServer` is done does the `Client` move on.
To exploit asynchronous output when using HISTORY, we recommend using the `MultiGroupServer` option for the PFIO Server.
With the PFIO Server, the model (or application) does not write the data to disk directly. Instead, the user launches the application on more MPI tasks than are needed for the application itself; the extra MPI tasks are dedicated to running the PFIO Server. When the user chooses the `MultiGroupServer` option, the server is itself split into a `frontend` and a `backend`. Only the `backend` actually writes to disk; the `frontend` of the server functions as a memory buffer.
When HISTORY decides it is time to write, the data is processed if necessary (regridding, for example) into its final form. Then the data is forwarded from the application MPI ranks to the `frontend` of the server, which is on a different set of MPI ranks. As soon as the data is forwarded, the model continues.
Once all the data has been received by the `frontend` of the server, the data is forwarded to the `backend` on yet another set of MPI ranks. In the current implementation, each collection to be written is forwarded to a single `backend` processor, chosen among those that are available. Note that some processors may still be writing from the previous write request; that is fine as long as some resources on the `backend` are still available. Also note that this implies a collection must fit within the memory of a single node.
PFIO follows these steps in the execution of the `MultiGroupServer` option:
- The `Oserver` is divided into a `frontend` and a `backend`.
- When the `frontend` receives the data, its root process asks the `backend`'s root (or head) for an idle process for each collection. Then it broadcasts the information to the other `frontend` processes.
- When the `frontend` processors forward (`MPI_SEND`) the data to the `backend` (different collections to different `backend` processors), they get back to the clients without waiting for the actual writing.
There are many options to configure the `Oserver`.
```
mpirun -np npes executable_file --npes_model n1 --npes_output_server n2
```
- Note that $npes$ is not necessarily equal to $n1+n2$.
- The `client` (model) will use the minimum number of nodes that contain $n1$ cores.
  - For example, if each node has $n$ processors, then $npes = \lceil \frac{n1}{n} \rceil \times n + n$ (assuming here that $n2 \le n$, so the oserver occupies one full node).
- If `--isolate_nodes` is set to false (by default, it is true), the `oserver` and `client` can co-exist on the same node, and $npes = n1 + n2$.
- `--npes_output_server n2` can be replaced by `--nodes_output_server n2`. Then $npes = \lceil \frac{n1}{n} \rceil \times n + n2 \times n$ (see the worked example below).
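As a worked illustration (all numbers are made up), suppose each node has $n = 40$ cores, the model needs $n1 = 150$ cores, and one full node is requested for the oserver with `--nodes_output_server 1`. The model then occupies $\lceil 150/40 \rceil \times 40 = 160$ cores and the total becomes $npes = 160 + 1 \times 40 = 200$:

```
# Hypothetical example: 40-core nodes, 150 model PEs, 1 dedicated oserver node
mpirun -np 200 ./GEOSgcm.x --npes_model 150 --nodes_output_server 1
```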
```
mpirun -np npes executable_file --npes_model n1 --npes_output_server n2 --oserver_type multigroup --npes_backend_pernode n3
```
- For each node of the oserver, $n3$ processes are used as the backend.
- For example, if each node has $n$ cores, then $npes = \lceil \frac{n1}{n} \rceil \times n + n2 \times n$.
- The frontend has $n2 \times (n - n3)$ processes and the backend has $n2 \times n3$ processes (see the worked example below).
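Continuing the same hypothetical numbers (40-core nodes, $n1 = 150$), requesting two oserver nodes with 5 backend processes each gives $npes = \lceil 150/40 \rceil \times 40 + 2 \times 40 = 240$, a backend of $2 \times 5 = 10$ processes, and a frontend of $2 \times 35 = 70$ processes:

```
# Hypothetical example: 2 oserver nodes, 5 backend processes per oserver node
mpirun -np 240 ./GEOSgcm.x --npes_model 150 --npes_output_server 2 \
       --oserver_type multigroup --npes_backend_pernode 5
```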
```
mpirun -np npes executable_file --npes_model n1 --npes_output_server n2 n3 n4
```
- The command creates $n2$-node, $n3$-node, and $n4$-node `MpiServer`s.
- The oservers are independent. The client takes turns sending data to the different oservers.
- If each node has $n$ processors, then $npes = \lceil \frac{n1}{n} \rceil \times n + (n2+n3+n4) \times n$ (see the worked example below).
- Advantage: since the oservers are independent, the client can choose to send the data to an idle oserver.
- Disadvantage: finding an idle oserver is not easy.
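For instance (again with hypothetical 40-core nodes and 150 model PEs), `--npes_output_server 1 1 2` creates three independent `MpiServer`s of 1, 1, and 2 nodes, so $npes = \lceil 150/40 \rceil \times 40 + (1+1+2) \times 40 = 320$:

```
# Hypothetical example: three independent oservers of 1, 1, and 2 nodes
mpirun -np 320 ./GEOSgcm.x --npes_model 150 --npes_output_server 1 1 2
```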
```
mpirun -np npes executable_file --npes_model n1 --npes_output_server n2 n3 n4 --oserver_type multigroup --npes_backend_pernode n5
```
- The command creates $n2$-node, $n3$-node, and $n4$-node `MultiGroupServer`s.
- The oservers are independent. The client takes turns sending data to the different oservers.
- If each node has $n$ processors, then $npes = \lceil \frac{n1}{n} \rceil \times n + (n2+n3+n4) \times n$.
- The oservers have $n2 \times n5$, $n3 \times n5$, and $n4 \times n5$ backend processes, respectively (see the worked example below).
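With the same hypothetical numbers, adding the multigroup options turns each of the three oservers into a `MultiGroupServer`; with `--npes_backend_pernode 5` they have $1 \times 5$, $1 \times 5$, and $2 \times 5$ backend processes, respectively:

```
# Hypothetical example: three independent MultiGroupServers, 5 backend PEs per node
mpirun -np 320 ./GEOSgcm.x --npes_model 150 --npes_output_server 1 1 2 \
       --oserver_type multigroup --npes_backend_pernode 5
```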
```
mpirun -np npes executable_file --npes_model n1 --npes_output_server n2 --one_node_output true
```
- The option `--one_node_output true` makes it easy to create `n2` oservers, each of which is a one-node oserver.
- It is equivalent to `--nodes_output_server 1 1 1 1 1 ...` with `n2` “1”s (see the example below).
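As an illustration (hypothetical numbers as before), asking for three one-node oservers via `--one_node_output true` is equivalent to `--nodes_output_server 1 1 1`, so $npes = \lceil 150/40 \rceil \times 40 + 3 \times 40 = 280$:

```
# Hypothetical example: three single-node oservers
mpirun -np 280 ./GEOSgcm.x --npes_model 150 --npes_output_server 3 --one_node_output true
```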
```
--fast_oclient true
```
- After the client sends history data to the `Oserver`, by default it waits and makes sure all the data is sent, even though it uses a non-blocking `isend`. If this option is set to true, the client copies the data before the non-blocking `isend`; it then waits and cleans up the copies the next time it re-uses the `Oserver` (see the example below).
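This flag is simply appended to whichever oserver configuration is in use; a hypothetical combination with the multigroup example above would be:

```
# Hypothetical example: multigroup oserver with the copy-based, faster client sends
mpirun -np 240 ./GEOSgcm.x --npes_model 150 --npes_output_server 2 \
       --oserver_type multigroup --npes_backend_pernode 5 --fast_oclient true
```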
For the best performance, users should try different configurations of PFIO for a specific run.
They will generally find that after several trials they will hit a limit where the wall-clock time does not decrease despite adding more resources.
In general, there is a "reasonable" estimated configuration for users to start with.
If you run a model requiring `NUM_MODEL_PES` cores, each node has `NUM_CORES_PER_NODE` cores, and the total number of HISTORY collections is `NUM_HIST_COLLECTION`, then estimates for `O_NODES`, `NPES_BACKEND`, and `TOTAL_PES` can be derived from these quantities. All of the numbers above should be rounded up to integers.
The run command line would look like:

```
mpirun -np TOTAL_PES ./GEOSgcm.x --npes_model NUM_MODEL_PES --nodes_output_server O_NODES --oserver_type multigroup --npes_backend_pernode NPES_BACKEND
```
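As a purely illustrative substitution (not a recommendation), assume 28-core nodes, `NUM_MODEL_PES` = 1512, `O_NODES` = 2, and `NPES_BACKEND` = 5; the total then follows the node-rounding rule above, $\lceil 1512/28 \rceil \times 28 + 2 \times 28 = 1568$:

```
# Hypothetical substitution of the template above
mpirun -np 1568 ./GEOSgcm.x --npes_model 1512 --nodes_output_server 2 \
       --oserver_type multigroup --npes_backend_pernode 5
```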
The file `pfio_MAPL_demo.F90` is a standalone program that demonstrates the use of PFIO. It writes several time records of 2D and 3D arrays. The compilation of the program generates the executable `pfio_MAPL_demo.x`.
If we reserve 2 `haswell` nodes (28 cores each), run the model on 28 cores, and use 1 `MultiGroup` oserver node with 5 backend processes, then the execution command is:

```
mpiexec -np 56 pfio_MAPL_demo.x --npes_model 28 --oserver_type multigroup --nodes_output_server 1 --npes_backend_pernode 5
```

- The frontend has $28-5=23$ processes and the backend has $5$ processes.