-
Notifications
You must be signed in to change notification settings - Fork 19
PFIO: a High Performance Client Server I O Layer
GEOS-5 related applications (such as GEOSgcm
, GEOSctm
, GEOSldas
, GCHP
, etc.) produce a lot of output files that consist of several file collections that are created at different time frequencies.
As the model resolution increases, the amount of data produced significantly grows, and may become overwhelming for the file system especially if one processor is in charge of writing out all files.
PFIO, a subcomponent of the MAPL package was designed to facilitate the production of model output files (organized in collections) in a distributed computing environment. PFIO asynchronously creates output files therefore allowing the model to proceed with calculations without waiting for the I/O to be completed. This leads to a decrease of the overall model integration time.
In the context of GEOS-5, the available nodes (cores) are split into two groups:
- The computing nodes that are reserved for model calculations. The nodes contain cores that are called
Clients
here. - The I/O nodes that are grouped to form the PFIO Server. For reading files, we use the name
Iserver
and when we create outputs, we use insteadOserver
. In this presentation, we will focus only on theOserver
.
All the file collections to be generated by the MAPL HISTORY
(MAPL_History) gridded component are routed through the PFIO server that will distribute the output files to the I/O nodes (based on the user's configuration set at run time).
One of the features of PFIO is that it can be set to run the standard Message Passing Interface (MPI) root processor configuration.
This can be important if the model is integrated at low resolution and/or generates a few file collections.
In this document, we explain when and how to configure the PFIO Server to run on separate resources. We also provide general recommendations on how to properly configure the PFIO Server in order to get the best possible performance. It is important to note that it is up to users to run their application multiple times to determine the optimal PFIO Server configuration.
If users are not aware of the PFIO features and capabilities, they will run their application using MPI processes. The PFIO Server will then run on the same MPI resources as the application. Each time HISTORY is executed, it will not return until the process of writing the data into files (at that particular HISTORY execution) is completed. All the data aggregation and writing is done on the same MPI tasks as the rest of the application. The model calculations cannot proceed until all output procedures for that step are finished. There is no asynchrony or overlap between computations and outputs in this case.
Internally, here are the different PFIO Server steps:
- The
Clients
send the data toOserver
. - All processors in
Oserver
would coordinate to create different shared memory windows for different collections. - The processors use one-sided
MPI_PUT
to send the data to the shared memory. - Different collections are written by different processors. Those writing processors are distributed among nodes as evenly as possible.
- All the other processors have to wait for the writing processors to finish their jobs before responding to
Clients
’ next round of requests.
This configuration of PFIO is suitable when the model runs at low resolutions or if there are a few file collections to produce. If you are for instance running GEOS AGCM at c24/c48/c90 resolution for development purposes with a modest HISTORY output on 2 or 3 nodes, there is no need to dedicate any extra resources for the PFIO Server.
If executable_file
is the executable file, we can issue the regular mpirun
(same for mpiexec
) command:
mpirun -np npes executable_file
where npes
is the number of processors.
In this case, the MpiServer
is used as Oserver
.
The Client
processes are overlapping with Oserver
processes.
The Client
and Oserver
are sequentially working together.
When Client
sends data, it actually makes a copy, then the Oserver
takes over the work,
i.e., shuffling data and writing data to the disk. After MpiServer
is done, the Client
moves on.
For exploiting asynchronous output when using HISTORY, we recommend using the MultiGroupServer
option for the PFIO Server.
With PFIO Server, the model (or application) does not write the data to the disk directly.
Instead the user launches the application on more MPI tasks than is needed for the application.
The extra MPI tasks are dedicated to running the the PFIO Server. When the user chooses the MultiGroupServer
option, the server is itself split into a frontend
and backend
. Only the backend
actually writes to disk.
The frontend
of the server functions as a memory buffer.
When HISTORY decides it is time to write, the data is processed if necessary (regridding for example) to the final form. Then the data is forwarded from the application MPI ranks to the "front end" of the server which is on a different set of MPI ranks. As soon as the data is forwarded the model continues.
Once all the data has been received by the frontend
of the server, the data is forwarded to the backend
on yet a different set of MPI ranks. In the currently implementation each collection to be written is forwarded to a single processor on the backend based on what are available. Note that some may still be writing from the previous write request. That's fine as long as there are still some resources on the backend available. Also note that this implies a collection must fit in a single node memory.
PFIO follows these steps in the execution of the MultiGroupServer
option:
- The
Oserver
is divided into frontend and backend. -
When the frontend receive the data, its root process asks
backend
‘s root (or head) for an idle process for each collection. Then it broadcasts the info to the otherfrontend
processes. - When the
frontend
processors forward (MPI_SEND
) the data to the backend ( different collections to differentbackend
processors), they get back to the clients without waiting for the actual writing.
There are many options to configure the Oserver
.
mpirun -np npes executable_file –npes_model n1 --npes_output_server n2
- Note that
$npes$ is not necessary equal to$n1+n2$ . - The
client
(model) will use the minimum number of nodes that contain$n1$ cores.- For example, if each node has
n
processors, then$npes = \lceil \frac{n1}{n} \rceil \times n + n$ .
- For example, if each node has
- If
--isolate_nodes
is set to false (by default, it is true), theoserver
andclient
can co-exist in the same node, and$npes = n1 + n2$ . -
--npes_output_server n2
can be replaced by--nodes_output_server n2
. Then the$npes = \lceil \frac{n1}{n} \rceil \times n + n2 \times n$ .
mpirun -np npes executable_file –npes_model n1 --npes_output_server n2 --oserver_type multigroup --npes_backend_pernode n3
- For each node of oserver,
$n3$ processes are used as backend. - For example, if each node has
$n$ cores, then$npes = \lceil \frac{n1}{n} \rceil \times n + n2 \times n$ . - The frontend has
$n2 \times (n-n3)$ processes and the backend has$n3 \times n$ processes. - The frontend has
$\lceil \frac{n2}{n} \rceil \times (n-n3)$ processes and the backend has$n3 \times n$ processes.
mpirun -np npes executable_file –npes_model n1 --npes_output_server n2 n3 n4
- The command creates
$n2$ -node,$n3$ -nodes and$n4$ -nodesMpiServer
. - The
oservers
are independent. The client would take turns to send data to differentoservers
. - If each node has
$n$ processors, then$npes = \lceil \frac{n1}{n} \rceil \times n + (n2+n3+n4) \times n$ . -
Advantage: Since the
oservers
are independent, theclient
has the choice to send the data to the idleoserver
. -
Disavantage: Finding an idle
oserver
is not easy.
mpirun -np npes executable_file –npes_model n1 --npes_output_server n2 n3 n4 --oserver_type multigroup --npes_backend_pernode n5
- The command creates
$n2$ -node,$n3$ -nodes and$n4$ -nodesMultiGroupServer
. - The
oservers
are independent. Theclient
would take turns to send data to differentoservers
. - If each node has
$n$ processors, then$npes = \lceil \frac{n1}{n} \rceil \times n + (n2+n3+n4) \times n$ . - Each
oserver
has$n2 \times n5$ ,$n3 \times n5$ , and$n4 \times n5$ backend processes respectively.
mpirun -np npes executable_file –npes_model n1 --npes_output_server n2 --one_node_output true
- The option
--one_node_output true
makes it easy to createn2
oservers and each is one-node oserver. - It is equivalent to
--nodes_output_server 1 1 1 1 1 ...
withn2
“1”s.
--fast_oclient true
- After the client sends history data to the
Oserver
, by default it waits and makes sure all the data is sent even it uses non-blockingisend
. If this option is set to true, the client copies the data before non-blockingisend
. It waits and cleans up the copies next time when it re-uses theOserver
.
For the best performance, users should try different configurations of PFIO for a specific run.
They will generally find that after several trials they will hit a limit where the wall-clock time does not decrease despite adding more resources.
In general, there is a "reasonable" estimated configuration for users to start with.
If you run a model requiring NUM_MODEL_PES
of cores, each node has NUM_CORES_PER_NODE
, the total number of history collections is NUM_HIST_COLLECTION
, then
All above number should round up to an integer.
The run command line would look like
mpirun -np TOTAL_PES ./GEOSgcm.x --npes_model NUM_MODEL_PES --nodes_output_server O_NODES --oserver_type multigroup --npes_backend_pernode NPES_BACKEND
The PFIO source code comes with a standalone test program:
MAPL/Tests/pfio_MAPL_demo.F90
that exercises the features of PFIO.
This program is written to mimic the execution steps of MAPL_Cap and can be used as reference to use PFIO in a non-GEOS application.
It writes several time records of 2D and 3D arrays.
The compilation of the program generates the executable named pfio_MAPL_demo.x
.
If we reserve 2 haswell
nodes (28 cores in each), run the model on 28 cores and use 1 MultiGroup
with 5 backend processes, then the execution command is:
mpiexec -np 56 pfio_MAPL_demo.x --npes_model 28 --oserver_type multigroup --nodes_output_server 1 --npes_backend_pernode 5
- The frontend has
$28-5=23$ processes and the backend has$5$ processes.
We create a collection that contains:
- one 2D variable (
IMxJM
) - one 3D variable (
IMxJMxKM
)
Three (3) 'daily' files are written out and each of them contains six (6) time records. We measure the time to perform the IO operations. Note that no calculations are involved here. We only do the array initialization.
PFIO has a profiling tool which is exercised by passing the command line option: --with_io_profiler true
mpiexec -np 56 $MAPLBIN/pfio_MAPL_demo.x --npes_model 28 --oserver_type multigroup --nodes_output_server 1 --npes_backend_pernode 5 --with_io_profiler true
It returns the following timing statistics:
- Inclusive: all time spent between start and stop of a given timer.
- Exclusive: all time spent between start and stop of a given timer _except_ time spent in any other timers.
-
o_server_front
: -
--wait_message
: Time while the front ends is waiting for the data from application. -
--add_Histcollection
: Time for adding history collections. -
--receive_data
: The total time Frontends receive data from applications. -
----collection_1
: The time Frontends receive collection_1. -
--forward_data
: The total time Frontends forward data to Backend. -
----collection_1
: The time Frontends forward collection_1. -
--clean up
: The time finalizing o-server.
IM=360 JM=181 KM=72
with 5 Backend PEs/node
=============
Name Inclusive % Incl Exclusive % Excl Max Excl Min Excl Max PE Min PE
i_server_client 0.324201 100.00 0.324201 100.00 0.520954 0.245613 0016 0023
Final profile
=============
Name Inclusive % Incl Exclusive % Excl Max Excl Min Excl Max PE Min PE
o_server_front 0.357244 100.00 0.053738 15.04 0.881602 0.013470 0000 0002
--wait_message 0.047207 13.21 0.047207 13.21 0.052244 0.040038 0011 0013
--add_Histcollection 0.003346 0.94 0.003346 0.94 0.005641 0.000294 0002 0007
--receive_data 0.194778 54.52 0.000496 0.14 0.000696 0.000367 0013 0019
----collection_1 0.194282 54.38 0.194282 54.38 0.421234 0.113870 0013 0021
--forward_data 0.057849 16.19 0.017939 5.02 0.051281 0.000058 0020 0018
----collection_1 0.039910 11.17 0.039910 11.17 0.048129 0.030721 0018 0019
--clean up 0.000325 0.09 0.000325 0.09 0.000529 0.000244 0009 0017
In the table below, we report the Inclusive time for the two main IO components as the number of backend PEs per node varies:
Number of Backend PEs/node | i_server_client | o_server_front |
---|---|---|
1 | ||
2 | 1.186932 | 1.813097 |
3 | 0.291334 | 1.216281 |
4 | 0.259511 | 0.296956 |
5 | 0.324201 | 0.357244 |
IM=720 JM=361 KM=72
with 5 Backend PEs/node
=============
Name Inclusive % Incl Exclusive % Excl Max Excl Min Excl Max PE Min PE
i_server_client 1.050624 100.00 1.050624 100.00 1.515223 0.822786 0015 0025
Final profile
=============
Name Inclusive % Incl Exclusive % Excl Max Excl Min Excl Max PE Min PE
o_server_front 1.250806 100.00 0.128693 10.29 2.737311 0.008478 0000 0012
--wait_message 0.108261 8.66 0.108261 8.66 0.130712 0.081595 0008 0022
--add_Histcollection 0.003061 0.24 0.003061 0.24 0.004589 0.001020 0004 0002
--receive_data 0.789012 63.08 0.000642 0.05 0.000909 0.000484 0013 0019
----collection_1 0.788370 63.03 0.788370 63.03 1.568300 0.406615 0013 0021
--forward_data 0.221412 17.70 0.102570 8.20 0.378546 0.000081 0021 0018
----collection_1 0.118842 9.50 0.118842 9.50 0.145169 0.090811 0013 0021
--clean up 0.000367 0.03 0.000367 0.03 0.000552 0.000256 0004 0012
In the table below, we report the Inclusive time for the two main IO components as the number of backend PEs per node varies:
Number of Backend PEs/node | i_server_client | o_server_front |
---|---|---|
1 | ||
2 | 3.378511 | 5.795466 |
3 | 0.977153 | 6.262224 |
4 | 1.009190 | 1.203735 |
5 | 1.050624 | 1.250806 |