
PFIO: a High Performance Client Server I/O Layer

JulesKouatchou edited this page Sep 16, 2022 · 24 revisions


GEOS-5 related applications (such as GEOSgcm, GEOSctm, GEOSldas, GCHP, etc.) produce a large number of output files, organized into several file collections that are written at different time frequencies. As the model resolution increases, the amount of data produced grows significantly, straining the file system, especially if a single processor is in charge of writing all the files.

PFIO, a subcomponent of the MAPL package, was designed to facilitate the production of model output files (organized in collections) in a distributed computing environment. PFIO creates output files asynchronously, allowing the model to proceed with calculations without waiting for the I/O to complete. This decreases the overall model integration time.

In the context of GEOS-5, the available nodes (cores) are split into two groups:

  • The computing nodes that are reserved for model calculations. The nodes contain cores that are called Clients here.
  • The I/O nodes that are grouped to form the PFIO Server. For reading files, we use the name Iserver and when we create outputs, we use instead Oserver. In this presentation, we will focus only on the Oserver.

All the file collections to be generated by the MAPL HISTORY (MAPL_History) subcomponent are routed through the PFIO server, which distributes the output files to the I/O nodes (based on the user's configuration set at run time). PFIO can also be set to run in the standard Message Passing Interface (MPI) root processor configuration. This can be important if the model is integrated at low resolution and/or generates only a few file collections.

In this document, we explain how and when to configure the PFIO Server to run on separate resources. We also provide general recommendations on how to properly configure the PFIO Server in order to get the best possible performance. It is important to note that it is up to users to run their application multiple times to determine the optimal PFIO Server configuration.

Types of Oserver

Simple Server or MpiServer Class

If users are not aware of the PFIO features and capabilities, they will simply run their application on its MPI processes. The PFIO Server will then run on the same MPI resources as the application. Each time HISTORY is executed, it will not return until the process of writing the data into files (at that particular HISTORY execution) is completed. All the data aggregation and writing is done on the same MPI tasks as the rest of the application. The model calculations cannot proceed until all output procedures for that step are finished. There is no asynchronicity or overlap between computations and outputs in this case.

Internally, here are the different PFIO Server steps:

  • The Clients send the data to Oserver.
  • All processors in the Oserver coordinate to create different shared memory windows for different collections.
  • The processors use one-sided MPI_PUT to send the data to the shared memory.
  • Different collections are written by different processors. Those writing processors are distributed among nodes as evenly as possible.
  • All the other processors have to wait for the writing processors to finish their jobs before responding to Clients’ next round of requests.

This configuration of PFIO is suitable when the model runs at low resolution or when there are only a few file collections to produce. If, for instance, you are running GEOS AGCM at c24/c48/c90 resolution for development purposes with a modest HISTORY output on 2 or 3 nodes, there is no need to dedicate any extra resources to the PFIO Server.


Command Line

If executable_file is the executable file, we can issue the regular mpirun (same for mpiexec) command:

    mpirun -np npes executable_file

where npes is the number of processors. In this case, the MpiServer is used as the Oserver, and the Client processes overlap with the Oserver processes. The Client and Oserver work sequentially: when the Client sends data, it actually makes a copy; the Oserver then takes over, shuffling the data and writing it to disk. After the MpiServer is done, the Client moves on.

MultiGroupServer Class

To exploit asynchronous output when using HISTORY, we recommend using the MultiGroupServer option for the PFIO Server. With the PFIO Server, the model (or application) does not write the data to disk directly. Instead, the user launches the application on more MPI tasks than are needed for the application. The extra MPI tasks are dedicated to running the PFIO Server. When the user chooses the MultiGroupServer option, the server is itself split into a frontend and a backend. Only the backend actually writes to disk.

The frontend of the server functions as a memory buffer. When HISTORY decides it is time to write, the data is processed if necessary (regridding, for example) into its final form. Then the data is forwarded from the application MPI ranks to the frontend of the server, which is on a different set of MPI ranks. As soon as the data is forwarded, the model continues.

Once all the data has been received by the frontend of the server, the data is forwarded to the backend on yet another set of MPI ranks. In the current implementation, each collection to be written is forwarded to a single backend processor, chosen among those that are available. Note that some backend processors may still be writing from the previous write request. That is fine as long as some resources are still available on the backend. Also note that this implies a collection must fit in the memory of a single node.

PFIO follows these steps in the execution of the MultiGroupServer option:

  • The Oserver is divided into frontend and backend.
  • When the frontend receives the data, its root process asks the backend's root (or head) for an idle process for each collection. It then broadcasts this information to the other frontend processes.
  • When the frontend processors forward (MPI_SEND) the data to the backend (different collections to different backend processors), they return to the clients without waiting for the actual writing.
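The difference between the blocking simple server and the overlapped MultiGroupServer can be illustrated with a toy timing model. This is only a sketch under strong assumptions that are not taken from PFIO itself: every step has the same compute and write cost, a single writer handles one write at a time, the frontend buffer is unbounded, and the function names are hypothetical.

```python
def blocking_wall_time(n_steps, compute, write):
    """Simple server: each step pays for the compute and then for the write."""
    return n_steps * (compute + write)


def overlapped_wall_time(n_steps, compute, write, forward=0.0):
    """Toy model of asynchronous output with one writer and unbounded buffering.

    The model only pays `forward` (the cost of handing data to the frontend)
    at each step; the write itself proceeds in the background.
    """
    model_clock = 0.0     # time at which the model finishes its current step
    writer_free_at = 0.0  # time at which the single writer becomes idle
    for _ in range(n_steps):
        model_clock += compute + forward
        # The write starts once the data exists and the writer is idle.
        write_start = max(model_clock, writer_free_at)
        writer_free_at = write_start + write
    # The run ends when both the model and the last write are finished.
    return max(model_clock, writer_free_at)


if __name__ == "__main__":
    print(blocking_wall_time(10, 5.0, 2.0))    # 70.0
    print(overlapped_wall_time(10, 5.0, 2.0))  # 52.0
```

In this toy setting only the last write is exposed in the overlapped case, which is the effect the MultiGroupServer aims for. Real runs also depend on buffer sizes, network cost and file system behavior.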



Command Line

There are many options to configure the Oserver.

n1 processes for the model and n2 processes for the MpiServer

    mpirun -np npes executable_file --npes_model n1 --npes_output_server n2
  • Note that $npes$ is not necessarily equal to $n1+n2$.
  • The client (model) will use the minimum number of nodes that contain $n1$ cores.
    • For example, if each node has n processors, then $npes = \lceil \frac{n1}{n} \rceil \times n + n$.
  • If --isolate_nodes is set to false (by default, it is true), the oserver and client can co-exist on the same node, and $npes = n1 + n2$.
  • --npes_output_server n2 can be replaced by --nodes_output_server n2. Then $npes = \lceil \frac{n1}{n} \rceil \times n + n2 \times n$.
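The process counts in the bullets above can be tallied with a small helper. The following is a minimal sketch, not part of MAPL: the function names are hypothetical, and the arguments simply mirror the symbols $n1$, $n2$ and $n$. It covers the two cases whose formulas are fully spelled out above, namely --nodes_output_server n2 with isolated nodes, and --isolate_nodes false.

```python
import math


def total_pes_isolated(n1, server_nodes, cores_per_node):
    """npes when the model and the oserver occupy separate, whole nodes.

    n1             -- number of model processes (--npes_model)
    server_nodes   -- number of oserver nodes (--nodes_output_server)
    cores_per_node -- number of cores per node (n in the text)
    """
    model_nodes = math.ceil(n1 / cores_per_node)
    return (model_nodes + server_nodes) * cores_per_node


def total_pes_shared(n1, n2):
    """npes when --isolate_nodes is false: processes are simply added."""
    return n1 + n2


if __name__ == "__main__":
    # 96 model processes on 28-core nodes need 4 nodes, plus 1 oserver node.
    print(total_pes_isolated(96, 1, 28))  # 140
    print(total_pes_shared(96, 12))       # 108
```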

n1 processes for the model and n2 processes for the MultiGroupServer

    mpirun -np npes executable_file --npes_model n1 --npes_output_server n2 --oserver_type multigroup --npes_backend_pernode n3
  • For each node of oserver, $n3$ processes are used as backend.
  • For example, if each node has $n$ cores, then $npes = \lceil \frac{n1}{n} \rceil \times n + n2 \times n$.
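  • With $n2$ oserver nodes, as in the formula above, the frontend has $n2 \times (n-n3)$ processes and the backend has $n2 \times n3$ processes.
  • If $n2$ is instead read as a number of processes, the oserver occupies $\lceil \frac{n2}{n} \rceil$ nodes, so the frontend has $\lceil \frac{n2}{n} \rceil \times (n-n3)$ processes and the backend has $\lceil \frac{n2}{n} \rceil \times n3$ processes.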
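The split of a MultiGroupServer into frontend and backend processes can be sketched as follows. This is a hypothetical helper, not a MAPL routine. It assumes the backend count is per oserver node, as the name --npes_backend_pernode suggests and as the standalone example at the end of this page shows (one 28-core node with 5 backend processes leaves 23 frontend processes).

```python
def multigroup_split(server_nodes, cores_per_node, backend_per_node):
    """Return (frontend, backend) process counts for a MultiGroupServer.

    Each oserver node dedicates `backend_per_node` cores to the backend and
    the remaining cores to the frontend (assumption stated in the text).
    """
    if not 0 < backend_per_node < cores_per_node:
        raise ValueError("backend_per_node must be between 1 and cores_per_node - 1")
    backend = server_nodes * backend_per_node
    frontend = server_nodes * (cores_per_node - backend_per_node)
    return frontend, backend


if __name__ == "__main__":
    # Two 28-core oserver nodes with 5 backend processes on each node.
    print(multigroup_split(2, 28, 5))  # (46, 10)
```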

Passing a vector of oservers

    mpirun -np npes executable_file --npes_model n1 --npes_output_server n2 n3 n4
  • The command creates three MpiServers with $n2$, $n3$ and $n4$ nodes.
  • The oservers are independent. The client takes turns sending data to the different oservers.
  • If each node has $n$ processors, then $npes = \lceil \frac{n1}{n} \rceil \times n + (n2+n3+n4) \times n$.
  • Advantage: Since the oservers are independent, the client has the choice to send the data to the idle oserver.
  • Disadvantage: Finding an idle oserver is not easy.

Passing a vector of oservers and the MultiGroupServer

    mpirun -np npes executable_file --npes_model n1 --npes_output_server n2 n3 n4 --oserver_type multigroup --npes_backend_pernode n5
  • The command creates three MultiGroupServers with $n2$, $n3$ and $n4$ nodes.
  • The oservers are independent. The client takes turns sending data to the different oservers.
  • If each node has $n$ processors, then $npes = \lceil \frac{n1}{n} \rceil \times n + (n2+n3+n4) \times n$.
  • Each oserver has $n2 \times n5$, $n3 \times n5$, and $n4 \times n5$ backend processes respectively.
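The bookkeeping for a vector of oservers can be sketched in the same way. The helper below is hypothetical and only restates the two formulas in the list above: the total process count, and the number of backend processes in each oserver.

```python
import math


def vector_oserver_layout(n1, server_nodes, cores_per_node, backend_per_node):
    """Return (npes, backend_counts) for several independent MultiGroupServers.

    n1               -- number of model processes
    server_nodes     -- node count of each oserver, e.g. (n2, n3, n4)
    cores_per_node   -- number of cores per node (n in the text)
    backend_per_node -- backend processes on each oserver node (n5 in the text)
    """
    model_pes = math.ceil(n1 / cores_per_node) * cores_per_node
    npes = model_pes + sum(server_nodes) * cores_per_node
    backend_counts = [nodes * backend_per_node for nodes in server_nodes]
    return npes, backend_counts


if __name__ == "__main__":
    # 56 model processes, oservers of 1, 2 and 1 nodes, 28-core nodes,
    # 4 backend processes per oserver node.
    print(vector_oserver_layout(56, (1, 2, 1), 28, 4))  # (168, [4, 8, 4])
```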

MpiServer using one-sided MPI_PUT and shared memory

    mpirun -np npes executable_file --npes_model n1 --npes_output_server n2 --one_node_output true
  • The option --one_node_output true makes it easy to create n2 oservers, each occupying a single node.
  • It is equivalent to --nodes_output_server 1 1 1 1 1 ... with n2 “1”s.
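The equivalence stated above can be written out mechanically. The snippet below is a small illustrative helper (the function name is hypothetical) that builds the explicit --nodes_output_server argument for a given n2.

```python
def expand_one_node_output(n2):
    """Build the explicit form: n2 one-node oservers, i.e. n2 copies of "1"."""
    if n2 < 1:
        raise ValueError("n2 must be at least 1")
    return "--nodes_output_server " + " ".join(["1"] * n2)


if __name__ == "__main__":
    print(expand_one_node_output(3))  # --nodes_output_server 1 1 1
```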

Additional Options

--fast_oclient true

  • After the client sends history data to the Oserver, by default it waits to make sure all the data has been sent, even though it uses a non-blocking isend. If this option is set to true, the client copies the data before the non-blocking isend. It then waits and cleans up the copies the next time it re-uses the Oserver.


For the best performance, users should try different configurations of PFIO for a specific run. They will generally find that, after several trials, they hit a limit where the wall-clock time no longer decreases despite adding more resources. In general, there is a "reasonable" estimated configuration for users to start with. If you run a model requiring NUM_MODEL_PES cores, each node has NUM_CORES_PER_NODE cores, and the total number of history collections is NUM_HIST_COLLECTION, then


All of the above numbers should be rounded up to an integer.

The run command line would look like

    mpirun -np TOTAL_PES ./GEOSgcm.x --npes_model NUM_MODEL_PES --nodes_output_server O_NODES --oserver_type multigroup --npes_backend_pernode NPES_BACKEND
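The command line above can be assembled programmatically once the server size has been chosen. The sketch below is a hypothetical helper, not a MAPL tool. It takes O_NODES and NPES_BACKEND as given inputs and assumes that TOTAL_PES follows the node-based formula used earlier on this page, $\lceil \frac{n1}{n} \rceil \times n + n2 \times n$.

```python
import math


def pfio_run_command(executable, num_model_pes, o_nodes, npes_backend,
                     num_cores_per_node, launcher="mpirun"):
    """Return the launch command for a MultiGroupServer run as a string."""
    # Whole nodes are reserved for the model, then for the output server.
    model_nodes = math.ceil(num_model_pes / num_cores_per_node)
    total_pes = (model_nodes + o_nodes) * num_cores_per_node
    args = [
        launcher, "-np", str(total_pes), executable,
        "--npes_model", str(num_model_pes),
        "--nodes_output_server", str(o_nodes),
        "--oserver_type", "multigroup",
        "--npes_backend_pernode", str(npes_backend),
    ]
    return " ".join(args)


if __name__ == "__main__":
    # 96 model processes on 28-core nodes, 1 oserver node, 5 backend processes.
    print(pfio_run_command("./GEOSgcm.x", 96, 1, 5, 28))
```

The helper only formats the command. Choosing good values for the server size still requires the trial runs described above.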

Example: Exercising PFIO in a Standalone Code

The file pfio_MAPL_demo.F90 is a standalone program that implements the use of PFIO. It writes several time records of 2D and 3D arrays. Compiling the program generates the executable pfio_MAPL_demo.x. If we reserve 2 Haswell nodes (28 cores each), run the model on 28 cores, and use 1 MultiGroupServer node with 5 backend processes, then the execution command is:

    mpiexec -np 56 pfio_MAPL_demo.x --npes_model 28 --oserver_type multigroup --nodes_output_server 1 --npes_backend_pernode 5
  • The frontend has $28-5=23$ processes and the backend has $5$ processes.
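As a check, the numbers in this example can be worked out from the node-based formula given earlier, with $n1 = 28$ model processes, $n = 28$ cores per node, one oserver node and 5 backend processes per node:

```latex
npes = \left\lceil \frac{28}{28} \right\rceil \times 28 + 1 \times 28 = 56,
\qquad
\text{frontend} = 1 \times (28 - 5) = 23,
\qquad
\text{backend} = 1 \times 5 = 5 .
```

The total of 56 matches the -np 56 passed to mpiexec: 28 processes for the model, 23 for the frontend and 5 for the backend.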