
HISTORY Samplers

JulesKouatchou edited this page Feb 7, 2025 · 41 revisions

$\textcolor{red}{\textbf{Introduction}}$

An observing system simulation experiment (OSSE) is a modeling experiment used to evaluate the value of a new observing system when actual observational data are not available. An OSSE system includes a nature run (Atlas, 1997), a data assimilation system (Atlas et al., 2015), and software to simulate “observations” from the nature run and to add realistic observation errors. OSSEs are designed to assess the impact of instruments that do not yet exist on numerical weather prediction (NWP) (Boukabara et al., 2016) and analysis; to make design decisions for a new observing system or network; and to investigate the behavior of data assimilation systems and thereby optimally tune these systems in an environment where the “truth”, and hence the system’s behavior, is known.

Any OSSE activity starts with a realistic representation of nature, typically by means of a high-resolution simulation by a comprehensive Earth system model without assimilation, the so-called Nature Run (NR). These models are run for a period long enough to capture the relevant natural variability, such as the seasonal cycle, and to spin up to a well-equilibrated state. Any OSSE needs a procedure to extract synthetic observations that mimic the distribution of real observations, and the impacts of synthetic data should be equivalent to the corresponding impacts of real observations. The process of simulating the observations amounts to sampling the NR at the appropriate times and locations.

NWP models used in OSSEs generate output on a grid, essentially providing forecast information at specific points in space and time. To obtain data at the locations of interest (as seen by instruments), we can use offline techniques such as Model Output Statistics (MOS) to statistically interpolate the model data to those locations. It is more attractive for models to be able to produce fields at any location and any frequency at runtime, instead of doing it offline.

In recent years, ESMF has incorporated robust parallel and scalable functionality for interpolation and regridding. This has allowed the GEOS model to perform these tasks (interpolation and regridding) on the fly during the model integration. Initially, the GEOS model implemented the ability to read input data files and produce output files of different grid types and resolutions (horizontal and vertical). This work was extended with Sampler, a tool to generate data files at the user's prescribed locations (fixed or dynamic). Sampler is a HISTORY subcomponent that maps gridded model geophysical variables onto observation locations, be they fixed ground stations, aircraft trajectories, or satellite swaths.

With Sampler, the entire HISTORY pipeline can be configured to directly generate any desired GEOS quantity at any static or time-dependent location (or group of locations) of interest (stations, moving-object trajectories, satellite swaths, etc.).

In this document, we describe the different options for Sampler and explain how to use each of them while running the GEOS model.

$\textcolor{red}{\textbf{Types of samplers}}$

We present how each sampler can be exercised and the settings it requires in the HISTORY.rc file.

$\textcolor{blue}{\textbf{Station sampler}}$

The station sampler is used to produce geophysical variables at a set of time-independent geospatial coordinates corresponding to fixed ground stations (for instance, NASA AERONET or NOAA GHCNd land surface stations).

$\textcolor{green}{\textbf{Station sampler: list of stations}}$

The user needs to create a CSV file listing all the stations of interest. Each row should contain at least the following information:

  • station name
  • station latitude
  • station longitude

The user may specify other parameters (such as the station ID) to further describe a station, as long as all the lines have the same number of columns. Currently, the code supports files with any of the following line layouts:

station_id, station_name, station_longitude, station_latitude
station_name, station_id, station_longitude, station_latitude
station_name, station_longitude, station_latitude
station_name, station_latitude, station_longitude

Note

Since the most important parameters are the station name and its position, the source code will be refactored in the future so that the station file can include any number of columns, as long as the key parameters are present in a consistent order.

Here is a sample station file:

List of stations from AERONET
name,lon,lat                                                                                                
Anchorage,-149.9,61.2
Atlanta,-84.4,33.7
Greenbelt,-76.9,39.1
Bismarck,-100.8,46.8

It obeys the line formatting:

station_name, station_longitude, station_latitude
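As an illustration, the sample file above can be parsed with a short script. This is a sketch, not the MAPL implementation; it assumes the `station_name, station_longitude, station_latitude` layout and two header lines (the value `station_skip_line` would take):

```python
# Hypothetical sketch (not the MAPL code): read a station CSV in the
# "station_name, station_longitude, station_latitude" layout.
import csv
import io

SAMPLE = """List of stations from AERONET
name,lon,lat
Anchorage,-149.9,61.2
Atlanta,-84.4,33.7
Greenbelt,-76.9,39.1
Bismarck,-100.8,46.8
"""

def read_stations(text, skip_lines=2):
    """Return a list of (name, lon, lat) tuples, skipping header lines."""
    rows = list(csv.reader(io.StringIO(text)))[skip_lines:]
    return [(name.strip(), float(lon), float(lat)) for name, lon, lat in rows]

stations = read_stations(SAMPLE)
print(stations[0])  # ('Anchorage', -149.9, 61.2)
```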

$\textcolor{green}{\textbf{Station sampler: settings in HISTORY.rc}}$

The HISTORY.rc file settings for the station sampler follow the same syntax as described in the MAPL History Component document. However, specific parameters are required to be able to exercise the station sampler:

  • sampler_spec: A string that needs to be set to 'station' to select a station sampler collection.
  • station_id_file: Full path to the file containing the list of stations and their locations (latitude and longitude in degrees).
  • station_skip_line: An integer specifying the number of header lines to skip at the top of the station file.
  • regrid_method: A string specifying the regridding method (for instance 'BILINEAR', 'CONSERVATIVE') to be used to interpolate the model fields at the different stations.
  COLLECTIONS:                            
  Aeronet                                 
  ::                                                                                                                   
                                          
  Aeronet.sampler_spec: 'station'         
  Aeronet.station_id_file:   FULL_PATH/my_station_file.csv
  Aeronet.station_skip_line:  2           
  Aeronet.template: %y4%m2%d2_%h2%n2.nc4
  Aeronet.format: 'CFIO'                  
  Aeronet.frequency: 001000,  
  Aeronet.duration:  240000,   
  Aeronet.regrid_method:     'BILINEAR' ,
  Aeronet.fields: 'PHIS'       , 'AGCM'       , 'phis'       ,
                  'TROPT'      , 'AGCM'       ,    
                  'TS'         , 'SURFACE'    , 'ts'         , 
                  'TSOIL1'     , 'SURFACE'    ,   
                  'PS'         , 'DYN'        , 'ps'         ,    
                  'Q'          , 'MOIST'      , 'sphu'       ,
::

$\textcolor{blue}{\textbf{Trajectory sampler}}$

The trajectory sampler is used to produce any geophysical variables at time-dependent geospatial points along a defined latitude-longitude path or trajectory (corresponding to tracks of aircraft, balloons, ships, or nadir-viewing spaceborne assets). The goal is to provide a snapshot of atmospheric conditions as an object would experience them while moving along that path.

To exercise the trajectory sampler, users must provide in the HISTORY.rc file at least the following information:

  • A list of names of the trajectories to be considered for outputs.
  • The date/time range to produce outputs along trajectories. This applies to all trajectories.
    • The range is specified through two parameters (beginning and end) in the format YYYY-MM-DDThh:mm:ss.
    • The experiment needs to start within that range, otherwise the code will abort.
    • Outputs along a trajectory will only be written within that range, though the simulation may continue beyond it.
  • The frequency of the outputs.
  • For each trajectory:
    • The full path to a netCDF file template.
      • The code will use the template to point to the actual netCDF file.
      • The netCDF file contains a list of specific geolocated points that the code will use for the trajectory sampler.
    • The list of fields to produce along each provided trajectory. The list is unique to each trajectory.

We now describe how this information is set in the HISTORY.rc file.

$\textcolor{green}{\textbf{Trajectory sampler: settings in HISTORY.rc}}$

To be able to use the trajectory sampler, it is important to set the following parameters in the HISTORY.rc file in any trajectory collection:

  • sampler_spec: A string that needs to be set to 'trajectory' to select a trajectory sampler collection.
  • ObsPlatforms: list of names (two consecutive names separated by a space) of the different observation platforms along which we want to produce outputs.
  • obs_file_begin: date/time (in the format YYYY-MM-DDThh:mm:ss) for the beginning of the observation file. If not provided, the code will use the current date/time and will verify that a trajectory file exists on that specific date/time.
  • obs_file_interval: required parameter (in the format: yymmdd hhmmss) providing the date/time interval between two consecutive observation files.
  • obs_file_end: date/time (in the format YYYY-MM-DDThh:mm:ss) for the end of the observation file. If not provided, the code will use the current date/time plus 14 days.
  • Epoch: integer determining the output frequency in hours/minutes/seconds (in the format hhmmss).
  • regrid_method: A string specifying the regridding method (for instance 'BILINEAR', 'CONSERVATIVE') to be used to interpolate the model fields at the different stations.

Unlike what is customary in the HISTORY.rc file, the list of fields to be written out is not included in the trajectory collection definition. For each trajectory set in the ObsPlatforms parameter, users must provide additional variables to define at least the observation trajectory file template (file_name_template) and the list of fields (geovals_fields) to produce along the defined trajectory. Assuming that obs_traj is one value included in ObsPlatforms, here is a template setting for the corresponding trajectory:

PLATFORM.obs_traj::
  IODA_SCHEMA::
    index_name_x:     Location
    var_name_lon:     MetaData/longitude
    var_name_lat:     MetaData/latitude
    var_name_time:    MetaData/dateTime
    file_name_template:  FULL_PATH/obs_traj.%y4%m2%d2T%h2%n2%S2Z.nc4
  :: 
  GEOVALS_SCHEMA::
    geovals_fields::
      'PHIS'       , 'AGCM'       , 'phis'       ,
      'TROPT'      , 'AGCM'       ,    
      'TS'         , 'SURFACE'    , 'ts'         , 
      'TSOIL1'     , 'SURFACE'    ,   
      'PS'         , 'DYN'        , 'ps'         ,    
      'Q'          , 'MOIST'      , 'sphu'       ,
    ::
  ::
::

Note

Each observation file (derived from file_name_template) needs to contain a list of locations, each identified by a latitude, a longitude, and a timestamp (date/time). The code uses this information to perform the interpolations.
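The date/time tokens appearing in file_name_template (%y4, %m2, %d2, %h2, %n2, %S2) expand to the zero-padded year, month, day, hour, minute, and second. A minimal sketch of this expansion (the helper expand_template is hypothetical, not part of MAPL):

```python
# Hypothetical helper (not MAPL code): expand the HISTORY-style date/time
# tokens used in file_name_template for a given time.
from datetime import datetime

def expand_template(template, when):
    """Substitute %y4/%m2/%d2/%h2/%n2/%S2 tokens with values from `when`."""
    tokens = {
        "%y4": f"{when.year:04d}",
        "%m2": f"{when.month:02d}",
        "%d2": f"{when.day:02d}",
        "%h2": f"{when.hour:02d}",
        "%n2": f"{when.minute:02d}",
        "%S2": f"{when.second:02d}",
    }
    for token, value in tokens.items():
        template = template.replace(token, value)
    return template

name = expand_template("obs_traj.%y4%m2%d2T%h2%n2%S2Z.nc4",
                       datetime(2019, 7, 31, 21, 0, 0))
print(name)  # obs_traj.20190731T210000Z.nc4
```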

Here is a sample HISTORY.rc file:

  COLLECTIONS:                            
  'jedi'                                 
  ::                                                                                                                   
 
  jedi.sampler_spec:        trajectory                                                           
  jedi.ObsPlatforms:         aircraft atms_npp
  jedi.template:             '%y4%m2%d2_%h2%n2z.nc4',
  jedi.format:               'CFIO',
  jedi.obs_file_begin:       2019-07-31T21:00:00
  jedi.obs_file_interval:    '000000 060000'   
  jedi.obs_file_end:         2019-11-01T00:00:00
  jedi.Epoch:                060000          
  jedi.regrid_method:        'BILINEAR' ,
::  

DEFINE_OBS_PLATFORM::                                        

PLATFORM.aircraft::
  IODA_SCHEMA::
    index_name_x:     Location
    var_name_lon:     MetaData/longitude
    var_name_lat:     MetaData/latitude
    var_name_time:    MetaData/dateTime
    file_name_template:  /discover/nobackup/projects/gmao/aist-nr/data/ioda_reshuffle/%y4%m2%d2/geos_atmosphere/aircraft.%y4%m2%d2T%h2%n2%S2Z.nc4
  ::      
  GEOVALS_SCHEMA::
    geovals_fields::
      'PHIS'       , 'AGCM'       , 'phis'       ,
      'TROPT'      , 'AGCM'       ,    
      'TS'         , 'SURFACE'    , 'ts'         , 
    ::    
  ::      
:: 

PLATFORM.atms_npp::
  IODA_SCHEMA::
    index_name_x:     Location
    var_name_lon:     MetaData/longitude
    var_name_lat:     MetaData/latitude
    var_name_time:    MetaData/dateTime
    file_name_template:  /discover/nobackup/projects/gmao/aist-nr/data/ioda_reshuffle/%y4%m2%d2/geos_atmosphere/atms_npp.%y4%m2%d2T%h2%n2%S2Z.nc4
  ::      
  GEOVALS_SCHEMA::
    geovals_fields::
      'TSOIL1'     , 'SURFACE'    ,   
      'PS'         , 'DYN'        , 'ps'         ,    
      'Q'          , 'MOIST'      , 'sphu'       ,
    ::    
  ::      
:: 

$\textcolor{green}{\textbf{Trajectory sampler: understanding the value of Epoch}}$

The parameter Epoch is set (with the format hhmmss) in the collection definition section of the HISTORY.rc file. It determines the output frequency of the trajectory collection. The value of Epoch influences the experiments in many ways:

  • The trajectory collection can only be produced if the starting time of an experiment is later than obs_file_begin minus Epoch. If this condition is not met, the code will crash because no observation file will be available to write data into the trajectory file.
  • The value of Epoch is used by the code to identify the number of observation files available between two consecutive trajectory output times. The code reads those files to collect all the locations (lat/lon) and times, and writes the fields at those locations/times into the trajectory collection.

Note

The experiment is not required to end before obs_file_end. As soon as no observation file is available, the code will stop producing trajectory files.

$\textcolor{green}{\textbf{How trajectory sampler outputs are produced}}$

It is important to understand how the model produces the trajectory sampler outputs. There are four HISTORY.rc trajectory sampler parameters that we need to focus on:

  • obs_file_begin: Beginning date (YYYY-MM-DDThh:mm:ss) of the availability of the observation files. An observation file must exist at this date.
  • obs_file_end: End date (YYYY-MM-DDThh:mm:ss) of the availability of the observation files. An observation file must exist at this date.
  • obs_file_interval: date/time interval (yymmdd hhmmss) between two consecutive observation files. This setting indicates what the code should expect. If an observation file does not exist at the expected date, the code has a mechanism to use the nearest file.
  • Epoch: The model output frequency.

The code takes obs_file_begin, obs_file_end, obs_file_interval, and the observation file template (file_name_template) to identify all the available observation files. It then selects all the files that fall within an Epoch, pulls out all the locations (latitude/longitude/timestamp) from those files, interpolates the model data to those locations, and writes the results to a file. The locations in the output file are ordered by time; an index variable allows users to match points (latitude/longitude) to times.
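The file-selection logic described above can be sketched as follows. This is an illustrative reimplementation, not the MAPL source; the values mirror the sample HISTORY.rc above (obs_file_interval of 6 hours, Epoch of 6 hours):

```python
# Illustrative sketch (assumption, not MAPL code): enumerate the expected
# observation-file times, then select those within one Epoch-long window.
from datetime import datetime, timedelta

def expected_obs_times(begin, end, interval):
    """All observation-file times from obs_file_begin to obs_file_end."""
    times = []
    t = begin
    while t <= end:
        times.append(t)
        t += interval
    return times

def times_in_epoch(times, window_start, epoch):
    """Observation times falling within [window_start, window_start + epoch)."""
    return [t for t in times if window_start <= t < window_start + epoch]

begin = datetime(2019, 7, 31, 21, 0, 0)   # obs_file_begin
end = datetime(2019, 8, 1, 21, 0, 0)      # shortened for the example
interval = timedelta(hours=6)             # obs_file_interval '000000 060000'
epoch = timedelta(hours=6)                # Epoch 060000

selected = times_in_epoch(expected_obs_times(begin, end, interval),
                          datetime(2019, 8, 1, 3, 0, 0), epoch)
print(selected[0].isoformat())  # 2019-08-01T03:00:00
```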

$\textcolor{blue}{\textbf{Swath sampler}}$

The swath sampler is used to produce geophysical variables at time-dependent geospatial coordinates corresponding to the two-dimensional swath of an orbiting instrument. Swaths are typically represented by logically rectangular curvilinear grids that may have higher or lower resolution than the NR. When the swath has lower resolution than the NR, conservative regridding is performed. However, when the observing system has a much higher resolution than the NR, it may be more advantageous to use the masked sampler and perform any necessary interpolation offline.

$\textcolor{green}{\textbf{Swath sampler: settings in HISTORY.rc}}$

There are two groups of parameters in the HISTORY.rc file that need to be properly set to exercise the swath sampler. The first group is within the GRID_LABELS definition of the swath and the second group is for the swath HISTORY collection.

GRID_LABELS definition

  • GRID_TYPE: The grid type, set here to Swath.
  • GRID_FILE: A file template providing the full path to the location of the observation swath file.
  • index_name_lon: Name of the longitude dimension in the observation file.
  • index_name_lat: Name of the latitude dimension in the observation file.
  • var_name_lon: Name of the longitude array in the observation file.
  • var_name_lat: Name of the latitude array in the observation file.
  • tunit: time unit in the format seconds since YYYY-MM-DD hh:mm:ss.
  • obs_file_begin: Beginning date (YYYY-MM-DDThh:mm:ss) of the availability of the observation files. An observation file must exist at this date.
  • obs_file_end: End date (YYYY-MM-DDThh:mm:ss) of the availability of the observation files. An observation file must exist at this date.
  • obs_file_interval: date/time interval (yymmdd hhmmss) between two consecutive observation files. This setting indicates what the code should expect. If an observation file does not exist at the expected date, the code has a mechanism to use the nearest file.
  • Epoch: The model output frequency in the format hhmmss. If not provided or set to 000000, the code will crash.

Swath collection

The settings are the same as in any standard HISTORY.rc collection. Two particular parameters are required and need our attention:

  • observation_spec: A string that needs to be set to 'Swath' to select a swath sampler collection.
  • frequency: the frequency of the swath output file in the format hhmmss. It should be exactly equal to Epoch in the swath grid definition, otherwise the code will crash.

Here is an example of swath sampler HISTORY.rc file:

  COLLECTIONS:
  SNDR_ATMS
  ::
  
  GRID_LABELS: 
  SwathGrid_ATMS
::
  
  
  SwathGrid_ATMS.GRID_TYPE: Swath
  SwathGrid_ATMS.GRID_FILE: /discover/nobackup/projects/gmao/aist-nr/data/SNDR.J1.ATMS.L1B.v02/Y%y4/%D3/SNDR.J1.ATMS.%y4%m2%d2T%h2%n2*.nc
  SwathGrid_ATMS.LM:  2  
  SwathGrid_ATMS.index_name_lon:  xtrack
  SwathGrid_ATMS.index_name_lat:  atrack
  SwathGrid_ATMS.var_name_lon:    lon
  SwathGrid_ATMS.var_name_lat:    lat
  SwathGrid_ATMS.var_name_time:   obs_time_tai93
  SwathGrid_ATMS.tunit:          'seconds since 1993-01-01 00:00:00'
  SwathGrid_ATMS.obs_file_begin: '2019-08-01T00:00:00'
  SwathGrid_ATMS.obs_file_end:   '2019-11-01T23:00:00'
  SwathGrid_ATMS.obs_file_interval:   '000000 000600'     # yymmdd  hhmmss
  SwathGrid_ATMS.Epoch:           010000                  # hhmmss 

  SNDR_ATMS.observation_spec:  'Swath'
  SNDR_ATMS.template:    '%y4%m2%d2_%h2%n2.nc4',
  SNDR_ATMS.format:      'CFIO',
  SNDR_ATMS.frequency:   010000
  SNDR_ATMS.grid_label:   SwathGrid_ATMS,
  SNDR_ATMS.regrid_method: 'BILINEAR' ,
  SNDR_ATMS.fields:     'PHIS'       , 'AGCM'       , 'phis'       ,
                  'TROPT'      , 'AGCM'       ,    
                  'TS'         , 'SURFACE'    , 'ts'         , 
                  'TSOIL1'     , 'SURFACE'    ,   
                  'PS'         , 'DYN'        , 'ps'         ,    
                  'Q'          , 'MOIST'      , 'sphu'       ,
::

$\textcolor{green}{\textbf{Swath sampler: understanding the value of Epoch}}$

The parameter Epoch is set (with the format hhmmss) in the observation grid definition section of the HISTORY.rc file. It determines the output frequency of the swath collection and needs to be equal to the frequency parameter in the swath collection definition. The value of Epoch influences the experiments in many ways.

First, the swath collection can only be produced if the start and end times of an experiment fall between obs_file_begin minus Epoch and obs_file_end. If this condition is not met, the code will crash at some point because an observation file will be unavailable to write data into the swath file.

Second, the value of Epoch determines how many observation files will be used to gather the locations (latitude and longitude pairs) included in the swath output file. The larger Epoch is, the more locations are employed, and the more time and memory the code needs to produce the swath file. It is therefore important to select a value of Epoch that does not slow down the experiments.
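Using the sample swath settings above, a back-of-the-envelope estimate (an assumption for illustration: the number of observation files per output is approximately Epoch divided by obs_file_interval):

```python
# Rough estimate (assumption): observation files gathered per swath output
# is about Epoch / obs_file_interval.
from datetime import timedelta

epoch = timedelta(hours=1)                # Epoch: 010000
obs_file_interval = timedelta(minutes=6)  # obs_file_interval: '000000 000600'

files_per_epoch = epoch // obs_file_interval
print(files_per_epoch)  # 10
```

Doubling Epoch to 020000 would roughly double the number of files read, and hence the locations, time, and memory needed per swath output.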

Warning

The value of Epoch should be the same as that of frequency. The larger Epoch is, the longer it takes to create the swath output file.

$\textcolor{blue}{\textbf{Masked sampler}}$

A masked sampler is used when the observing system has a much higher resolution than the NR. In this case, gridded geophysical variables are masked in such a way that values are preserved at the grid-points that have been visited by the satellite, possibly with the addition of a “halo” to aid offline interpolation, while all other grid-points receive a constant undefined value. These gridded fields can be efficiently output using the internal compression algorithms available in most modern formats (e.g., NetCDF-4, HDF-5), or alternatively using a sparse storage scheme.

To exercise this sampler, users must provide a netCDF observation file that contains the locations to be masked. The task of the code is to identify the non-masked grid-points that correspond to the locations visited by a moving object. The files produced in this collection will contain surface values of the selected fields at the non-masked locations.
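The masking step can be pictured with a small sketch. This is a conceptual reimplementation, not the MAPL code; UNDEF and mask_field are illustrative names:

```python
# Conceptual sketch (assumption, not MAPL code): keep field values at
# grid-points visited by the satellite and fill all others with a constant
# undefined value, ready for compressed or sparse output.
import numpy as np

UNDEF = 1.0e15  # illustrative undefined fill value

def mask_field(field, visited, undef=UNDEF):
    """Return a copy of `field` with non-visited grid-points set to `undef`."""
    return np.where(visited, field, undef)

field = np.arange(12.0).reshape(3, 4)   # toy 3x4 gridded variable
visited = np.zeros((3, 4), dtype=bool)
visited[1, 1:3] = True                  # pretend the satellite saw two points

masked = mask_field(field, visited)
print(masked[1, 1], masked[0, 0])  # 5.0 1e+15
```

A mostly-constant array like `masked` compresses very well with the deflate filters of NetCDF-4/HDF-5, which is what makes this approach practical at full NR resolution.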

Note

The masked sampler applies only to geostationary satellites. Each masked sampler collection is based on the view of a specific satellite.

Key variables for a masked sampler collection

  • sampler_spec: A string that needs to be set to 'mask' to select a masked sampler collection.
  • obs_files: Full path to the netCDF file containing the masked locations.
  • obs_file_begin: Beginning date (YYYY-MM-DDThh:mm:ss) of the availability of the observation files. An observation file must exist at this date.
  • obs_file_end: End date (YYYY-MM-DDThh:mm:ss) of the availability of the observation files. An observation file must exist at this date.
  • obs_file_interval: date/time interval (yymmdd hhmmss) between two consecutive observation files. This setting indicates what the code should expect. If an observation file does not exist at the expected date, the code has a mechanism to use the nearest file.
  • index_name_x: Name of the longitude dimension in the observation file (default is x).
  • index_name_y: Name of the latitude dimension in the observation file (default is y).
  • var_name_x: Name of the longitude array in the observation file (default is x).
  • var_name_y: Name of the latitude array in the observation file (default is y).
  • var_name_proj: Name of the variable providing map projection information.
  • att_name_proj: Attribute for the longitude origin in the map projection.
  • thin_factor: Integer value used to reduce regridding matrix size (default is -1).

Sample HISTORY.rc file

  COLLECTIONS:
   ABI_M6C01_Mask
  ::  

  ABI_M6C01_Mask.sampler_spec:        'mask'          
  ABI_M6C01_Mask.obs_file_begin:      '2019-07-31T00:00:00'
  ABI_M6C01_Mask.obs_file_end:        '2019-11-01T00:00:00'
  ABI_M6C01_Mask.obs_file_interval:   '000000 001000'
  ABI_M6C01_Mask.obs_files:           /discover/nobackup/projects/gmao/aist-nr/data/GOES-X/OR_ABI-L1b-RadF-M6C01_G16_s20192340840216_e20192340849524_c20192340849582.nc
  ABI_M6C01_Mask.index_name_x:         x
  ABI_M6C01_Mask.index_name_y:         y
  ABI_M6C01_Mask.var_name_x:           x
  ABI_M6C01_Mask.var_name_y:           y
  ABI_M6C01_Mask.var_name_proj:        goes_imager_projection
  ABI_M6C01_Mask.att_name_proj:        longitude_of_projection_origin
  ABI_M6C01_Mask.thin_factor:          100
  ABI_M6C01_Mask.template:             '%y4%m2%d2_%h2%n2.nc4',
  ABI_M6C01_Mask.format:               'CFIO',
  ABI_M6C01_Mask.frequency:            001000,
  ABI_M6C01_Mask.duration:             001000,
  ABI_M6C01_Mask.regrid_method:        'BILINEAR' ,
  ABI_M6C01_Mask.fields:       'PHIS'       , 'AGCM'       , 'phis'       ,
                  'TROPT'      , 'AGCM'       ,    
                  'TS'         , 'SURFACE'    , 'ts'         , 
                  'TSOIL1'     , 'SURFACE'    ,   
                  'PS'         , 'DYN'        , 'ps'         ,    
                  'Q'          , 'MOIST'      , 'sphu'       ,
::

Note

There is no restriction on the start and end times of the experiment. The code will produce files for the masked collection during the entire duration of the experiment, using the settings of frequency and duration. The code uses the masked locations in the obs_files file to select the grid-points where the field values will be written out as a two-dimensional array (time, location).

$\textcolor{red}{\textbf{References}}$

  • Atlas, R., 1997: Atmospheric observations and experiments to assess their usefulness in data assimilation. J. Meteor. Soc. Japan, 75, 111–130, https://doi.org/10.2151/jmsj1965.75.1B_111.
  • Atlas, R., L. Bucci, B. Annane, R. Hoffman, and S. Murillo, 2015: Observing system simulation experiments to assess the potential impact of new observing systems on hurricane forecasting. Mar. Technol. Soc. J., 49, 140–148, https://doi.org/10.4031/MTSJ.49.6.3.
  • Boukabara, S. A., and Coauthors, 2016: Community Global Observing System Simulation Experiment (OSSE) Package (CGOP): Description and usage. J. Atmos. Oceanic Technol., 33, 1759–1777, https://doi.org/10.1175/JTECH-D-16-0012.1.