This repository was archived by the owner on Oct 30, 2025. It is now read-only.
Merged
30 changes: 22 additions & 8 deletions README.md
@@ -1,11 +1,25 @@
# global-eagle
# Welcome to Eagle!

Go to examples folder to run the full `ufs2arco + Anemoi + wxvx` pipeline, which includes:
1) Use `ufs2arco` to create training and validation datasets
2) Use `anemoi-core` modules to train a graph-based model
3) Use `anemoi-inference` to run inference
4) Use `wxvx` to verify forecasts
This repository contains the configurations needed to complete a machine learning pipeline for weather prediction! Its directories will guide you through the entire ML pipeline.

See the examples folder to use 1 year of NOAA Replay Reanalysis data to train a model for 1,000 steps, create a forecast, and verify that forecast.
The key steps include:
1) Data preprocessing using ufs2arco to create training, validation, and test datasets
2) Model training using anemoi-core modules to train a graph-based model
3) Creating a forecast with anemoi-inference, run from a model checkpoint
4) Verifying your forecast (or multiple!) with wxvx against gridded analyses or observations

NOTE: This repository is currently under development.
For more information about each step, please see our documentation: https://global-eagle.readthedocs.io/en/latest/
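As a rough sketch, the steps above boil down to a handful of job submissions. The script names and paths here are assumptions patterned on this repository's directories; see the examples folder for the real, cluster-specific versions:

```shell
# Rough end-to-end sketch of the pipeline; script names/paths are assumptions.
sbatch data/submit_grids.sh            # static grids used for regridding
sbatch data/submit_gfs.sh              # 1) build training/validation/test datasets
sbatch training/submit_training.sh     # 2) train a graph-based model
sbatch inference/submit_inference.sh   # 3) forecast from a model checkpoint
wxvx -c validation/wxvx_lam.yaml -t plots   # 4) verify the forecast
```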

---------------------

Acknowledgments:

ufs2arco: Tim Smith (NOAA Physical Sciences Laboratory)
- https://github.com/NOAA-PSL/ufs2arco

Anemoi: European Centre for Medium-Range Weather Forecasts
- https://github.com/ecmwf/anemoi-core
- https://github.com/ecmwf/anemoi-inference

wxvx: Paul Madden (NOAA Global Systems Laboratory/Cooperative Institute for Research In Environmental Sciences)
- https://github.com/maddenp-cu/wxvx
35 changes: 26 additions & 9 deletions nested_eagle/README.md
@@ -1,14 +1,31 @@
conda
There are two folders within this `nested_eagle` directory:

1) `scientific_workflow` will guide you through a whole ML pipeline to create training data, train a model, run inference, and verify a forecast.
2) `operational_inference` provides scripts to run inference from a checkpoint in near real time. These scripts assume you have a checkpoint from completing `scientific_workflow` and want to run a near real time forecast with it.

----------

Before starting anything, you must create two conda environments.
1) `eagle` environment to use for data creation, training, and inference
2) `wxvx` environment to use for verification

These environments have already been created for you on Ursa and can be found by running `source /scratch4/NAGAPE/epic/role-epic/miniconda/bin/activate`

Then, simply activate the environments by running `conda activate eagle` or `conda activate wxvx`.

----------

To create the necessary environments yourself, run the following commands:

`eagle` environment to use for data creation, training, and inference:
```
conda create -n eagle python=3.11
conda activate eagle
module load cuda gcc openmpi
conda install -c conda-forge xesmf
conda install -c conda-forge ufs2arco=0.17.1
conda install -c conda-forge matplotlib cartopy cmocean
pip install 'torch<2.7' anemoi-datasets==0.5.26 anemoi-graphs==0.6.4 anemoi-models==0.9.2 anemoi-training==0.6.2 anemoi-inference==0.7.1 anemoi-utils==0.4.35 anemoi-transform==0.1.16
conda env create -f environment.yaml
conda activate eagle
pip install 'flash-attn<2.8' --no-build-isolation
pip install eagle-tools
conda install -c conda-forge esmf=8.7.0=nompi*
```

`wxvx` environment to use for verification:
```
conda create -y -n wxvx -c ufs-community -c paul.madden wxvx -c conda-forge --override-channels
```
4 changes: 2 additions & 2 deletions environment.yaml → nested_eagle/environment.yaml
@@ -4,7 +4,8 @@ channels:
- defaults
dependencies:
- python=3.11
- ufs2arco
- ufs2arco=0.17.1
- "esmf=8.7.0=nompi*"
- pip
- pip:
- torch<2.7
@@ -15,5 +16,4 @@ dependencies:
- anemoi-inference==0.7.1
- anemoi-utils==0.4.35
- anemoi-transform==0.1.16
- flash-attn<2.8 --no-build-isolation
- eagle-tools
16 changes: 10 additions & 6 deletions nested_eagle/scientific_workflow/data/README.md
@@ -1,9 +1,13 @@
Run `sbatch submit_grids.sh` first.
Run `sbatch submit_grids.sh`
- This creates some static grid files that will be used for regridding later in the pipeline.
- Once this has completed you can move onto dataset creations.
- Note: this creates static files that you can reuse, so if you run through this pipeline multiple times you may not need to re-run this step.
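
One way to make that reuse concrete is a small guard in a driver script; the `grids/*.nc` pattern below is an assumption, so match it to wherever your grid files actually land:

```shell
# Skip the grids job when its outputs already exist (file pattern is an assumption).
if ls grids/*.nc >/dev/null 2>&1; then
  echo "grid files found; skipping submit_grids.sh"
else
  sbatch submit_grids.sh
fi
```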

Run `sbatch submit_gfs.sh` next.

Run `sbatch submit_hrrr.sh`.
Run `sbatch submit_gfs.sh` followed by `sbatch submit_hrrr.sh`
- You can run both of these at the same time.
- One loads GFS data and the other loads HRRR data.
- Ideally, we would submit these together in one job. However, we are currently restricted to 4 cores per job on Ursa, so submitting them separately makes the whole process go a bit faster.

Ideally, we will just submit these together in one job. We are restricted to 4 cores per job on Ursa at the moment, so this makes the whole process go a bit faster.

Additonally, we are only pulling in ~2 years of data. This takes about 10 hours to run. The maximum time for a service job on Ursa is 24 hours, so we are unable to pull in the fully archive (~approx 10 years). Once we (hopefully) can use more cores or run jobs for longer periods of time we can update this workflow to include all data, as its just a simple change in the yaml files.
Note:
We are only loading ~2 years of data right now. This takes about 10 hours to run. The maximum time for a service job on Ursa is 24 hours, so we are unable to pull in the full archive (approximately 10 years). Once we (hopefully) can use more cores or run jobs for longer periods of time, we can update this workflow to include all data, as it's just a simple change in the yaml files.
2 changes: 1 addition & 1 deletion nested_eagle/scientific_workflow/data/submit_grids.sh
@@ -8,7 +8,7 @@
#SBATCH --mem=128g
#SBATCH -t 01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks=1

source /scratch4/NAGAPE/epic/role-epic/miniconda/bin/activate
conda activate eagle
8 changes: 4 additions & 4 deletions nested_eagle/scientific_workflow/inference/README.md
@@ -1,9 +1,9 @@
First, modify the checkpoint path in `inference_config.yaml` with the path to your checkpoint data from the training step. If you have used the defaults, this will just be changing the run ID to the one noted during that step.

Then, modify `submit_inference.sh` with your project account and the path to your miniconda installation.

Finally run the following to submit a job to create a 10-day forecast:
Run the following to submit a job to create a 10-day forecast:

`sbatch submit_inference.sh`

This will generate a NetCDF file with your forecast in the `inference_files` directory.

Note:
Within `inference_config.yaml` you will find a path to a checkpoint. The submit script updates that for you. However, if you have trained multiple models, you may need to edit this yourself to select the specific run_id whose checkpoint you wish to use.
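
If you do need to repoint the config yourself, a one-line in-place edit can swap the run_id. Note that the `checkpoint:` key and the `outputs/<run_id>/...` path layout below are assumptions about how `inference_config.yaml` is structured:

```shell
# Swap the run_id inside the checkpoint path (config layout is an assumption).
run_id="cf574663-cfa7-4ff2-aafd-37fb5af6bef5"   # the run you want to use
sed -i "s|\(checkpoint:.*outputs/\)[^/]*|\1${run_id}|" inference_config.yaml
```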
19 changes: 14 additions & 5 deletions nested_eagle/scientific_workflow/training/README.md
@@ -1,9 +1,18 @@
Update path to your miniconda in `submit_training.sh`
Run `sbatch submit_training.sh`

Run:
`sbatch submit_training.sh`
After submission, go into the `outputs/` folder to monitor training. You will see:

Feel free to just let this run until you get a checkpoint saved out and then cancel. For the purposes of getting this workflow finished we just need a checkpoint to move onto the next step.
Logs
- Can be found within a folder named with the date of your run (e.g. `2025-10-22`)

Checkpoints will be saved in `outputs/` folder during training.
Checkpoints
- Can be found within a folder matching the run_id of your training. It will look something like `cf574663-cfa7-4ff2-aafd-37fb5af6bef5`

Plots
- The plots folder will also contain run_id folders.
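
If you lose track of which run_id folder holds your newest checkpoint, a quick search helps; the `*.ckpt` extension here is an assumption about how the checkpoint files are named:

```shell
# Print the most recently written checkpoint under outputs/ (.ckpt is an assumption).
find outputs -name "*.ckpt" -printf "%T@ %p\n" | sort -nr | head -1 | cut -d" " -f2-
```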

-----------

TODOs
- We are currently not using multiple GPUs and need to implement that.
- Add configurations for other types of models and graphs that you can try.
2 changes: 0 additions & 2 deletions nested_eagle/scientific_workflow/training/submit_training.sh
@@ -19,7 +19,5 @@ module load cuda
module load gcc
export SLURM_GPUS_PER_NODE=1
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
#srun anemoi-training train --config-name=config


anemoi-training train --config-name=config
12 changes: 4 additions & 8 deletions nested_eagle/scientific_workflow/validation/README.md
@@ -1,11 +1,7 @@
Run postprocessing script in your eagle conda env:
Run postprocessing script in your `eagle` conda env:
`python postprocess.py`

Next, install wxvx:
`conda create -y -n wxvx -c ufs-community -c paul.madden wxvx -c conda-forge --override-channels`

Activate:
`conda activate wxvx`

Run:
After post-processing is complete, run:
`wxvx -c wxvx_lam.yaml -t plots`

Now go to `run/plots/` and open some plots showing RMSE and ME!
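
For intuition about what those plots show: ME (mean error) and RMSE are just the mean and root-mean-square of the forecast-minus-analysis differences. A toy computation with made-up numbers (not wxvx's actual implementation):

```shell
# Toy ME/RMSE over paired forecast/analysis values (numbers are made up).
printf "1.0 0.5\n2.0 2.5\n3.0 3.0\n" | awk '
  { d = $1 - $2; e += d; se += d * d; n++ }
  END { printf "ME=%.4f RMSE=%.4f\n", e / n, sqrt(se / n) }'
```

With these three pairs the errors (+0.5, -0.5, 0) cancel in the mean, so ME is 0 even though RMSE is not; that is why the plots show both statistics.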
@@ -8,7 +8,7 @@
#SBATCH --mem=128g
#SBATCH -t 01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks=1

source /scratch4/NAGAPE/epic/role-epic/miniconda/bin/activate
