diff --git a/README.md b/README.md
index 3084fbc..1e1df82 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,25 @@
-# global-eagle
+# Welcome to Eagle!
 
-Go to examples folder to run the full `ufs2arco + Anemoi + wxvx` pipeline, which includes:
-1) Use `ufs2arco` to create training and validation datasets
-2) Use `anemoi-core` modules to train a graph-based model
-3) Use `anemoi-inference` to run inference
-4) Use `wxvx` to verify forecasts
+This repository contains configurations to complete a machine learning pipeline for weather prediction! The directories here will guide you through the entire ML pipeline.
 
-See the examples folder to use 1 year of NOAA Replay Reanalysis data to train a model for 1,000 steps, create a forecast, and verify that forecast.
+The key steps include:
+1) Data preprocessing using ufs2arco to create training, validation, and test datasets
+2) Model training using anemoi-core modules to train a graph-based model
+3) Creating a forecast with anemoi-inference to run inference from a model checkpoint
+4) Verifying your forecast (or multiple!) with wxvx against gridded analysis or observations
 
-NOTE: This repository is currently under development.
+For more information about each step, please see our documentation: https://global-eagle.readthedocs.io/en/latest/
+
+---------------------
+
+Acknowledgments:
+
+ufs2arco: Tim Smith (NOAA Physical Sciences Laboratory)
+- https://github.com/NOAA-PSL/ufs2arco
+
+Anemoi: European Centre for Medium-Range Weather Forecasts
+- https://github.com/ecmwf/anemoi-core
+- https://github.com/ecmwf/anemoi-inference
+
+wxvx: Paul Madden (NOAA Global Systems Laboratory/Cooperative Institute for Research in Environmental Sciences)
+- https://github.com/maddenp-cu/wxvx
diff --git a/nested_eagle/README.md b/nested_eagle/README.md
index 66ad1c3..9dc259d 100644
--- a/nested_eagle/README.md
+++ b/nested_eagle/README.md
@@ -1,14 +1,31 @@
-conda
+There are two folders within this `nested_eagle` directory:
+1) `scientific_workflow` guides you through the whole ML pipeline to create training data, train a model, run inference, and verify a forecast.
+2) `operational_inference` provides scripts to run inference from a checkpoint in near real time. These scripts assume you have a checkpoint from completing `scientific_workflow` and want to run a near-real-time forecast with it.
+
+----------
+
+Before starting anything, you must create two conda environments.
+1) `eagle` environment to use for data creation, training, and inference
+2) `wxvx` environment to use for verification
+
+These environments have already been made for you on Ursa and can be found by running `source /scratch4/NAGAPE/epic/role-epic/miniconda/bin/activate`
+
+Then, simply activate the environments by running `conda activate eagle` or `conda activate wxvx`
+
+----------
+
+To create the necessary environments yourself, run the following commands:
+
+`eagle` environment to use for data creation, training, and inference:
 ```
-conda create -n eagle python=3.11
-conda activate eagle
 module load cuda gcc openmpi
-conda install -c conda-forge xesmf
-conda install -c conda-forge ufs2arco=0.17.1
-conda install -c conda-forge matplotlib cartopy cmocean
-pip install 'torch<2.7' anemoi-datasets==0.5.26 anemoi-graphs==0.6.4 anemoi-models==0.9.2 anemoi-training==0.6.2 anemoi-inference==0.7.1 anemoi-utils==0.4.35 anemoi-transform==0.1.16
+conda env create -f environment.yaml
+conda activate eagle
 pip install 'flash-attn<2.8' --no-build-isolation
-pip install eagle-tools
-conda install -c conda-forge esmf=8.7.0=nompi*
 ```
+
+`wxvx` environment to use for verification:
+```
+conda create -y -n wxvx -c ufs-community -c paul.madden wxvx -c conda-forge --override-channels
+```
\ No newline at end of file
diff --git a/environment.yaml b/nested_eagle/environment.yaml
similarity index 86%
rename from environment.yaml
rename to nested_eagle/environment.yaml
index 1d983a4..239df46 100644
--- a/environment.yaml
+++ b/nested_eagle/environment.yaml
@@ -4,7 +4,8 @@ channels:
   - defaults
 dependencies:
   - python=3.11
-  - ufs2arco
+  - ufs2arco=0.17.1
+  - "esmf=8.7.0=nompi*"
   - pip
   - pip:
     - torch<2.7
@@ -15,5 +16,4 @@ dependencies:
     - anemoi-inference==0.7.1
     - anemoi-utils==0.4.35
     - anemoi-transform==0.1.16
-    - flash-attn<2.8 --no-build-isolation
     - eagle-tools
diff --git a/nested_eagle/scientific_workflow/data/README.md b/nested_eagle/scientific_workflow/data/README.md
index 72ec855..ba33fee 100644
--- a/nested_eagle/scientific_workflow/data/README.md
+++ b/nested_eagle/scientific_workflow/data/README.md
@@ -1,9 +1,13 @@
-Run `sbatch submit_grids.sh` first.
+Run `sbatch submit_grids.sh`
+- This creates some static grid files that will be used for regridding later in the pipeline.
+- Once this has completed, you can move on to dataset creation.
+- Note: these static files can be reused, so if you run through this pipeline multiple times it may not be necessary to re-run this step every time.
 
-Run `sbatch submit_gfs.sh` next.
-Run `sbatch submit_hrrr.sh`.
+Run `sbatch submit_gfs.sh` followed by `sbatch submit_hrrr.sh`
+- You can run both of these at the same time.
+- One loads GFS data and the other loads HRRR data.
+- Ideally, we would submit these together in one job. However, we are currently restricted to 4 cores per job on Ursa, so splitting them makes the whole process go a bit faster.
 
-Ideally, we will just submit these together in one job. We are restricted to 4 cores per job on Ursa at the moment, so this makes the whole process go a bit faster.
-
-Additonally, we are only pulling in ~2 years of data. This takes about 10 hours to run. The maximum time for a service job on Ursa is 24 hours, so we are unable to pull in the fully archive (~approx 10 years). Once we (hopefully) can use more cores or run jobs for longer periods of time we can update this workflow to include all data, as its just a simple change in the yaml files.
\ No newline at end of file
+Note:
+We are only loading ~2 years of data right now. This takes about 10 hours to run. The maximum time for a service job on Ursa is 24 hours, so we are unable to pull in the full archive (~10 years). Once we (hopefully) can use more cores or run jobs for longer periods of time, we can update this workflow to include all data, as it's just a simple change in the yaml files.
\ No newline at end of file
diff --git a/nested_eagle/scientific_workflow/data/submit_grids.sh b/nested_eagle/scientific_workflow/data/submit_grids.sh
index 14592af..df3e6e2 100644
--- a/nested_eagle/scientific_workflow/data/submit_grids.sh
+++ b/nested_eagle/scientific_workflow/data/submit_grids.sh
@@ -8,7 +8,7 @@
 #SBATCH --mem=128g
 #SBATCH -t 01:00:00
 #SBATCH --nodes=1
-#SBATCH --ntasks=4
+#SBATCH --ntasks=1
 
 source /scratch4/NAGAPE/epic/role-epic/miniconda/bin/activate
 conda activate eagle
diff --git a/nested_eagle/scientific_workflow/inference/README.md b/nested_eagle/scientific_workflow/inference/README.md
index 3e863bc..6c3cb28 100644
--- a/nested_eagle/scientific_workflow/inference/README.md
+++ b/nested_eagle/scientific_workflow/inference/README.md
@@ -1,9 +1,9 @@
-First, modify the checkpoint path in `inference_config.yaml` with the path to your checkpoint data from the training step. If you have used the defaults, this will just be changing the run ID to the one noted during that step.
-Then, modify `submit_inference.sh` with your project account and the path to your miniconda installation.
-
-Finally run the following to submit a job to create a 10-day forecast:
+Run the following to submit a job to create a 10-day forecast:
 `sbatch submit_inference.sh`
 
 This will generate a NetCDF file with your forecast in the `inference_files` directory.
+
+Note:
+Within `inference_config.yaml` you will find a path to a checkpoint. The submit script updates that for you. However, if you have trained multiple models, you may have to edit this yourself to point at the specific run_id you wish to use a checkpoint from.
diff --git a/nested_eagle/scientific_workflow/training/README.md b/nested_eagle/scientific_workflow/training/README.md
index b8cd2fd..f1350fc 100644
--- a/nested_eagle/scientific_workflow/training/README.md
+++ b/nested_eagle/scientific_workflow/training/README.md
@@ -1,9 +1,18 @@
-Update path to your miniconda in `submit_training.sh`
+Run `sbatch submit_training.sh`
 
-Run:
-`sbatch submit_training.sh`
+After submission, go into the `outputs/` folder to monitor training. You will see:
 
-Feel free to just let this run until you get a checkpoint saved out and then cancel. For the purposes of getting this workflow finished we just need a checkpoint to move onto the next step.
+Logs
+- Found within a folder named for the date of your run (e.g. `2025-10-22`)
 
-Checkpoints will be saved in `outputs/` folder during training.
+Checkpoints
+- Found within a folder that matches the run_id of your training. It will resemble something like `cf574663-cfa7-4ff2-aafd-37fb5af6bef5`
+
+Plots
+- The plots folder will also contain run_id folders.
+
+-----------
+
+TODOs
+- We are currently not using multiple GPUs and need to implement that.
+- Add configurations for other types of models and graphs that you can try.
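The training README above describes two naming conventions inside `outputs/`: date-stamped log folders (`2025-10-22`) and UUID-style run_id folders (`cf574663-...`) for checkpoints and plots. A small illustrative sketch of telling them apart programmatically; the function name and regex patterns here are assumptions for illustration, not part of the repo:

```python
import re

# Patterns matching the folder names the training README describes:
# log folders like "2025-10-22", run_id folders like
# "cf574663-cfa7-4ff2-aafd-37fb5af6bef5".
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
UUID_RE = re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$")

def classify_outputs_entry(name: str) -> str:
    """Label an outputs/ folder name as 'logs', 'run', or 'other'."""
    if DATE_RE.match(name):
        return "logs"
    if UUID_RE.match(name):
        return "run"
    return "other"
```

This kind of check is handy when scripting over `outputs/` to collect the latest run_id automatically.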
diff --git a/nested_eagle/scientific_workflow/training/submit_training.sh b/nested_eagle/scientific_workflow/training/submit_training.sh
index f22949f..a64a8de 100644
--- a/nested_eagle/scientific_workflow/training/submit_training.sh
+++ b/nested_eagle/scientific_workflow/training/submit_training.sh
@@ -19,7 +19,5 @@
 module load cuda
 module load gcc
 export SLURM_GPUS_PER_NODE=1
 export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
-#srun anemoi-training train --config-name=config
-
 anemoi-training train --config-name=config
diff --git a/nested_eagle/scientific_workflow/validation/README.md b/nested_eagle/scientific_workflow/validation/README.md
index 952d505..5b46ee7 100644
--- a/nested_eagle/scientific_workflow/validation/README.md
+++ b/nested_eagle/scientific_workflow/validation/README.md
@@ -1,11 +1,7 @@
-Run postprocessing script in your eagle conda env:
+Run the postprocessing script in your `eagle` conda env:
 `python postprocess.py`
 
-Next, install wxvx:
-`conda create -y -n wxvx -c ufs-community -c paul.madden wxvx -c conda-forge --override-channels`
-
-Activate:
-`conda activate wxvx`
-
-Run:
+After post-processing is complete, run:
 `wxvx -c wxvx_lam.yaml -t plots`
+
+Now go to `run/plots/` and open some plots showing RMSE and ME!
diff --git a/nested_eagle/scientific_workflow/validation/submit_validation.sh b/nested_eagle/scientific_workflow/validation/submit_validation.sh
index 7e638d2..fed5879 100644
--- a/nested_eagle/scientific_workflow/validation/submit_validation.sh
+++ b/nested_eagle/scientific_workflow/validation/submit_validation.sh
@@ -8,7 +8,7 @@
 #SBATCH --mem=128g
 #SBATCH -t 01:00:00
 #SBATCH --nodes=1
-#SBATCH --ntasks=4
+#SBATCH --ntasks=1
 
 source /scratch4/NAGAPE/epic/role-epic/miniconda/bin/activate
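The validation plots mentioned above show RMSE and ME (mean error, i.e. bias). wxvx computes these for you; purely as a reference for interpreting the plots, the two metrics reduce to simple formulas over matched forecast/analysis values. A self-contained sketch (this is not how wxvx computes them internally):

```python
import math

def mean_error(forecast, analysis):
    """ME (bias): average of forecast-minus-analysis differences."""
    diffs = [f - a for f, a in zip(forecast, analysis)]
    return sum(diffs) / len(diffs)

def rmse(forecast, analysis):
    """Root-mean-square error over matched grid points."""
    sq_diffs = [(f - a) ** 2 for f, a in zip(forecast, analysis)]
    return math.sqrt(sum(sq_diffs) / len(sq_diffs))

# Toy example: four matched grid-point values
forecast = [1.0, 2.0, 3.0, 4.0]
analysis = [1.0, 1.0, 3.0, 2.0]
print(mean_error(forecast, analysis))  # 0.75
print(rmse(forecast, analysis))        # ~1.118 (sqrt of 1.25)
```

A positive ME means the forecast is biased high relative to the analysis; RMSE penalizes large errors more heavily since differences are squared.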