Skip to content

Conversation

@vtamsitt
Copy link

Added GLORYS 1/4 degree BGC dataset metadata to Copernicus.jl

A couple of things:

  1. There is no inpainting for sea ice variables because zeros are physical, not missing values'. This should also be the case for chlorophyll, phytoplankton, primary productivity I think? So should probably remove inpainting for these.
  2. I split the dicts of variable names up for the data streams that have different variables (similar to ECCO.jl), but the metadata_prefix variable naming hasn't been updated to reflect this yet. It would be good to add a download_dataset function for Copernicus.jl as well and I'm not sure the file naming convention from Copernicus matches what's currently in metadata_prefix.

@vtamsitt
Copy link
Author

Oh I see download_dataset is addressed in #522

@navidcy navidcy marked this pull request as ready for review September 4, 2025 23:59
@navidcy navidcy self-requested a review September 4, 2025 23:59
@navidcy
Copy link
Member

navidcy commented Sep 5, 2025

I'm happy to merge this.
@glwagner, @vtamsitt?

@navidcy navidcy added the data wrangling We must feed the models so they don't get cranky label Sep 5, 2025
@codecov
Copy link

codecov bot commented Sep 5, 2025

Codecov Report

❌ Patch coverage is 33.33333% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 18.69%. Comparing base (b3adc2f) to head (44c69f3).

Files with missing lines Patch % Lines
src/DataWrangling/Copernicus/Copernicus.jl 33.33% 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #607      +/-   ##
==========================================
- Coverage   23.06%   18.69%   -4.38%     
==========================================
  Files          47       47              
  Lines        2844     2857      +13     
==========================================
- Hits          656      534     -122     
- Misses       2188     2323     +135     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 78 to 82
available_variables(::GLORYSStatic) = copernicus_physics_dataset_variable_names
available_variables(::GLORYSDaily) = copernicus_physics_dataset_variable_names
available_variables(::GLORYSMonthly) = copernicus_physics_dataset_variable_names
available_variables(::GLORYSBGCDaily) = copernicus_bgc_daily_dataset_variable_names
available_variables(::GLORYSBGCMonthly) = copernicus_bgc_monthly_dataset_variable_names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how could we have available_variables give out different list of variables based on the dataset type if all variables are in one dictionary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's why I did them separately, especially because the BGC variables are non-overlapping

Copy link
Member

@glwagner glwagner Sep 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using a single "master" copernicus_bgc_monthly_dataset_variable_names, and then writing out the available variable names explicitly in these functions. That way we have a single readable reference for all of the variables that can be downloaded from copernicus. It might become important as the number of variables grows.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I reverted to a 'master' dict copernicus_dataset_variable_names and wrote out the variables explicitly in the avaiable_variables function. Is that what you were thinking?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@glwagner does this look ok now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, thank you! I can eliminate the repeated code by creating a more organized type hierarchy. do you mind if I commit to your branch?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go for it!

@glwagner
Copy link
Member

@vtamsitt I made these changes:

  • I changed available_variables to return a tuple (a simple list of names) rather than a dictionary (pairs of values linking a long and short name). As far as I can tell, available_variables is supposed to be a list of names (more generally, we don't want to repeat information).
  • I added an abstract type GLORYSDataset, which allows us to express the fact that the GLORYS physics has one set of variables, while each GLORYSBGC dataset has different variables. There is no overlapping information now.
  • I changed available_variables for GLORYSBGCMonthly to call available_variables(GLORYSBGCDaily()). This documents how the monthly variables are a superset of the daily variables.

If you approve I think this is ready to merge. There are some downloading errors but they seem to be unrelated (the copernicus tests pass).

:ph,
:surface_co2,
:total_phytoplankton,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vtamsitt this expresses how the monthly variables are a superset of the daily variables


# Datasets
abstract type CopernicusDataset end
abstract type GLORYSDataset <: CopernicusDataset end
Copy link
Member

@glwagner glwagner Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstract type allows us to implement one available_variables for the two physics datasets GLORYSMonthly, GLORYSDaily

@vtamsitt
Copy link
Author

@vtamsitt I made these changes:

  • I changed available_variables to return a tuple (a simple list of names) rather than a dictionary (pairs of values linking a long and short name). As far as I can tell, available_variables is supposed to be a list of names (more generally, we don't want to repeat information).
  • I added an abstract type GLORYSDataset, which allows us to express the fact that the GLORYS physics has one set of variables, while each GLORYSBGC dataset has different variables. There is no overlapping information now.
  • I changed available_variables for GLORYSBGCMonthly to call available_variables(GLORYSBGCDaily()). This documents how the monthly variables are a superset of the daily variables.

If you approve I think this is ready to merge. There are some downloading errors but they seem to be unrelated (the copernicus tests pass).

This all looks good to me and fine to merge @glwagner! Thanks for documenting the changes.

@glwagner
Copy link
Member

@navidcy do you know anything about these failures?

@simone-silvestri
Copy link
Collaborator

looks like the ecco username or ecco password is not recognized

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data wrangling We must feed the models so they don't get cranky

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants