
feat: Implement ingested_forecast_length utility and integrate with GFS (#412) #421

Open
ArkVex wants to merge 5 commits into dynamical-org:main from ArkVex:feature/ingested-forecast-length

Conversation


@ArkVex ArkVex commented Feb 5, 2026

Description

This PR implements the logic to calculate and populate the ingested_forecast_length coordinate for the GFS dataset, as requested in #412.

This metric helps downstream users determine the maximum available lead time for each initialization time, allowing them to filter for "complete" forecasts.

Changes

  1. Shared Utility: Created src/reformatters/common/ingest_stats.py with a new function update_ingested_forecast_length.
    • Defined a HasTimeInfo Protocol to ensure type safety when processing coordinates.
    • Logic identifies the maximum lead time per init_time and updates the dataset in place.
  2. GFS Integration: Updated src/reformatters/noaa/gfs/region_job.py.
    • Overrode update_template_with_results to call the new utility after the standard update process.
  3. Testing: Added unit tests in tests/common/test_ingest_stats.py to verify:
    • Correct calculation of max lead times.
    • Proper handling of updates (overwriting smaller values with larger ones).
    • Handling of pd.NaT and empty states.

Testing

  • Added new unit tests: tests/common/test_ingest_stats.py.
  • Verified type checking passes with ty (using type: ignore for specific pandas timedelta edge cases).

Related Issue

Closes #412

CC @aldenks @JackKelly

@ArkVex ArkVex marked this pull request as draft February 5, 2026 20:06
Author

ArkVex commented Feb 8, 2026

@aldenks A GENTLE reminder

Collaborator

JackKelly commented Feb 9, 2026

Hi @ArkVex, if the PR is ready for review then it might be worth updating the title of the PR, fixing the failing tests, and "requesting" a review via the "Reviewers" setting on the top right 🙂.

Also, it's worth noting that Alden is a busy guy! reformatters evolves faster than most open source projects, but it's not uncommon for PRs to open source projects to go for weeks before being reviewed.

Author

ArkVex commented Feb 9, 2026

Ohh, I apologise, I was not aware of this... sure, I'll fix the errors and change the pull request heading.
Thank you @JackKelly

Member

aldenks commented Feb 9, 2026

ah yeah @ArkVex, when you're ready for a review use the "Request Review" functionality github has in the right sidebar of the PR to ask for a review from me and i'll give you one! (i saw you marked as "draft" and the CI pipeline was failing and assumed it was a work in progress)

@ArkVex ArkVex changed the title from "fix: resolve type checking errors" to "feat: Implement ingested_forecast_length utility and integrate with GFS (#412)" Feb 11, 2026
@ArkVex ArkVex marked this pull request as ready for review February 11, 2026 00:32
Author

ArkVex commented Feb 11, 2026

Hi @aldenks and @JackKelly, I've checked the linting locally and it's passing. I'm having some trouble with the GitHub permissions for the 'Reviewers' sidebar, so I'm marking this as Ready for Review now. Please let me know if the CI failures on your end still look like linting issues to you!

(reformatters) PS C:\Users\Lenovo\OneDrive\Desktop\reformatters> uv run ruff check
All checks passed!

pd.Timestamp("2025-01-01 18:00"),
]

# We use 'cast' to silence the strict type checker here
Collaborator


Please remove this comment :)

Author


🫣

@JackKelly
Collaborator

I could be wrong but I think you can fix the ruff errors just by removing line 2 of test_ingest_stats.py (i.e. remove from typing import Any, cast)

Member

@aldenks aldenks left a comment


Thank you @ArkVex! This review is a little picky and I hope it doesn't scare you off.

Here's the two overarching motivations behind my review

  1. Philosophy towards comments: Write code that explains itself rather than code that needs comments. Comments are additional code to maintain, and less to maintain is better. They also roughly double the amount there is to read, which takes time and mental space, and worst of all they can fall out of sync with the actual code and cause more confusion. Instead, use variable and function names and organize function logic so the code itself tells you what it's doing (you've done that). Imo comments are great 1. to explain something surprising or out of the ordinary (a gotcha, or where an optimization requires less clear code) 2. in select places to help explain the "why" behind something if we can't make the code say it.
  2. Don't smooth over errors; rather, fail early if something doesn't match expectations (or expected failure modes). If an error is silently "handled" (including with just a log line), it's still there; it will just turn up somewhere later, far away from its source, and be harder to debug.

Please also add additional test coverage:

  • test that all ingested_forecast_length coordinate values that don't have init times in the process results are not modified
  • add a check in tests/noaa/gfs/forecast/dynamical_dataset_test.py that checks the coordinate value after the update step runs. This will ensure the updated template really threads through correctly and is written to the final zarr store. That test only processes a few lead times (for speed) so I'd expect the ingested_forecast_length for 2021-05-01T12:00:00 to be 3h

Comment on lines +26 to +30
if "ingested_forecast_length" not in template_ds.coords:
log.warning(
"ingested_forecast_length coordinate not found in template dataset."
)
return
Member


Suggested change
if "ingested_forecast_length" not in template_ds.coords:
log.warning(
"ingested_forecast_length coordinate not found in template dataset."
)
return
assert "ingested_forecast_length" in template_ds.coords

Comment on lines +13 to +16
# This Protocol tells the type checker: "Trust me, these objects have time info"
class HasTimeInfo(Protocol):
init_time: Timestamp
lead_time: Timedelta
Member


Suggested change
# This Protocol tells the type checker: "Trust me, these objects have time info"
class HasTimeInfo(Protocol):
init_time: Timestamp
lead_time: Timedelta
class DeterministicForecastSourceFileCoord(Protocol):
init_time: Timestamp
lead_time: Timedelta
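For context, a Protocol like this is purely structural: any class with matching init_time and lead_time attributes satisfies it, no inheritance required. A small sketch (GfsSourceFileCoord and max_lead_time are hypothetical illustrations, not repo code):

```python
from dataclasses import dataclass
from typing import Protocol

import pandas as pd


class DeterministicForecastSourceFileCoord(Protocol):
    init_time: pd.Timestamp
    lead_time: pd.Timedelta


@dataclass
class GfsSourceFileCoord:
    # Satisfies the Protocol structurally, without subclassing it.
    init_time: pd.Timestamp
    lead_time: pd.Timedelta
    url: str = ""  # extra attributes don't break conformance


def max_lead_time(
    coords: list[DeterministicForecastSourceFileCoord],
) -> pd.Timedelta:
    return max(c.lead_time for c in coords)
```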

Comment on lines +45 to +57
if init_time in template_ds.coords["init_time"]:
current_val = template_ds["ingested_forecast_length"].loc[
{"init_time": init_time}
]

# Use .values and pd.isnull to safely check for NaT (Not a Time)
if pd.isnull(current_val.values) or max_lead > current_val:
log.info(
f"Updating ingested_forecast_length for {init_time} to {max_lead}"
)
template_ds["ingested_forecast_length"].loc[
{"init_time": init_time}
] = max_lead
Member


We don't want to look at existing values because of the way we update datasets by overwriting everything in a shard, so overwriting with whatever we processed this run is correct. In practice, we make sure we're only adding to a dataset, but that happens outside of here.

Suggested change
if init_time in template_ds.coords["init_time"]:
current_val = template_ds["ingested_forecast_length"].loc[
{"init_time": init_time}
]
# Use .values and pd.isnull to safely check for NaT (Not a Time)
if pd.isnull(current_val.values) or max_lead > current_val:
log.info(
f"Updating ingested_forecast_length for {init_time} to {max_lead}"
)
template_ds["ingested_forecast_length"].loc[
{"init_time": init_time}
] = max_lead
template_ds["ingested_forecast_length"].loc[
{"init_time": init_time}
] = max_lead


def update_ingested_forecast_length(
template_ds: xr.Dataset,
results_coords: Sequence[HasTimeInfo],
Member


Let's allow callers to pass in the process_results directly and handle taking the max across variable names (the str keys in this Mapping are variable names) within this function, rather than needing to make all callers do the same flattening into a Sequence[DeterministicForecastSourceFileCoord].

Suggested change
results_coords: Sequence[HasTimeInfo],
results_coords: Mapping[str, Sequence[DeterministicForecastSourceFileCoord]],

Then also add to the docstring the note that "The maximum processed lead time across all variables is set as the ingested_forecast_length. This can hide the nuance of a specific variable having fewer lead times processed than others."
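The suggested signature could be handled along these lines. This is a sketch assuming the reviewer's Mapping shape; Coord is a hypothetical minimal coord class, not the repo's:

```python
from collections.abc import Mapping, Sequence
from dataclasses import dataclass

import pandas as pd


@dataclass
class Coord:
    # Hypothetical minimal coord with the two attributes used here.
    init_time: pd.Timestamp
    lead_time: pd.Timedelta


def max_lead_per_init_time(
    process_results: Mapping[str, Sequence[Coord]],
) -> dict[pd.Timestamp, pd.Timedelta]:
    """Max processed lead time per init time, across all variables.

    The maximum across all variables can hide the nuance of a specific
    variable having fewer lead times processed than others.
    """
    max_leads: dict[pd.Timestamp, pd.Timedelta] = {}
    for coords in process_results.values():  # keys are variable names
        for c in coords:
            if c.init_time not in max_leads or c.lead_time > max_leads[c.init_time]:
                max_leads[c.init_time] = c.lead_time
    return max_leads
```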

def update_ingested_forecast_length(
template_ds: xr.Dataset,
results_coords: Sequence[HasTimeInfo],
) -> None:
Member


Suggested change
) -> None:
) -> xr.Dataset:

Let's have this return the modified dataset so callers would do ds = update_ingested_forecast_length(...)

Comment on lines +158 to +172
# 1. Run the standard update logic from the parent class
# This returns the updated dataset
ds = super().update_template_with_results(process_results)

# 2. Extract the coordinates from the dictionary
# process_results is { "filename": [coord1, coord2], ... }
all_coords = []
for coord_list in process_results.values():
all_coords.extend(coord_list)

# 3. Run our new logic
update_ingested_forecast_length(ds, all_coords)

# 4. Return the modified dataset (Crucial!)
return ds
Member


Less code is easier-to-understand code!

Suggested change
# 1. Run the standard update logic from the parent class
# This returns the updated dataset
ds = super().update_template_with_results(process_results)
# 2. Extract the coordinates from the dictionary
# process_results is { "filename": [coord1, coord2], ... }
all_coords = []
for coord_list in process_results.values():
all_coords.extend(coord_list)
# 3. Run our new logic
update_ingested_forecast_length(ds, all_coords)
# 4. Return the modified dataset (Crucial!)
return ds
ds = super().update_template_with_results(process_results)
return update_ingested_forecast_length(ds, process_results)

@@ -0,0 +1,85 @@
from collections.abc import Mapping
from typing import Any, cast
Member


remove, unused

def test_update_ingested_forecast_length_update_existing() -> None:
init_time = pd.Timestamp("2025-01-01 12:00")

# Start with 6 hours already recorded
Member


Please remove all the comments in this file except for this one. This one is helpful because it highlights the case we're testing.

Author

ArkVex commented Feb 12, 2026

Thank you for the detailed feedback, @aldenks ! I really appreciate the breakdown of the project's philosophy. I’ll refactor the utility to handle the mapping directly, clean up the comments, and add those extra test cases today.

@ArkVex ArkVex marked this pull request as draft February 12, 2026 21:55
@ArkVex ArkVex marked this pull request as ready for review February 13, 2026 14:17
@ArkVex ArkVex requested a review from aldenks February 13, 2026 14:17
Author

ArkVex commented Feb 13, 2026

@aldenks I feel the PR is now ready for review... I tried my best to address your review comments as much as possible :)

Member

@aldenks aldenks left a comment


Hi @ArkVex, please do a closer self review of these changes.

Most of my comments weren't addressed or only partially addressed.

I'm pretty sure that after your changes this code will neither run nor type check. Make sure it does; instructions are in AGENTS.md.

Remove new Claude temp files.

If you have any questions about my comments I'm happy to answer, just tag me.

Author

ArkVex commented Feb 14, 2026

Nooo wayyy 😭
I will take some time before submitting for review... sorry @aldenks, I won't let Claude mess up the next PR 🙏
Thank you for your patience 😭 😭

@ArkVex ArkVex marked this pull request as draft February 14, 2026 17:59
Author

ArkVex commented Feb 14, 2026

@aldenks I've gone through all your feedback and made some changes, so I thought I should consult you before committing again... Here are the changes I have made. Please let me know if I missed anything :)

What I fixed:

  • Removed the HasTimeInfo Protocol
  • Simplified region_job.py to just 2 lines - now passes process_results directly
  • Removed the conditional check when updating coordinates (fail early like you mentioned)
  • Changed function to return the dataset instead of None
  • Replaced warning with assert
  • Cleaned up all the unnecessary comments in tests
  • Removed unused imports

Tests added:

  • Test for unmodified init_times staying unchanged
  • Integration test in dynamical_dataset_test.py checking ingested_forecast_length = 3h

Everything passes:

  • ty check ✓
  • ruff check ✓
  • all unit tests ✓
  • integration test ✓

Author

ArkVex commented Feb 14, 2026

I might have missed some things again... please let me know your thoughts.
And yes, I am open to criticism 😭

@aldenks aldenks marked this pull request as ready for review February 18, 2026 13:38
Member

@aldenks aldenks left a comment


Hi @ArkVex, didn't mean to leave you hanging. I'll be a bit slow this week, I'm on vacation.

You're much closer! I think some changes didn't get pushed?

I don't see the dynamical dataset integration test you mentioned

@@ -0,0 +1 @@
/c/Users/Lenovo/OneDrive/Desktop/reformatters
Member


Please remove all these tmpclaude files

Comment on lines +158 to +166
ds = super().update_template_with_results(process_results)

all_coords = []
for coord_list in process_results.values():
all_coords.extend(coord_list)

update_ingested_forecast_length(ds, all_coords)

return ds
Member


I think your changes here didn't get pushed?

lead_time: Timedelta


class HasTimeInfo(Protocol):
Member


Remove

def out_loc(self) -> Mapping[Dim, CoordinateValueOrRange]:
return {}


Member


We are missing a test that checks that the existing values in the array are not modified.

Member

@aldenks aldenks left a comment


Hi @ArkVex, didn't mean to leave you hanging, I've been on vacation so I'll be slower to respond this week.

This is much closer! See individual comments above.

I think some of your changes might not have been pushed. I don't see the integration test you mentioned.

Also look at the failed "Checks" logs each time you push if you aren't able to run them locally. There are some type checking errors in there now.

Author

ArkVex commented Feb 19, 2026

@aldenks I didn't push the code yet 😢
I just wanted you to confirm the changes that I mentioned above...
I am pushing the code now... please take your time to review, I am not in a hurry :)

]

empty_deltas = np.array([pd.NaT, pd.NaT], dtype="timedelta64[ns]")
empty_deltas = pd.to_timedelta([pd.NaT, pd.NaT]).to_numpy() # type: ignore[call-overload]
Author


@aldenks just a heads-up: this line was failing type checking, so I had to add a type: ignore for it.
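For reference, an all-NaT timedelta64[ns] array can also be built with NumPy alone, which sidesteps the pd.to_timedelta overloads entirely; whether ty then accepts it without an ignore is untested here:

```python
import numpy as np
import pandas as pd

# All-NaT timedelta64[ns] array built without pandas' to_timedelta
# overloads (which are what tripped the type checker):
empty_deltas = np.full(2, np.timedelta64("NaT"), dtype="timedelta64[ns]")

# Equivalent result via pandas:
via_pandas = pd.to_timedelta([pd.NaT, pd.NaT]).to_numpy()
```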

Member

@aldenks aldenks left a comment


Looking great @ArkVex. Next steps:

  • Fix test failures (see Checks logs)
  • add a test that checks that existing init times which are not processed (no source file coords for them) have their ingested forecast length coordinate values left unchanged
  • implement this for NOAA HRRR forecast 48 hour

After that we could implement for the ensemble datasets (NOAA GEFS and ECMWF IFS ENS) but I'd recommend leaving that as a follow up PR because they have an additional ensemble member dimension.

@dynamical-org dynamical-org deleted a comment from ArkVex Feb 25, 2026


Development

Successfully merging this pull request may close these issues.

Calculate and write ingested_forecast_length for GEFS, GFS, NOAA, and ECMWF IFS ENS forecast datasets

3 participants