feat: Implement ingested_forecast_length utility and integrate with GFS (#412) #421
ArkVex wants to merge 5 commits into dynamical-org:main from
Conversation
@aldenks A GENTLE reminder

Hi @ArkVex, if the PR is ready for review then it might be worth updating the title of the PR, fixing the failing tests, and requesting a review via the "Reviewers" setting on the top right 🙂. Also, it's worth noting that Alden is a busy guy!

Ohh, I apologise, I was not aware of this... sure, I'll fix the errors and change the pull request heading.

ah yeah @ArkVex, when you're ready for a review use the "Request Review" functionality GitHub has in the right sidebar of the PR to ask for a review from me and I'll give you one! (I saw you marked it as "draft" and the CI pipeline was failing, so I assumed it was a work in progress)

Hi @aldenks and @JackKelly, I've checked the linting locally and it's passing. I'm having some trouble with the GitHub permissions for the "Reviewers" sidebar, so I'm marking this as Ready for Review now. Please let me know if the CI failures on your end still look like linting issues to you!
tests/common/test_ingest_stats.py
Outdated
```python
    pd.Timestamp("2025-01-01 18:00"),
]

# We use 'cast' to silence the strict type checker here
```
Please remove this comment :)
I could be wrong, but I think you can fix the ruff errors just by removing line 2 of
aldenks
left a comment
Thank you @ArkVex! This review is a little picky and i hope it doesn't scare you off.
Here's the two overarching motivations behind my review
- Philosophy towards comments: Write code that explains itself rather than code that needs comments. Comments are additional code to maintain, and less to maintain is better. They also make it ~2x more text to read, which takes time and mental space, and worst of all they can fall out of sync with the actual code and cause more confusion. Instead, use variable and function names and organize function logic so the code itself tells you what it's doing (you've done that). Imo comments are great to 1. explain something surprising or out of the ordinary (a gotcha, or where an optimization requires less clear code) 2. in select places to help explain the "why" behind something if we can't make the code say it.
- Don't smooth over errors; rather, fail early if something doesn't match expectations (or expected failure modes). If an error is silently (including with just a log line) "handled", it's still there; it will just turn up somewhere later, far away from its source, and be harder to debug.
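The fail-early point above can be sketched as follows, using a plain dict as a stand-in for the template dataset's coords (names are illustrative, not the actual module code):

```python
import logging

log = logging.getLogger(__name__)

# Stand-in for template_ds.coords; illustrative only.
coords = {"init_time": [0, 1, 2], "ingested_forecast_length": [None, None, None]}

# Smoothing over (discouraged): the bad state survives the warning and
# resurfaces far from its source, where it is harder to debug.
if "ingested_forecast_length" not in coords:
    log.warning("ingested_forecast_length coordinate not found")

# Failing early (preferred): a violated expectation stops execution
# right here, at its source.
assert "ingested_forecast_length" in coords
```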
Please also add additional test coverage:
- test that all ingested_forecast_length coordinate values that don't have init times in the process results are not modified
- add a check in `tests/noaa/gfs/forecast/dynamical_dataset_test.py` that checks the coordinate value after the update step runs. This will ensure the updated template really threads through correctly and is written to the final zarr store. That test only processes a few lead times (for speed) so I'd expect the ingested_forecast_length for 2021-05-01T12:00:00 to be 3h
```python
if "ingested_forecast_length" not in template_ds.coords:
    log.warning(
        "ingested_forecast_length coordinate not found in template dataset."
    )
    return
```

Suggested change:

```python
assert "ingested_forecast_length" in template_ds.coords
```
```python
# This Protocol tells the type checker: "Trust me, these objects have time info"
class HasTimeInfo(Protocol):
    init_time: Timestamp
    lead_time: Timedelta
```

Suggested change:

```python
class DeterministicForecastSourceFileCoord(Protocol):
    init_time: Timestamp
    lead_time: Timedelta
```
```python
if init_time in template_ds.coords["init_time"]:
    current_val = template_ds["ingested_forecast_length"].loc[
        {"init_time": init_time}
    ]

    # Use .values and pd.isnull to safely check for NaT (Not a Time)
    if pd.isnull(current_val.values) or max_lead > current_val:
        log.info(
            f"Updating ingested_forecast_length for {init_time} to {max_lead}"
        )
        template_ds["ingested_forecast_length"].loc[
            {"init_time": init_time}
        ] = max_lead
```
We don't want to look at existing values because of the way we update datasets by overwriting everything in a shard, so overwriting with whatever we processed this run is correct. In practice, we make sure we're only adding to a dataset, but that happens outside of here.
Suggested change:

```python
template_ds["ingested_forecast_length"].loc[
    {"init_time": init_time}
] = max_lead
```
```python
def update_ingested_forecast_length(
    template_ds: xr.Dataset,
    results_coords: Sequence[HasTimeInfo],
```
let's allow callers to pass in the process_results directly and handle taking the max across variable names (the str key in this Mapping are variable names) within this function, rather than needing to make all callers do the same flattening into a Sequence[DeterministicForecastSourceFileCoord]
Suggested change:

```python
    results_coords: Mapping[str, Sequence[DeterministicForecastSourceFileCoord]],
```
Then also add to the docstring the note that "The maximum processed lead time across all variables is set as the ingested_forecast_length. This can hide the nuance of a specific variable having fewer lead times processed than others."
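A sketch of the reduction the comment above describes: take the process results mapping directly and compute the max lead time per init time across all variables. The `Coord` dataclass is a hypothetical stand-in for `DeterministicForecastSourceFileCoord`, and the helper name is illustrative, not the actual repo code:

```python
from collections.abc import Mapping, Sequence
from dataclasses import dataclass

import pandas as pd


@dataclass
class Coord:
    """Hypothetical stand-in for DeterministicForecastSourceFileCoord."""

    init_time: pd.Timestamp
    lead_time: pd.Timedelta


def max_lead_time_per_init_time(
    process_results: Mapping[str, Sequence[Coord]],
) -> dict[pd.Timestamp, pd.Timedelta]:
    """Max processed lead time per init time, across all variables.

    The maximum across all variables can hide the nuance of a specific
    variable having fewer lead times processed than others.
    """
    max_leads: dict[pd.Timestamp, pd.Timedelta] = {}
    for coords in process_results.values():  # keys are variable names
        for c in coords:
            existing = max_leads.get(c.init_time)
            if existing is None or c.lead_time > existing:
                max_leads[c.init_time] = c.lead_time
    return max_leads
```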
```python
def update_ingested_forecast_length(
    template_ds: xr.Dataset,
    results_coords: Sequence[HasTimeInfo],
) -> None:
```

Suggested change:

```python
) -> xr.Dataset:
```
let's have this return the modified dataset so callers would do `ds = update_ingested_forecast_length(...)`
```python
# 1. Run the standard update logic from the parent class
# This returns the updated dataset
ds = super().update_template_with_results(process_results)

# 2. Extract the coordinates from the dictionary
# process_results is { "filename": [coord1, coord2], ... }
all_coords = []
for coord_list in process_results.values():
    all_coords.extend(coord_list)

# 3. Run our new logic
update_ingested_forecast_length(ds, all_coords)

# 4. Return the modified dataset (Crucial!)
return ds
```
less code is easier to understand!

Suggested change:

```python
ds = super().update_template_with_results(process_results)
return update_ingested_forecast_length(ds, process_results)
```
tests/common/test_ingest_stats.py
Outdated
```diff
@@ -0,0 +1,85 @@
+from collections.abc import Mapping
+from typing import Any, cast
```
tests/common/test_ingest_stats.py
Outdated
```python
def test_update_ingested_forecast_length_update_existing() -> None:
    init_time = pd.Timestamp("2025-01-01 12:00")

    # Start with 6 hours already recorded
```
Please remove all the comments in this file except for this one. This one is helpful because it highlights the case we're testing
Thank you for the detailed feedback, @aldenks! I really appreciate the breakdown of the project's philosophy. I'll refactor the utility to handle the mapping directly, clean up the comments, and add those extra test cases today.

@aldenks I feel the PR is now ready for review... I tried my best to implement your review comments as much as possible :)
Hi @ArkVex, please do a closer self-review of these changes.
Most of my comments weren't addressed, or were only partially addressed.
I'm pretty sure that after your changes this code will neither run nor type check. Make sure it does; instructions are in AGENTS.md.
Remove the new Claude temp files.
If you have any questions about my comments I'm happy to answer, just tag me.
Nooo wayyy 😭

@aldenks I've gone through all your feedback and made some changes, so I thought I must consult you before committing again... Here are the changes that I have done. Plz lemme know if I missed anything :) What I fixed:
Tests added:
Everything passes:

I might have missed some things again... plz lemme know your thoughts.
tmpclaude-0904-cwd
Outdated
```diff
@@ -0,0 +1 @@
+/c/Users/Lenovo/OneDrive/Desktop/reformatters
```
Please remove all these tmpclaude files
```python
ds = super().update_template_with_results(process_results)

all_coords = []
for coord_list in process_results.values():
    all_coords.extend(coord_list)

update_ingested_forecast_length(ds, all_coords)

return ds
```
I think your changes here didn't get pushed?
```python
    lead_time: Timedelta


class HasTimeInfo(Protocol):
```
```python
    def out_loc(self) -> Mapping[Dim, CoordinateValueOrRange]:
        return {}
```
We are missing a test that checks that the existing values in the array are not modified.
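A sketch of that missing test case, using a pandas Series as a lightweight stand-in for the `ingested_forecast_length` coordinate; the real test would go through `update_ingested_forecast_length` on an xarray Dataset:

```python
import pandas as pd

# Existing coordinate values for two init times.
init_times = pd.to_datetime(["2025-01-01 00:00", "2025-01-01 06:00"])
lengths = pd.Series(
    [pd.Timedelta(hours=6), pd.Timedelta(hours=12)], index=init_times
)

# Only the first init time was processed this run.
processed = {pd.Timestamp("2025-01-01 00:00"): pd.Timedelta(hours=3)}
for init_time, max_lead in processed.items():
    lengths.loc[init_time] = max_lead

# The processed init time is overwritten with this run's max lead...
assert lengths.loc[pd.Timestamp("2025-01-01 00:00")] == pd.Timedelta(hours=3)
# ...while the unprocessed init time is left unchanged.
assert lengths.loc[pd.Timestamp("2025-01-01 06:00")] == pd.Timedelta(hours=12)
```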
aldenks
left a comment
There was a problem hiding this comment.
Hi @ArkVex, I didn't mean to leave you hanging; I've been on vacation so I'll be slower to respond this week.
This is much closer! See individual comments above.
I think some of your changes might not have been pushed. I don't see the integration test you mentioned.
Also, look at the failed "Checks" logs each time you push if you aren't able to run them locally. There are some type checking errors in there now.
@aldenks I didn't push the code yet 😢
```diff
 ]

-empty_deltas = np.array([pd.NaT, pd.NaT], dtype="timedelta64[ns]")
+empty_deltas = pd.to_timedelta([pd.NaT, pd.NaT]).to_numpy()  # type: ignore[call-overload]
```
@aldenks just a heads-up, this line was failing type checking so I had to ignore it.
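For reference, here are two ways to build an all-NaT timedelta array; the numpy-only form may sidestep the pandas overload that trips the type checker, though I haven't verified it against `ty`:

```python
import numpy as np
import pandas as pd

# pandas route (the one in the PR, which needed a type: ignore)
empty_deltas_pd = pd.to_timedelta([pd.NaT, pd.NaT]).to_numpy()

# numpy-only route: fill directly with NaT timedelta values
empty_deltas_np = np.full(2, np.timedelta64("NaT"), dtype="timedelta64[ns]")

# Both produce a timedelta64[ns] array of NaT values.
assert empty_deltas_pd.dtype == np.dtype("timedelta64[ns]")
assert np.isnat(empty_deltas_np).all()
```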
Looking great @ArkVex. Next steps:
- Fix test failures (see Checks logs)
- add test that checks that existing init times which are not processed (no source file coords for them) have their ingested forecast length coordinate values left unchanged
- implement this for NOAA HRRR forecast 48 hour
After that we could implement for the ensemble datasets (NOAA GEFS and ECMWF IFS ENS) but I'd recommend leaving that as a follow up PR because they have an additional ensemble member dimension.
Description

This PR implements the logic to calculate and populate the `ingested_forecast_length` coordinate for the GFS dataset, as requested in #412. This metric helps downstream users determine the maximum available lead time for each initialization time, allowing them to filter for "complete" forecasts.

Changes

- Added `src/reformatters/common/ingest_stats.py` with a new function `update_ingested_forecast_length`.
- Added a `HasTimeInfo` Protocol to ensure type safety when processing coordinates.
- The function calculates the maximum processed lead time per `init_time` and updates the dataset in place.
- Updated `src/reformatters/noaa/gfs/region_job.py`: overrode `update_template_with_results` to call the new utility after the standard update process.
- Added `tests/common/test_ingest_stats.py` to verify the new logic, including handling of `pd.NaT` and empty states.

Testing

- Unit tests pass in `tests/common/test_ingest_stats.py`.
- Type checking passes with `ty` (using `type: ignore` for specific pandas timedelta edge cases).

Related Issue

Closes #412

CC @aldenks @JackKelly