fix(STEF-2854): handle backtest robustness issues#837

Merged
egordm merged 17 commits into release/v4.0.0 from fix/STEF-2854-handle-insufficient-training-data on Mar 19, 2026

@egordm egordm commented Mar 13, 2026

Summary

Robustness improvements discovered during STEF-2854 model comparison backtest (ensemble benchmarking on STEF50 dataset).

Fixes

  1. InsufficientlyCompleteError raised during WindowedMetricVisualization.format_html(): metric_line can be shorter than reference_line when the metric time series is incomplete.

    • File: packages/openstef-beam/src/openstef_beam/evaluation/visualization/windowed_metric_visualization.py
  2. Too strict data validation in InsufficientlyCompleteError — the error message included the number of unique hours, but the check should only verify that the data is not empty.

    • File: packages/openstef-core/src/openstef_core/exceptions.py
  3. Pydantic serialization warning for Quantile type: Quantile.__get_pydantic_core_schema__ lacked a serializer, causing UserWarning: Pydantic serializer warnings when serializing models containing Quantile fields.

    • File: packages/openstef-core/src/openstef_core/types.py
  4. Broadcast shape mismatch in LearnedWeightsCombiner.fit(): combine_forecast_input_datasets uses an inner join that can drop rows, but labels was computed before the join, causing a shape mismatch.

    • File: packages/openstef-meta/src/openstef_meta/models/forecast_combiners/learned_weights_combiner.py
  5. IndexError in chronological_train_test_split with near-empty datasets — when a dataset has fewer than 2 unique timestamps, the split function crashes with index 1 is out of bounds.

    • File: packages/openstef-models/src/openstef_models/utils/data_split.py
  6. Empty data after inner join in combiner crashes downstream — when _prepare_input_data inner join produces empty data (additional features have different datetime index), the combiner's predict/score path crashes with ValueError: Input data must be 2 dimensional and non empty or ValueError: Input contains NaN.

    • Fix: _prepare_input_data now validates the result isn't empty after the inner join and raises InsufficientlyCompleteError — which naturally propagates up and is caught by the backtest harness, retaining the previous model.
    • File: packages/openstef-meta/src/openstef_meta/models/forecast_combiners/learned_weights_combiner.py
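
Fixes 4 and 6 both come down to rows lost in an inner join. The pattern can be sketched roughly as follows, assuming plain pandas objects. `join_and_align` and the local `InsufficientlyCompleteError` class are illustrative stand-ins, not the actual `combine_forecast_input_datasets` or `_prepare_input_data` code.

```python
import pandas as pd


class InsufficientlyCompleteError(Exception):
    """Local stand-in for the openstef_core exception of the same name."""


def join_and_align(
    predictions: pd.DataFrame,
    additional_features: pd.DataFrame,
    labels: pd.Series,
) -> tuple[pd.DataFrame, pd.Series]:
    """Hypothetical helper: inner-join inputs, then align labels to the surviving rows."""
    combined = predictions.join(additional_features, how="inner")
    if combined.empty:
        # Fix 6: fail early with a domain error instead of crashing downstream.
        raise InsufficientlyCompleteError("Inner join produced no rows.")
    # Fix 4: select labels AFTER the join so both share the same index.
    # Assumes labels covers every timestamp in predictions.
    return combined, labels.loc[combined.index]


idx = pd.date_range("2026-01-01", periods=4, freq=pd.Timedelta(hours=1))
predictions = pd.DataFrame({"lgbm": [1.0, 2.0, 3.0, 4.0]}, index=idx)
features = pd.DataFrame({"temperature": [10.0, 11.0]}, index=idx[1:3])
labels = pd.Series([1.1, 2.1, 3.1, 4.1], index=idx)

combined, aligned_labels = join_and_align(predictions, features, labels)

# Features on a disjoint index leave nothing after the join.
shifted = features.set_index(features.index + pd.Timedelta(days=365))
try:
    join_and_align(predictions, shifted, labels)
    raised = False
except InsufficientlyCompleteError:
    raised = True
```

Raising a domain-specific error at the join keeps the failure close to its cause, so a caller that already handles insufficient data needs no extra case.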

Features

  1. nan_aware_weighted_mean helper — extracted reusable NaN-aware weighted mean with weight renormalization. Used by ensemble combiner.

    • File: packages/openstef-core/src/openstef_core/utils/math.py
    • Tests: packages/openstef-core/tests/test_math.py
  2. skip_analysis parameter for BenchmarkPipeline.run() — allows skipping the analysis phase during backtesting when only predictions are needed.

    • File: packages/openstef-beam/src/openstef_beam/benchmarking/benchmark_pipeline.py
  3. filterings override in AnalysisConfig — allows manual selection of which filterings (LeadTime, AvailableAt) are included in comparison analysis. Defaults to None (all subsets included, backward compatible).

    • File: packages/openstef-beam/src/openstef_beam/analysis/analysis_pipeline.py

Testing

  • All existing tests pass
  • Added test_quantile_serialization_no_warnings for fix 3
  • Added tests for nan_aware_weighted_mean helper

…aining

OpenSTEF4BacktestForecaster.fit() now catches InsufficientlyCompleteError
alongside FlatlinerDetectedError. When a training window has insufficient
non-NaN data, the training event is skipped and the previous model is
retained instead of crashing the entire target backtest.

Signed-off-by: Egor Dmitriev <[email protected]>
@egordm egordm requested a review from a team March 13, 2026 11:05
@github-actions github-actions bot added the fix label Mar 13, 2026
egordm added 3 commits March 13, 2026 12:26
… test

Use all-NaN load data with model_reuse_enable=False to trigger
InsufficientlyCompleteError naturally instead of patching workflow.fit.

Signed-off-by: Egor Dmitriev <[email protected]>
When the first fit fails due to InsufficientlyCompleteError, _workflow
stays None. predict() now returns None (like flatliner) instead of
raising NotFittedError, letting the benchmark pipeline skip gracefully.

Signed-off-by: Egor Dmitriev <[email protected]>
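
The control flow described in the commits above (skip the failed training event, keep the previous model, return None from predict when nothing was ever fitted) can be sketched as below. This is a simplified illustration under assumptions: the exception classes are local stand-ins, and `train_model` and `BacktestForecasterSketch` are hypothetical names, not the real `OpenSTEF4BacktestForecaster`.

```python
from __future__ import annotations


class InsufficientlyCompleteError(Exception):
    """Local stand-in for the openstef_core exception of the same name."""


class FlatlinerDetectedError(Exception):
    """Local stand-in for the flatliner exception mentioned above."""


def train_model(window: list[float | None]) -> float:
    """Hypothetical trainer: the 'model' is the mean of the non-missing values."""
    values = [v for v in window if v is not None]
    if not values:
        raise InsufficientlyCompleteError("Training window has no usable data.")
    return sum(values) / len(values)


class BacktestForecasterSketch:
    def __init__(self) -> None:
        self.model: float | None = None  # stays None until a fit succeeds

    def fit(self, window: list[float | None]) -> bool:
        """Return True if a new model was trained, False if the event was skipped."""
        try:
            new_model = train_model(window)
        except (InsufficientlyCompleteError, FlatlinerDetectedError):
            return False  # keep whatever model we already had
        self.model = new_model
        return True

    def predict(self) -> float | None:
        # None signals "no model yet" so the caller can skip gracefully.
        return self.model


forecaster = BacktestForecasterSketch()
first = forecaster.fit([None, None])   # fails, nothing fitted yet
before = forecaster.predict()
second = forecaster.fit([1.0, 3.0])    # succeeds
third = forecaster.fit([None])         # fails, previous model is retained
after = forecaster.predict()
```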
Skip runs/targets with no windowed metrics instead of raising ValueError.
Returns an HTML placeholder when all items in a visualization are empty.

Signed-off-by: Egor Dmitriev <[email protected]>
@egordm egordm changed the title from "fix(STEF-2854): handle InsufficientlyCompleteError during backtest training" to "fix(STEF-2854): handle backtest robustness issues" Mar 13, 2026
egordm added 3 commits March 13, 2026 16:42
… serialization

Two fixes:
1. learned_weights_combiner.py: Filter labels to match combined_data
   index after inner join drops rows from additional_features.
   Fixes ValueError: operands could not be broadcast together.
2. types.py: Add Pydantic serializer to Quantile to suppress
   PydanticSerializationUnexpectedValue warnings.

Signed-off-by: Egor Dmitriev <[email protected]>

Signed-off-by: Egor Dmitriev <[email protected]>
Quantile.__get_pydantic_core_schema__ only defined a validator but no
serializer. When Quantile values appear as dict keys in a union type
(e.g., QuantileOrGlobal = Quantile | Literal['global']), Pydantic emits
PydanticSerializationUnexpectedValue warnings during model_dump_json().

Add a plain_serializer_function_ser_schema(float) so Pydantic knows
how to serialize Quantile as a float, preventing the warning.

Signed-off-by: Egor Dmitriev <[email protected]>
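
As a rough illustration of the serializer change described above, the sketch below attaches `plain_serializer_function_ser_schema(float)` to a float subclass. The `Quantile` class here is a simplified stand-in written for this example, and `Forecast` is a hypothetical model. The real implementation in `types.py` may differ.

```python
from typing import Any

from pydantic import BaseModel, GetCoreSchemaHandler
from pydantic_core import core_schema


class Quantile(float):
    """Simplified stand-in: a float restricted to the range [0, 1]."""

    def __new__(cls, value: float) -> "Quantile":
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"Quantile must be in [0, 1], got {value}")
        return super().__new__(cls, value)

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        return core_schema.no_info_after_validator_function(
            cls,              # validate as float first, then wrap in Quantile
            handler(float),
            # Tell Pydantic explicitly how to serialize: as a plain float.
            serialization=core_schema.plain_serializer_function_ser_schema(float),
        )


class Forecast(BaseModel):
    """Hypothetical model with a Quantile field."""

    quantile: Quantile


forecast = Forecast(quantile=0.9)
dumped = forecast.model_dump()
dumped_json = forecast.model_dump_json()
```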
Signed-off-by: Egor Dmitriev <[email protected]>
…in train/test split

chronological_train_test_split crashed with IndexError when the dataset
had fewer than 2 unique timestamps. This happens during ensemble backtest
when a base forecaster's preprocessed data is empty. Now raises
InsufficientlyCompleteError which is caught by the backtest harness.

Signed-off-by: Egor Dmitriev <[email protected]>
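
A guard of the kind described above might look like the following sketch. It assumes a DataFrame indexed by timestamp. `chronological_split_sketch`, its `test_fraction` parameter, and the local exception class are illustrative names, not the actual `chronological_train_test_split` signature.

```python
import pandas as pd


class InsufficientlyCompleteError(Exception):
    """Local stand-in for the openstef_core exception of the same name."""


def chronological_split_sketch(
    data: pd.DataFrame, test_fraction: float = 0.2
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Hypothetical split: the latest timestamps become the test set."""
    unique_ts = data.index.unique().sort_values()
    if len(unique_ts) < 2:
        # Without this check, indexing into unique_ts below could raise IndexError.
        raise InsufficientlyCompleteError(
            f"Need at least 2 unique timestamps to split, got {len(unique_ts)}."
        )
    # Keep at least one timestamp on each side of the split.
    n_test = min(max(1, int(len(unique_ts) * test_fraction)), len(unique_ts) - 1)
    split_ts = unique_ts[-n_test]
    return data[data.index < split_ts], data[data.index >= split_ts]


idx = pd.date_range("2026-01-01", periods=10, freq=pd.Timedelta(hours=1))
frame = pd.DataFrame({"load": range(10)}, index=idx)
train, test = chronological_split_sketch(frame)

try:
    chronological_split_sketch(frame.iloc[:1])
    raised = False
except InsufficientlyCompleteError:
    raised = True
```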
@egordm egordm force-pushed the fix/STEF-2854-handle-insufficient-training-data branch 2 times, most recently from 2b726c0 to e1cd1b2 Compare March 18, 2026 10:21
@egordm egordm force-pushed the fix/STEF-2854-handle-insufficient-training-data branch from e1cd1b2 to 625abfe Compare March 18, 2026 10:24
egordm added 3 commits March 18, 2026 11:52
…scope

The RUN_AND_GROUP scope was saving directly to the base analysis dir,
making has_analysis_output fail to locate it and colliding with group-level
outputs. Store in a 'global' subdirectory to match the group-level pattern
(group_name/global).

Signed-off-by: Egor Dmitriev <[email protected]>
…ons are NaN

When a base model cannot predict certain timestamps (e.g. gblinear limited
to 2-day weather horizon while lgbm predicts 7 days), the combiner must
redistribute the missing model's weight proportionally to the remaining
models.

Previously, pandas sum(axis=1, skipna=True) silently dropped the NaN
model's weight contribution, causing predictions to be systematically
scaled down by ~35% for timestamps beyond the weather horizon.

Now weights are reindexed to match predictions, zeroed where predictions
are NaN, and the weighted sum is divided by the available weight total.
When all models are NaN, the result is 0 (matching prior behavior).

Includes regression test with seeded data verifying no NaN propagation
and no systematic downscaling.

Signed-off-by: Egor Dmitriev <[email protected]>
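
The renormalization described above can be sketched with pandas as follows. This is an illustration under assumptions, not the merged helper: the function name `nan_aware_weighted_mean_sketch`, the 0.65/0.35 weights, and the example values are made up for the demonstration.

```python
import numpy as np
import pandas as pd


def nan_aware_weighted_mean_sketch(
    predictions: pd.DataFrame, weights: pd.DataFrame
) -> pd.Series:
    """Weighted mean per row, redistributing the weight of NaN predictions."""
    # Align weights to the predictions; missing weights count as zero.
    weights = weights.reindex(
        index=predictions.index, columns=predictions.columns
    ).fillna(0.0)
    # Zero the weight wherever the model produced no prediction.
    available = weights.where(predictions.notna(), 0.0)
    weighted_sum = (predictions.fillna(0.0) * available).sum(axis=1)
    total = available.sum(axis=1)
    # Divide by the weight that is actually available; rows with no
    # available weight become NaN here and are then set to 0.
    return (weighted_sum / total.where(total > 0)).fillna(0.0)


predictions = pd.DataFrame(
    {"lgbm": [100.0, 100.0, np.nan], "gblinear": [80.0, np.nan, np.nan]}
)
weights = pd.DataFrame({"lgbm": [0.65] * 3, "gblinear": [0.35] * 3})

result = nan_aware_weighted_mean_sketch(predictions, weights)
# The earlier behaviour for comparison: NaN terms are silently dropped.
naive = (predictions * weights).sum(axis=1, skipna=True)
```

In the second row only lgbm has a prediction. The naive sum yields roughly 65, which is 35% below the lgbm value of 100, while the renormalized version returns roughly 100.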
Signed-off-by: Egor Dmitriev <[email protected]>
@egordm egordm force-pushed the fix/STEF-2854-handle-insufficient-training-data branch from d926bd6 to d2f64f4 Compare March 18, 2026 16:43
egordm added 5 commits March 18, 2026 20:45
Extract NaN-aware weight renormalization into a reusable helper in
openstef_core.utils.pandas and use it in learned_weights_combiner.
Removes type: ignore comments from _predict_quantile.

Signed-off-by: Egor Dmitriev <[email protected]>
Allows skipping per-target and global analysis steps when running
benchmarks. Useful when analysis will be run separately later
via the comparison pipeline.

Signed-off-by: Egor Dmitriev <[email protected]>
@lschilders lschilders left a comment

LGTM! (mind the Sonarcloud error in one of the files)

@egordm egordm merged commit 7cf3859 into release/v4.0.0 Mar 19, 2026
4 checks passed
@egordm egordm deleted the fix/STEF-2854-handle-insufficient-training-data branch March 19, 2026 10:23