fix(STEF-2854): handle backtest robustness issues by egordm · Pull Request #837 · OpenSTEF/openstef

egordm · 2026-03-13T11:05:08Z

Summary

Robustness fixes for issues discovered during the STEF-2854 model comparison backtest (ensemble benchmarking on the STEF50 dataset).

Fixes

1. InsufficientlyCompleteError raised during WindowedMetricVisualization.format_html(): metric_line can be shorter than reference_line when the metric time series is incomplete.
   - File: packages/openstef-beam/src/openstef_beam/evaluation/visualization/windowed_metric_visualization.py
2. Too strict data validation in InsufficientlyCompleteError: the error message included the number of unique hours, but the check should only verify that the data is not empty.
   - File: packages/openstef-core/src/openstef_core/exceptions.py
3. Pydantic serialization warning for Quantile type: Quantile.__get_pydantic_core_schema__ lacked a serializer, causing UserWarning: Pydantic serializer warnings when serializing models containing Quantile fields.
   - File: packages/openstef-core/src/openstef_core/types.py
4. Broadcast shape mismatch in LearnedWeightsCombiner.fit(): combine_forecast_input_datasets uses an inner join that can drop rows, but labels was computed before the join, causing a shape mismatch.
   - File: packages/openstef-meta/src/openstef_meta/models/forecast_combiners/learned_weights_combiner.py
5. IndexError in chronological_train_test_split with near-empty datasets: when a dataset has fewer than 2 unique timestamps, the split function crashes with index 1 is out of bounds.
   - File: packages/openstef-models/src/openstef_models/utils/data_split.py
6. Empty data after inner join in combiner crashes downstream: when the inner join in _prepare_input_data produces empty data (additional features have a different datetime index), the combiner's predict/score path crashes with ValueError: Input data must be 2 dimensional and non empty or ValueError: Input contains NaN.
   - Fix: _prepare_input_data now validates that the result isn't empty after the inner join and raises InsufficientlyCompleteError, which naturally propagates up and is caught by the backtest harness, retaining the previous model.
   - File: packages/openstef-meta/src/openstef_meta/models/forecast_combiners/learned_weights_combiner.py
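The empty-join guard from the last fix can be illustrated with a small pandas sketch. This is not the code from the PR: `inner_join_or_raise` is a hypothetical helper, the exception class is a local stand-in for the one in openstef_core, and the column names are made up for the example.

```python
import pandas as pd


class InsufficientlyCompleteError(Exception):
    """Local stand-in for the openstef_core exception of the same name."""


def inner_join_or_raise(forecasts: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: inner-join on the index and fail early if nothing is left."""
    # An inner join keeps only the timestamps present in both frames.
    combined = forecasts.join(features, how="inner")
    if combined.empty:
        raise InsufficientlyCompleteError("No overlapping timestamps after inner join.")
    return combined


forecasts = pd.DataFrame(
    {"lgbm": [1.0, 2.0]},
    index=pd.to_datetime(["2026-01-01 00:00", "2026-01-01 01:00"]),
)
# Features on a completely different datetime index: the join result is empty.
disjoint = pd.DataFrame(
    {"temperature": [5.0, 6.0]},
    index=pd.to_datetime(["2026-02-01 00:00", "2026-02-01 01:00"]),
)
# Features sharing one timestamp with the forecasts: the join keeps that row.
overlapping = pd.DataFrame(
    {"temperature": [5.0]},
    index=pd.to_datetime(["2026-01-01 01:00"]),
)

try:
    inner_join_or_raise(forecasts, disjoint)
    raised = False
except InsufficientlyCompleteError:
    raised = True

joined = inner_join_or_raise(forecasts, overlapping)
```

Raising a domain-specific error at the join gives the caller one well-defined exception to catch, rather than a generic ValueError from whatever consumes the empty frame later.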

Features

- nan_aware_weighted_mean helper: extracted reusable NaN-aware weighted mean with weight renormalization. Used by ensemble combiner.
  - File: packages/openstef-core/src/openstef_core/utils/math.py
  - Tests: packages/openstef-core/tests/test_math.py
- skip_analysis parameter for BenchmarkPipeline.run(): allows skipping the analysis phase during backtesting when only predictions are needed.
  - File: packages/openstef-beam/src/openstef_beam/benchmarking/benchmark_pipeline.py
- filterings override in AnalysisConfig: allows manual selection of which filterings (LeadTime, AvailableAt) are included in comparison analysis. Defaults to None (all subsets included, backward compatible).
  - File: packages/openstef-beam/src/openstef_beam/analysis/analysis_pipeline.py

Testing

All existing tests pass
Added test_quantile_serialization_no_warnings for fix 3 (the Pydantic serialization warning for Quantile)
Added tests for nan_aware_weighted_mean helper

…aining OpenSTEF4BacktestForecaster.fit() now catches InsufficientlyCompleteError alongside FlatlinerDetectedError. When a training window has insufficient non-NaN data, the training event is skipped and the previous model is retained instead of crashing the entire target backtest. Signed-off-by: Egor Dmitriev <[email protected]>

… test Use all-NaN load data with model_reuse_enable=False to trigger InsufficientlyCompleteError naturally instead of patching workflow.fit. Signed-off-by: Egor Dmitriev <[email protected]>

When the first fit fails due to InsufficientlyCompleteError, _workflow stays None. predict() now returns None (like flatliner) instead of raising NotFittedError, letting the benchmark pipeline skip gracefully. Signed-off-by: Egor Dmitriev <[email protected]>
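The skip-and-retain behaviour described in these commit messages can be sketched without any OpenSTEF code. Everything below is hypothetical: `BacktestForecasterSketch` and `MeanModel` are toy classes, and the two exception classes are local stand-ins for the ones named above.

```python
class InsufficientlyCompleteError(Exception):
    """Local stand-in for the openstef_core exception of the same name."""


class FlatlinerDetectedError(Exception):
    """Local stand-in for the flatliner exception mentioned above."""


class MeanModel:
    """Toy model: predicts the mean of the non-missing training values."""

    def __init__(self, values):
        clean = [v for v in values if v is not None]
        if not clean:
            raise InsufficientlyCompleteError("no usable training data")
        self.mean = sum(clean) / len(clean)


class BacktestForecasterSketch:
    """Hypothetical forecaster that survives failed training events."""

    def __init__(self):
        self._model = None  # stays None until a fit succeeds

    def fit(self, values):
        try:
            self._model = MeanModel(values)
        except (InsufficientlyCompleteError, FlatlinerDetectedError):
            # Skip this training event and keep whatever model we had before.
            pass

    def predict(self):
        if self._model is None:
            # Nothing was ever fitted: signal "no prediction" instead of raising.
            return None
        return self._model.mean


forecaster = BacktestForecasterSketch()
forecaster.fit([None, None])   # first fit fails, no model yet
first = forecaster.predict()
forecaster.fit([1.0, 3.0])     # successful fit
second = forecaster.predict()
forecaster.fit([None])         # later fit fails, previous model is retained
third = forecaster.predict()
```

Returning None from predict() keeps one unusable window from aborting the whole target, provided the calling pipeline treats None as "skip this window".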

Skip runs/targets with no windowed metrics instead of raising ValueError. Returns an HTML placeholder when all items in a visualization are empty. Signed-off-by: Egor Dmitriev <[email protected]>

… serialization Two fixes:
1. learned_weights_combiner.py: Filter labels to match combined_data index after inner join drops rows from additional_features. Fixes ValueError: operands could not be broadcast together.
2. types.py: Add Pydantic serializer to Quantile to suppress PydanticSerializationUnexpectedValue warnings.
Signed-off-by: Egor Dmitriev <[email protected]>
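The label-alignment fix can be illustrated with a small pandas sketch. It assumes timestamp-indexed frames and uses made-up column names and values. It is not the code from learned_weights_combiner.py.

```python
import pandas as pd

index = pd.to_datetime(["2026-01-01 00:00", "2026-01-01 01:00", "2026-01-01 02:00"])
base_forecasts = pd.DataFrame(
    {"lgbm": [10.0, 11.0, 12.0], "gblinear": [9.0, 10.5, 12.5]},
    index=index,
)
labels = pd.Series([10.2, 10.8, 12.1], index=index, name="load")

# The additional features miss the last timestamp, so the inner join drops a row.
additional_features = pd.DataFrame({"temperature": [4.0, 5.0]}, index=index[:2])
combined = base_forecasts.join(additional_features, how="inner")

# Align the labels to the surviving rows before any element-wise computation.
aligned_labels = labels.loc[combined.index]

# Per-model absolute errors now have matching shapes on both sides.
errors = combined[["lgbm", "gblinear"]].sub(aligned_labels, axis=0).abs()
```

Without the alignment step, `labels` would still have three rows while `combined` has two. That is the kind of mismatch that surfaces as a broadcast error once the data is converted to NumPy arrays.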

Quantile.__get_pydantic_core_schema__ only defined a validator but no serializer. When Quantile values appear as dict keys in a union type (e.g., QuantileOrGlobal = Quantile | Literal['global']), Pydantic emits PydanticSerializationUnexpectedValue warnings during model_dump_json(). Add a plain_serializer_function_ser_schema(float) so Pydantic knows how to serialize Quantile as a float, preventing the warning. Signed-off-by: Egor Dmitriev <[email protected]> Signed-off-by: Egor Dmitriev <[email protected]>
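A minimal sketch of the serializer change, assuming Pydantic v2 and assuming that Quantile is a float subclass (the real class in openstef_core may carry extra validation). `ForecastSettings` is a hypothetical model used only to exercise the schema.

```python
from typing import Any

from pydantic import BaseModel, GetCoreSchemaHandler
from pydantic_core import core_schema


class Quantile(float):
    """Assumed shape of the type: a float subclass with a custom core schema."""

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        return core_schema.no_info_after_validator_function(
            cls,  # validator: wrap the validated float in Quantile
            core_schema.float_schema(),
            # serializer: tell Pydantic to emit a plain float
            serialization=core_schema.plain_serializer_function_ser_schema(float),
        )


class ForecastSettings(BaseModel):
    """Hypothetical model containing a Quantile field."""

    quantile: Quantile


settings = ForecastSettings(quantile=0.5)
dumped = settings.model_dump()
dumped_json = settings.model_dump_json()
```

With an explicit serializer, Pydantic is told how to emit the value rather than having to infer it from a schema that only describes validation.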

…in train/test split chronological_train_test_split crashed with IndexError when the dataset had fewer than 2 unique timestamps. This happens during ensemble backtest when a base forecaster's preprocessed data is empty. Now raises InsufficientlyCompleteError which is caught by the backtest harness. Signed-off-by: Egor Dmitriev <[email protected]>
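The guard described here can be sketched in plain Python. `chronological_split_sketch` is a hypothetical function, not the signature of chronological_train_test_split, and the exception class is a local stand-in.

```python
from datetime import datetime, timedelta


class InsufficientlyCompleteError(Exception):
    """Local stand-in for the openstef_core exception of the same name."""


def chronological_split_sketch(timestamps, test_fraction=0.2):
    """Hypothetical split: the latest timestamps go to test, the rest to train."""
    unique = sorted(set(timestamps))
    if len(unique) < 2:
        # Without at least two distinct timestamps there is no split point,
        # so fail with a domain error instead of an IndexError.
        raise InsufficientlyCompleteError(
            f"Need at least 2 unique timestamps, got {len(unique)}."
        )
    # Keep at least one timestamp on each side of the split.
    n_test = min(len(unique) - 1, max(1, int(len(unique) * test_fraction)))
    split_at = unique[len(unique) - n_test]
    train = [t for t in timestamps if t < split_at]
    test = [t for t in timestamps if t >= split_at]
    return train, test


start = datetime(2026, 1, 1)
hourly = [start + timedelta(hours=i) for i in range(5)]
train, test = chronological_split_sketch(hourly)

try:
    chronological_split_sketch([start])
    raised = False
except InsufficientlyCompleteError:
    raised = True
```

Raising the same exception type that the backtest harness already catches means a near-empty dataset is handled like any other insufficient-data case.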

…scope The RUN_AND_GROUP scope was saving directly to the base analysis dir, making has_analysis_output fail to locate it and colliding with group-level outputs. Store in a 'global' subdirectory to match the group-level pattern (group_name/global). Signed-off-by: Egor Dmitriev <[email protected]>

…ons are NaN When a base model cannot predict certain timestamps (e.g. gblinear limited to 2-day weather horizon while lgbm predicts 7 days), the combiner must redistribute the missing model's weight proportionally to the remaining models. Previously, pandas sum(axis=1, skipna=True) silently dropped the NaN model's weight contribution, causing predictions to be systematically scaled down by ~35% for timestamps beyond the weather horizon. Now weights are reindexed to match predictions, zeroed where predictions are NaN, and the weighted sum is divided by the available weight total. When all models are NaN, the result is 0 (matching prior behavior). Includes regression test with seeded data verifying no NaN propagation and no systematic downscaling. Signed-off-by: Egor Dmitriev <[email protected]> Signed-off-by: Egor Dmitriev <[email protected]>
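The renormalization described above can be sketched for a single timestamp in plain Python. `nan_aware_weighted_mean_sketch` is a hypothetical function, not the helper added in the PR, and the weights 0.35 and 0.65 are assumed values chosen to reproduce a 35% downscaling.

```python
import math


def nan_aware_weighted_mean_sketch(predictions, weights):
    """Weighted mean over non-NaN predictions, renormalized by the available weight."""
    weighted_sum = 0.0
    available_weight = 0.0
    for prediction, weight in zip(predictions, weights):
        if math.isnan(prediction):
            continue  # a missing model contributes neither value nor weight
        weighted_sum += weight * prediction
        available_weight += weight
    if available_weight == 0.0:
        return 0.0  # all models missing: return 0, as the commit message describes
    return weighted_sum / available_weight


# Assumed weights: gblinear 0.35, lgbm 0.65. Beyond the weather horizon
# gblinear has no prediction, while lgbm predicts 100.0.
weights = [0.35, 0.65]
predictions = [float("nan"), 100.0]

# Old behaviour: skipping NaN in the sum silently drops gblinear's weight.
naive = sum(w * p for p, w in zip(predictions, weights) if not math.isnan(p))

# New behaviour: divide by the weight that was actually available.
renormalized = nan_aware_weighted_mean_sketch(predictions, weights)
```

Here `naive` comes out at roughly 65, which is 35% below the only available prediction, while `renormalized` recovers roughly 100.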

Extract NaN-aware weight renormalization into a reusable helper in openstef_core.utils.pandas and use it in learned_weights_combiner. Removes type: ignore comments from _predict_quantile. Signed-off-by: Egor Dmitriev <[email protected]>

Allows skipping per-target and global analysis steps when running benchmarks. Useful when analysis will be run separately later via the comparison pipeline. Signed-off-by: Egor Dmitriev <[email protected]>

sonarqubecloud · 2026-03-19T09:36:48Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

lschilders

LGTM! (mind the Sonarcloud error in one of the files)

egordm requested a review from a team March 13, 2026 11:05

github-actions bot added the fix Something isn't working label Mar 13, 2026

egordm added 3 commits March 13, 2026 12:26

test(STEF-2854): replace mock with real NaN data in insufficient-data…

9f830e9

fix(STEF-2854): make WindowedMetricVisualization robust to missing data

1ab0eef

egordm changed the title ~~fix(STEF-2854): handle InsufficientlyCompleteError during backtest training~~ fix(STEF-2854): handle backtest robustness issues Mar 13, 2026

egordm added 3 commits March 13, 2026 16:42

Merge branch 'release/v4.0.0' into fix/STEF-2854-handle-insufficient-…

dc170b9

…training-data

egordm mentioned this pull request Mar 18, 2026

fix(STEF-2854): add Pydantic serializer to Quantile to suppress warnings #838

Closed

egordm force-pushed the fix/STEF-2854-handle-insufficient-training-data branch 2 times, most recently from 2b726c0 to e1cd1b2 Compare March 18, 2026 10:21

fix(STEF-2854): raise InsufficientlyCompleteError on empty combiner d…

625abfe

…ata after inner join Signed-off-by: Egor Dmitriev <[email protected]>

egordm force-pushed the fix/STEF-2854-handle-insufficient-training-data branch from e1cd1b2 to 625abfe Compare March 18, 2026 10:24

egordm added 3 commits March 18, 2026 11:52

feat(STEF-2854): add strict parameter to BenchmarkComparisonPipeline.…

4d00735

…run() Signed-off-by: Egor Dmitriev <[email protected]>

egordm force-pushed the fix/STEF-2854-handle-insufficient-training-data branch from d926bd6 to d2f64f4 Compare March 18, 2026 16:43

egordm added 5 commits March 18, 2026 20:45

feat(STEF-2854): add skip_analysis param to BenchmarkPipeline.run()

aacbdde

feat(STEF-2854): add filterings override to AnalysisConfig

9777e0e

Signed-off-by: Egor Dmitriev <[email protected]>

fix(STEF-2854): resolve ruff lint warnings

96e7015

Signed-off-by: Egor Dmitriev <[email protected]>

fix(STEF-2854): resolve pyright type errors in modified files

6618b4d

Signed-off-by: Egor Dmitriev <[email protected]>

lschilders approved these changes Mar 19, 2026

egordm merged commit 7cf3859 into release/v4.0.0 Mar 19, 2026
4 checks passed

egordm deleted the fix/STEF-2854-handle-insufficient-training-data branch March 19, 2026 10:23

egordm merged 17 commits into release/v4.0.0 from fix/STEF-2854-handle-insufficient-training-data
