fix(STEF-2854): handle backtest robustness issues #837
egordm merged 17 commits into release/v4.0.0 on Mar 19, 2026
…aining OpenSTEF4BacktestForecaster.fit() now catches InsufficientlyCompleteError alongside FlatlinerDetectedError. When a training window has insufficient non-NaN data, the training event is skipped and the previous model is retained instead of crashing the entire target backtest. Signed-off-by: Egor Dmitriev <[email protected]>
… test Use all-NaN load data with model_reuse_enable=False to trigger InsufficientlyCompleteError naturally instead of patching workflow.fit. Signed-off-by: Egor Dmitriev <[email protected]>
When the first fit fails due to InsufficientlyCompleteError, _workflow stays None. predict() now returns None (like flatliner) instead of raising NotFittedError, letting the benchmark pipeline skip gracefully. Signed-off-by: Egor Dmitriev <[email protected]>
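The skip-and-retain behaviour described in the three commits above can be sketched roughly as follows. This is a minimal illustration, not the actual OpenSTEF4BacktestForecaster code: the exception classes are local stand-ins, and `SketchBacktestForecaster` and `train_mean` are hypothetical names invented for the example.

```python
from typing import Callable, Optional, Sequence


class InsufficientlyCompleteError(Exception):
    """Local stand-in for the openstef exception of the same name."""


class FlatlinerDetectedError(Exception):
    """Local stand-in for the openstef exception of the same name."""


class SketchBacktestForecaster:
    """Hypothetical harness that keeps the last good model when a fit fails."""

    def __init__(self, train: Callable[[Sequence[float]], Callable[[], float]]):
        self._train = train
        self._workflow: Optional[Callable[[], float]] = None

    def fit(self, window: Sequence[float]) -> None:
        try:
            self._workflow = self._train(window)
        except (InsufficientlyCompleteError, FlatlinerDetectedError):
            # Skip this training event; the previous model (if any) is retained.
            pass

    def predict(self) -> Optional[float]:
        # No successful fit yet: return None so the caller can skip gracefully.
        if self._workflow is None:
            return None
        return self._workflow()


def train_mean(window: Sequence[float]) -> Callable[[], float]:
    """Toy trainer (assumption): predicts the mean of the non-NaN values."""
    values = [v for v in window if v == v]  # NaN is the only value != itself
    if not values:
        raise InsufficientlyCompleteError("training window has no non-NaN data")
    mean = sum(values) / len(values)
    return lambda: mean
```

With this shape, an all-NaN first window leaves `predict()` returning `None`, and a later all-NaN window leaves the previously trained model in place.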
Skip runs/targets with no windowed metrics instead of raising ValueError. Returns an HTML placeholder when all items in a visualization are empty. Signed-off-by: Egor Dmitriev <[email protected]>
… serialization
Two fixes:
1. learned_weights_combiner.py: Filter labels to match combined_data index after inner join drops rows from additional_features. Fixes ValueError: operands could not be broadcast together.
2. types.py: Add Pydantic serializer to Quantile to suppress PydanticSerializationUnexpectedValue warnings.
Signed-off-by: Egor Dmitriev <[email protected]>
Quantile.__get_pydantic_core_schema__ only defined a validator but no serializer. When Quantile values appear as dict keys in a union type (e.g., QuantileOrGlobal = Quantile | Literal['global']), Pydantic emits PydanticSerializationUnexpectedValue warnings during model_dump_json(). Add a plain_serializer_function_ser_schema(float) so Pydantic knows how to serialize Quantile as a float, preventing the warning. Signed-off-by: Egor Dmitriev <[email protected]> Signed-off-by: Egor Dmitriev <[email protected]>
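A minimal sketch of a float subclass whose core schema carries both a validator and a serializer, assuming Pydantic v2. The `Quantile` class here is a simplified stand-in rather than the one in types.py, the [0, 1] bounds are an assumption, and `QuantileHolder` is a hypothetical model used only to exercise serialization.

```python
from typing import Any

from pydantic import BaseModel, GetCoreSchemaHandler
from pydantic_core import core_schema


class Quantile(float):
    """Simplified stand-in: a float that validates and serializes as a float."""

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        return core_schema.no_info_after_validator_function(
            cls,  # wrap the validated float in Quantile
            core_schema.float_schema(ge=0.0, le=1.0),  # bounds are an assumption
            # Tell Pydantic how to serialize: convert back to a plain float.
            serialization=core_schema.plain_serializer_function_ser_schema(float),
        )


class QuantileHolder(BaseModel):
    """Hypothetical model used only to demonstrate serialization."""

    quantile: Quantile


holder = QuantileHolder(quantile=0.5)
payload = holder.model_dump_json()
```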
…in train/test split chronological_train_test_split crashed with IndexError when the dataset had fewer than 2 unique timestamps. This happens during ensemble backtest when a base forecaster's preprocessed data is empty. Now raises InsufficientlyCompleteError which is caught by the backtest harness. Signed-off-by: Egor Dmitriev <[email protected]>
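The guard described above might look roughly like this. It is a pure-Python sketch under stated assumptions: `chronological_split_sketch` and its `test_fraction` parameter are hypothetical, the exception is a local stand-in, and the real function in data_split.py may split differently.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # any orderable, hashable timestamp type (datetime, int, ...)


class InsufficientlyCompleteError(ValueError):
    """Local stand-in for the openstef exception of the same name."""


def chronological_split_sketch(
    timestamps: Sequence[T], test_fraction: float = 0.2
) -> Tuple[List[T], List[T]]:
    """Split timestamps so that every test timestamp is later than all train ones."""
    unique = sorted(set(timestamps))
    if len(unique) < 2:
        # Without at least two distinct timestamps there is no valid boundary;
        # raise a domain error instead of failing later with IndexError.
        raise InsufficientlyCompleteError(
            f"need at least 2 unique timestamps, got {len(unique)}"
        )
    # Position of the first test timestamp, clamped so both sides are non-empty.
    cut = min(max(int(len(unique) * (1 - test_fraction)), 1), len(unique) - 1)
    boundary = unique[cut]
    train = [t for t in timestamps if t < boundary]
    test = [t for t in timestamps if t >= boundary]
    return train, test
```

Raising a domain-specific error here lets a caller that already handles InsufficientlyCompleteError skip the training event rather than crash.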
2b726c0 to e1cd1b2
…ata after inner join Signed-off-by: Egor Dmitriev <[email protected]>
e1cd1b2 to 625abfe
…run() Signed-off-by: Egor Dmitriev <[email protected]>
…scope The RUN_AND_GROUP scope was saving directly to the base analysis dir, making has_analysis_output fail to locate it and colliding with group-level outputs. Store in a 'global' subdirectory to match the group-level pattern (group_name/global). Signed-off-by: Egor Dmitriev <[email protected]>
…ons are NaN When a base model cannot predict certain timestamps (e.g. gblinear limited to 2-day weather horizon while lgbm predicts 7 days), the combiner must redistribute the missing model's weight proportionally to the remaining models. Previously, pandas sum(axis=1, skipna=True) silently dropped the NaN model's weight contribution, causing predictions to be systematically scaled down by ~35% for timestamps beyond the weather horizon. Now weights are reindexed to match predictions, zeroed where predictions are NaN, and the weighted sum is divided by the available weight total. When all models are NaN, the result is 0 (matching prior behavior). Includes regression test with seeded data verifying no NaN propagation and no systematic downscaling. Signed-off-by: Egor Dmitriev <[email protected]> Signed-off-by: Egor Dmitriev <[email protected]>
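The renormalization described in this commit can be sketched with pandas as below. This is an illustration of the idea, not the helper that was merged: the function name `nan_aware_weighted_mean_sketch`, the column names, and the 0.65/0.35 weights are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def nan_aware_weighted_mean_sketch(
    predictions: pd.DataFrame, weights: pd.DataFrame
) -> pd.Series:
    """Weighted mean per row that redistributes the weight of NaN predictions."""
    # Align weights to the predictions' index and columns.
    aligned = weights.reindex(index=predictions.index, columns=predictions.columns)
    # A model without a prediction contributes no weight at that timestamp.
    effective = aligned.where(predictions.notna(), 0.0).fillna(0.0)
    weighted_sum = (predictions.fillna(0.0) * effective).sum(axis=1)
    available = effective.sum(axis=1)
    # Divide by the weight that is actually available; rows where every
    # model is NaN have zero available weight and resolve to 0.
    return (weighted_sum / available.where(available > 0)).fillna(0.0)


# Made-up example: gblinear stops predicting after the first timestamp.
predictions = pd.DataFrame(
    {"lgbm": [10.0, 10.0, np.nan], "gblinear": [20.0, np.nan, np.nan]}
)
weights = pd.DataFrame({"lgbm": [0.65] * 3, "gblinear": [0.35] * 3})
combined = nan_aware_weighted_mean_sketch(predictions, weights)
naive = (predictions * weights).sum(axis=1, skipna=True)
```

In this made-up example the naive sum at the second timestamp is 6.5 instead of 10, because the missing model's 0.35 weight is dropped rather than redistributed; the renormalized version returns 10.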
d926bd6 to d2f64f4
Extract NaN-aware weight renormalization into a reusable helper in openstef_core.utils.pandas and use it in learned_weights_combiner. Removes type: ignore comments from _predict_quantile. Signed-off-by: Egor Dmitriev <[email protected]>
Allows skipping per-target and global analysis steps when running benchmarks. Useful when analysis will be run separately later via the comparison pipeline. Signed-off-by: Egor Dmitriev <[email protected]>
Signed-off-by: Egor Dmitriev <[email protected]>
Signed-off-by: Egor Dmitriev <[email protected]>
Signed-off-by: Egor Dmitriev <[email protected]>
lschilders (Collaborator) approved these changes on Mar 19, 2026
LGTM! (mind the Sonarcloud error in one of the files)



Summary
Robustness improvements discovered during STEF-2854 model comparison backtest (ensemble benchmarking on STEF50 dataset).
Fixes
1. `InsufficientlyCompleteError` raised during `WindowedMetricVisualization.format_html()`: `metric_line` can be shorter than `reference_line` when the metric time series is incomplete.
   `packages/openstef-beam/src/openstef_beam/evaluation/visualization/windowed_metric_visualization.py`
2. Too strict data validation in `InsufficientlyCompleteError`: the error message included the number of unique hours, but the check should only verify that the data is not empty.
   `packages/openstef-core/src/openstef_core/exceptions.py`
3. Pydantic serialization warning for `Quantile` type: `Quantile.__get_pydantic_core_schema__` lacked a serializer, causing `UserWarning: Pydantic serializer warnings` when serializing models containing `Quantile` fields.
   `packages/openstef-core/src/openstef_core/types.py`
4. Broadcast shape mismatch in `LearnedWeightsCombiner.fit()`: `combine_forecast_input_datasets` uses an inner join that can drop rows, but `labels` was computed before the join, causing a shape mismatch.
   `packages/openstef-meta/src/openstef_meta/models/forecast_combiners/learned_weights_combiner.py`
5. `IndexError` in `chronological_train_test_split` with near-empty datasets: when a dataset has fewer than 2 unique timestamps, the split function crashes with `index 1 is out of bounds`.
   `packages/openstef-models/src/openstef_models/utils/data_split.py`
6. Empty data after inner join in combiner crashes downstream: when the `_prepare_input_data` inner join produces empty data (additional features have a different datetime index), the combiner's predict/score path crashes with `ValueError: Input data must be 2 dimensional and non empty` or `ValueError: Input contains NaN`. `_prepare_input_data` now validates that the result isn't empty after the inner join and raises `InsufficientlyCompleteError`, which propagates up and is caught by the backtest harness, retaining the previous model.
   `packages/openstef-meta/src/openstef_meta/models/forecast_combiners/learned_weights_combiner.py`
Features
- `nan_aware_weighted_mean` helper: extracted reusable NaN-aware weighted mean with weight renormalization. Used by ensemble combiner.
  `packages/openstef-core/src/openstef_core/utils/math.py`
  `packages/openstef-core/tests/test_math.py`
- `skip_analysis` parameter for `BenchmarkPipeline.run()`: allows skipping the analysis phase during backtesting when only predictions are needed.
  `packages/openstef-beam/src/openstef_beam/benchmarking/benchmark_pipeline.py`
- `filterings` override in `AnalysisConfig`: allows manual selection of which filterings (LeadTime, AvailableAt) are included in comparison analysis. Defaults to `None` (all subsets included, backward compatible).
  `packages/openstef-beam/src/openstef_beam/analysis/analysis_pipeline.py`
Testing
- `test_quantile_serialization_no_warnings` for fix 3
- `nan_aware_weighted_mean` helper
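
Fixes 4 and 6 above both come down to what an inner join does to the row index. The sketch below shows one way to keep labels aligned with the joined frame and to fail early when the join leaves nothing. `align_labels_after_join` is a hypothetical helper, the exception is a local stand-in, and the combiner's real data structures are not reproduced here.

```python
import pandas as pd


class InsufficientlyCompleteError(ValueError):
    """Local stand-in for the openstef exception of the same name."""


def align_labels_after_join(
    base: pd.DataFrame, additional: pd.DataFrame, labels: pd.Series
):
    """Inner-join two feature frames and keep only the labels that survive."""
    combined = base.join(additional, how="inner")
    if combined.empty:
        # No overlapping timestamps: raise a domain error here rather than
        # letting an empty frame reach the model.
        raise InsufficientlyCompleteError("inner join left no rows")
    # Select labels by the joined index so shapes match row for row.
    return combined, labels.loc[combined.index]


# Made-up example: the additional features cover only part of the index.
index = pd.date_range("2026-01-01", periods=4, freq="h")
base = pd.DataFrame({"forecast_a": [1.0, 2.0, 3.0, 4.0]}, index=index)
additional = pd.DataFrame({"temperature": [5.0, 6.0]}, index=index[1:3])
labels = pd.Series([0, 1, 0, 1], index=index)
combined, aligned_labels = align_labels_after_join(base, additional, labels)
```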