
PyGrinder block and sequence missing algorithms are not reaching the correct percentage of missing values #542

Open
giacomoguiduzzi opened this issue Nov 9, 2024 · 5 comments
Labels
question Further information is requested stale

Comments

@giacomoguiduzzi

Issue description

Greetings,

I'm working on a project on forecasting time series with deep learning methods, and I have a quick question about sequence missing and block missing from PyGrinder: when I set a replace_pct value of 0.5 I don't actually get around 50% missing values, but about 39%. If I raise the value to 0.75 I get around 50%. Is this normal? Am I missing something?
Let me know if there is any additional information I can give you about this behaviour.
Thanks in advance, I'm looking forward to your kind response.
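One way to check the rate one actually obtains is to apply the masking and then measure the NaN fraction directly. Below is a minimal, self-contained toy sequence-masker for illustration only (it is NOT PyGrinder's actual seq_missing implementation): it NaNs out random contiguous windows along the time axis until a target fraction of values is missing, then reports the achieved rate.

```python
import numpy as np

def seq_mask(X: np.ndarray, p: float, seq_len: int, seed: int = 0) -> np.ndarray:
    """Toy sequence-missing sketch (NOT PyGrinder's algorithm):
    NaN-out random contiguous windows of length `seq_len` along the
    time axis until at least a fraction `p` of all values is missing."""
    rng = np.random.default_rng(seed)
    out = X.astype(float)
    n_samples, n_steps, n_feats = out.shape
    target = p * out.size
    while np.isnan(out).sum() < target:
        i = rng.integers(n_samples)
        t = rng.integers(n_steps - seq_len + 1)
        f = rng.integers(n_feats)
        out[i, t:t + seq_len, f] = np.nan
    return out

X = np.random.default_rng(1).normal(size=(200, 96, 5))
X_miss = seq_mask(X, p=0.5, seq_len=12)
rate = np.isnan(X_miss).mean()
print(f"achieved missing rate: {rate:.4f}")  # at or just above 0.5
```

Because the loop stops as soon as the target is crossed, the achieved rate overshoots by at most one window's worth of values, so it stays very close to `p`.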

Best Regards,
Giacomo Guiduzzi

@giacomoguiduzzi giacomoguiduzzi added the question Further information is requested label Nov 9, 2024

This issue had no activity for 14 days. It will be closed in 1 week unless there is some new activity. Is this issue already resolved?

@github-actions github-actions bot added the stale label Nov 24, 2024
@LinglongQian
Contributor

Dear Giacomo Guiduzzi,

Thank you for reaching out and sharing your observations about sequence-missing and block-missing behaviour in PyGrinder. The behaviour you’ve described could be due to an interaction between the existing missing data in your dataset and the additional missingness introduced.

If your dataset already contains missing values, the newly introduced missing values will partly land on positions that are already missing. Because of this overlap, the observed missing rate can end up lower than the specified value. The effect is particularly noticeable when few completely observed sequences or blocks are available to begin with.
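A minimal numpy sketch of this blending effect (using independent uniform masking for simplicity, not PyGrinder's block/sequence algorithm): when a fraction r0 is already missing, an additional independent mask of fraction p contributes only p·(1 − r0) new missing values, so the combined rate is 1 − (1 − r0)(1 − p) rather than the naive r0 + p.

```python
import numpy as np

rng = np.random.default_rng(42)
r0, p = 0.041, 0.5                      # pre-existing rate, requested additional rate

X = rng.normal(size=(7152, 96, 5))
X[rng.random(X.shape) < r0] = np.nan    # pre-existing missingness
X[rng.random(X.shape) < p] = np.nan     # additional mask: partly hits existing NaNs

observed = np.isnan(X).mean()
expected = 1 - (1 - r0) * (1 - p)       # ~0.5205, not r0 + p = 0.541
print(f"observed={observed:.4f}  expected={expected:.4f}")
```

The shape 7152x96x5 mirrors the dataset discussed in this thread; with ~3.4 million values the observed rate matches the closed-form expectation very closely.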

Please let me know if this explanation aligns with your situation, or feel free to provide more details about your dataset or experimental setup, and I’d be happy to assist further.

Best regards,
linglong

@github-actions github-actions bot removed the stale label Nov 29, 2024

This issue had no activity for 14 days. It will be closed in 1 week unless there is some new activity. Is this issue already resolved?

@github-actions github-actions bot added the stale label Dec 13, 2024
@giacomoguiduzzi
Author

Hi @LinglongQian,

Thank you for your answer and your support on this issue. Sorry for the late reply; it has been a busy period.

I tested the missing values ratio in various scenarios, both with and without pre-existing missing values. My dataset has shape 7152x96x5, dtype float32, and an original MVR (missing values ratio) of 0.041. The MVR I'd like to reach is 0.5.

With block missing and a factor of 0.2 I get to 0.4562, while with a factor of 0.5 I get 0.4556. For some reason a higher factor lowers the output MVR, and I don't understand why. I then filled all the NaN values in my dataset with np.nan_to_num() and applied the same logic: the resulting MVR is 0.4559 with a factor of 0.2, but it jumps as high as 0.7408 with a factor of 0.5, which I also don't understand. Moreover, while performing these operations I see the following messages in the console:

[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=26.666666666666668 > 1

So apparently this happens even when the original MVR is 0.0 (I double checked after filling in the missing values).

For clarity, here is how I compute the MVR:

from typing import Union

import numpy as np
import pandas as pd
from darts import TimeSeries


def missing_values_ratio(
    data: Union[pd.DataFrame, pd.Series, TimeSeries, np.ndarray]
) -> float:
    """
    Computes the ratio of missing values.
    Copied from darts' code and extended to support numpy arrays.

    Parameters
    ----------
    data
        The DataFrame, Series, TimeSeries or numpy array to compute the ratio on

    Returns
    -------
    float
        The ratio of missing values
    """
    if isinstance(data, pd.DataFrame):
        # Per-column missing counts, averaged across columns, over the row count
        return data.isnull().sum().mean() / len(data)
    elif isinstance(data, pd.Series):
        return data.isnull().sum() / len(data)
    elif isinstance(data, TimeSeries):
        return data.pd_dataframe().isnull().sum().mean() / len(data)
    elif isinstance(data, np.ndarray):
        # Overall fraction of NaN entries across all dimensions
        return np.isnan(data).sum() / data.size
    else:
        raise ValueError(f"The data type {type(data).__name__} is not supported.")
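As a quick sanity check of the ndarray branch, the NaN fraction can be verified on a toy array with a known number of NaNs (a standalone snippet, independent of the function above):

```python
import numpy as np

data = np.ones((4, 3, 5), dtype=np.float32)  # 60 values in total
data[0, 0, :] = np.nan                       # 5 NaNs
data[1, :, 0] = np.nan                       # 3 more NaNs, no overlap

mvr = np.isnan(data).sum() / data.size
print(mvr)  # 8 / 60 = 0.1333...
```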

Let me know what you think about this situation and if there is anything more I can do to help you identify the issue.
Thank you in advance, I'm looking forward to your kind response.

@github-actions github-actions bot removed the stale label Dec 18, 2024

github-actions bot commented Jan 1, 2025

This issue had no activity for 14 days. It will be closed in 1 week unless there is some new activity. Is this issue already resolved?

@github-actions github-actions bot added the stale label Jan 1, 2025