
PyGrinder block and sequence missing algorithms are not reaching the correct percentage of missing values #542

Open
giacomoguiduzzi opened this issue Nov 9, 2024 · 5 comments
Labels
question Further information is requested stale

Comments

@giacomoguiduzzi

Issue description

Greetings,

I'm working on a project on forecasting time series with deep learning methods, and I have a quick question about sequence missing and block missing from PyGrinder: when I set a replace_pct value of 0.5 I don't actually get around 50% missing values, but about 39%. If I raise the value to 0.75 I get around 50%. Is this normal? Am I missing something?
Let me know if there is any additional information I can give you about this behaviour.
Thanks in advance, I'm looking forward to your kind response.
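One way to check the rate one actually obtains is to apply the masking and then measure the NaN fraction directly. Below is a minimal, self-contained toy sequence-masker for illustration only (it is NOT PyGrinder's actual seq_missing implementation): it NaNs out random contiguous windows along the time axis until a target fraction of values is missing, then reports the achieved rate.

```python
import numpy as np

def seq_mask(X: np.ndarray, p: float, seq_len: int, seed: int = 0) -> np.ndarray:
    """Toy sequence-missing sketch (NOT PyGrinder's algorithm):
    NaN-out random contiguous windows of length `seq_len` along the
    time axis until at least a fraction `p` of all values is missing."""
    rng = np.random.default_rng(seed)
    out = X.astype(float)
    n_samples, n_steps, n_feats = out.shape
    target = p * out.size
    while np.isnan(out).sum() < target:
        i = rng.integers(n_samples)
        t = rng.integers(n_steps - seq_len + 1)
        f = rng.integers(n_feats)
        out[i, t:t + seq_len, f] = np.nan
    return out

X = np.random.default_rng(1).normal(size=(200, 96, 5))
X_miss = seq_mask(X, p=0.5, seq_len=12)
rate = np.isnan(X_miss).mean()
print(f"achieved missing rate: {rate:.4f}")  # at or just above 0.5
```

Because the loop stops as soon as the target is crossed, the achieved rate overshoots by at most one window's worth of values, so it stays very close to `p`.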

Best Regards,
Giacomo Guiduzzi

@giacomoguiduzzi giacomoguiduzzi added the question Further information is requested label Nov 9, 2024

This issue had no activity for 14 days. It will be closed in 1 week unless there is some new activity. Is this issue already resolved?

@github-actions github-actions bot added the stale label Nov 24, 2024
@LinglongQian
Contributor

Dear Giacomo Guiduzzi,

Thank you for reaching out and sharing your observations about sequence-missing and block-missing behaviour in PyGrinder. The behaviour you’ve described could be due to an interaction between the existing missing data in your dataset and the additional missingness introduced.

If your dataset already contains missing values, the newly introduced missing values will partly land on positions that are already missing. Because of this overlap, the observed missing rate can end up lower than the specified value. The effect is particularly noticeable when few completely observed sequences or blocks are available to begin with.
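A minimal numpy sketch of this blending effect (using independent uniform masking for simplicity, not PyGrinder's block/sequence algorithm): when a fraction r0 is already missing, an additional independent mask of fraction p contributes only p·(1 − r0) new missing values, so the combined rate is 1 − (1 − r0)(1 − p) rather than the naive r0 + p.

```python
import numpy as np

rng = np.random.default_rng(42)
r0, p = 0.041, 0.5                      # pre-existing rate, requested additional rate

X = rng.normal(size=(7152, 96, 5))
X[rng.random(X.shape) < r0] = np.nan    # pre-existing missingness
X[rng.random(X.shape) < p] = np.nan     # additional mask: partly hits existing NaNs

observed = np.isnan(X).mean()
expected = 1 - (1 - r0) * (1 - p)       # ~0.5205, not r0 + p = 0.541
print(f"observed={observed:.4f}  expected={expected:.4f}")
```

The shape 7152x96x5 mirrors the dataset discussed in this thread; with ~3.4 million values the observed rate matches the closed-form expectation very closely.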

Please let me know if this explanation aligns with your situation, or feel free to provide more details about your dataset or experimental setup, and I’d be happy to assist further.

Best regards,
linglong

@github-actions github-actions bot removed the stale label Nov 29, 2024

This issue had no activity for 14 days. It will be closed in 1 week unless there is some new activity. Is this issue already resolved?

@github-actions github-actions bot added the stale label Dec 13, 2024
@giacomoguiduzzi
Author

Hi @LinglongQian,

Thank you for your answer and your support on this issue. Sorry for the late reply; it has been a busy period.

I tested the missing values ratio in various scenarios, both with and without pre-existing missing values. My dataset has shape 7152x96x5, dtype float32, and an original MVR (missing values ratio) of 0.041. The MVR I'd like to reach is 0.5.

With block missing and a factor of 0.2 I get to 0.4562, while with a factor of 0.5 I get 0.4556. For some reason a higher factor lowers the output MVR, and I don't understand why. I then filled all the NaN values in my dataset with np.nan_to_num() and applied the same logic: the resulting MVR is 0.4559 with a factor of 0.2, but it jumps as high as 0.7408 with a factor of 0.5, which I also don't understand. Moreover, while performing these operations I see the following messages in the console:

[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=10.666666666666668 > 1
[WARNING]: hit_rate=26.666666666666668 > 1

So apparently this happens even when the original MVR is 0.0 (I double checked after filling in the missing values).

For clarity, here is how I compute the MVR:

from typing import Union

import numpy as np
import pandas as pd
from darts import TimeSeries


def missing_values_ratio(
    data: Union[pd.DataFrame, pd.Series, TimeSeries, np.ndarray]
) -> float:
    """
    Computes the ratio of missing values.
    Copied from darts' code and extended to support numpy arrays.

    Parameters
    ----------
    data
        The DataFrame, Series, TimeSeries or numpy array to compute the ratio on

    Returns
    -------
    float
        The ratio of missing values
    """
    if isinstance(data, pd.DataFrame):
        # Per-column missing counts, averaged across columns, over the row count
        return data.isnull().sum().mean() / len(data)
    elif isinstance(data, pd.Series):
        return data.isnull().sum() / len(data)
    elif isinstance(data, TimeSeries):
        return data.pd_dataframe().isnull().sum().mean() / len(data)
    elif isinstance(data, np.ndarray):
        # Overall fraction of NaN entries across all dimensions
        return np.isnan(data).sum() / data.size
    else:
        raise ValueError(f"The data type {type(data).__name__} is not supported.")
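As a quick sanity check of the ndarray branch, the NaN fraction can be verified on a toy array with a known number of NaNs (a standalone snippet, independent of the function above):

```python
import numpy as np

data = np.ones((4, 3, 5), dtype=np.float32)  # 60 values in total
data[0, 0, :] = np.nan                       # 5 NaNs
data[1, :, 0] = np.nan                       # 3 more NaNs, no overlap

mvr = np.isnan(data).sum() / data.size
print(mvr)  # 8 / 60 = 0.1333...
```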

Let me know what you think about this situation and if there is anything more I can do to help you identify the issue.
Thank you in advance, I'm looking forward to your kind response.

@github-actions github-actions bot removed the stale label Dec 18, 2024

github-actions bot commented Jan 1, 2025

This issue had no activity for 14 days. It will be closed in 1 week unless there is some new activity. Is this issue already resolved?

@github-actions github-actions bot added the stale label Jan 1, 2025