Duplicate observations #2

Open
@mariusgiger

Hi,

I recently came across a potential data quality issue: the SDO ML v2 dataset appears to contain duplicate observations, i.e. the same observation time occurs more than once within a channel.

For example, the following observation times are duplicated in 171A (fdl-sdoml-v2/sdomlv2.zarr/2020/171A):

{
  "171A": [
    "2019-06-21T00:00:10.35Z",
    "2019-06-21T00:06:10.35Z",
    "2019-06-21T00:12:10.35Z",
    "2019-06-21T00:18:10.35Z",
    "2019-06-21T00:24:10.35Z",
    "2019-06-21T00:30:10.35Z",
    "2019-06-21T00:36:10.35Z",
    "2019-06-21T00:42:10.35Z",
    "2019-06-21T00:48:10.34Z",
....
]
}

Given a PyTorch DataLoader (an example can be found here), the issue can be reproduced as follows:

from sdo.sood.data.sdo_ml_v2_dataset import SDOMLv2NumpyDataset, get_default_transforms
from torch.utils.data import DataLoader

storage_root = "/data/sdomlv2_full/sdomlv2.zarr"
storage_driver = "fs"
year = None
channel = "171A"
cache_max_size = 2 * 1024 * 1024 * 1024  # 2 GiB
target_size = 512
transforms = get_default_transforms(target_size=target_size, channel=channel)

dataset = SDOMLv2NumpyDataset(
    storage_root=storage_root,
    storage_driver=storage_driver,
    cache_max_size=cache_max_size,
    year=year,
    channel=channel,
    transforms=transforms,
    start=None,
    end=None,
    freq=None,
    irradiance=None,
    irradiance_channel=None,
    goes_cache_dir=None,
    reduce_memory=True,
    obs_times=None,
)

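loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=False,
    num_workers=16,
    prefetch_factor=2,
)

# Collect every T_OBS value that has already been seen at least once.
# set.add() returns None, so the `or` branch only records x and is falsy.
seen = set()
duplicates = [x for x in loader.dataset.attrs["T_OBS"] if x in seen or seen.add(x)]
duplicates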
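
To see where the repeats sit in the array, and not only which values repeat, the check above could be extended to record positions. This is a minimal sketch, assuming `T_OBS` is a flat sequence of timestamp strings; `find_duplicate_positions` is a hypothetical helper and the sample input is made up for illustration, not read from the dataset.

```python
from collections import defaultdict


def find_duplicate_positions(t_obs):
    """Map each timestamp that occurs more than once to the list of its indices.

    Hypothetical helper: assumes t_obs is a flat sequence of hashable values.
    """
    positions = defaultdict(list)
    for idx, t in enumerate(t_obs):
        positions[t].append(idx)
    # Keep only timestamps that appear at two or more positions.
    return {t: idxs for t, idxs in positions.items() if len(idxs) > 1}


# Illustrative input only; the ordering here is invented for the example.
sample = [
    "2019-06-21T00:00:10.35Z",
    "2019-06-21T00:06:10.35Z",
    "2019-06-21T00:00:10.35Z",
]
print(find_duplicate_positions(sample))
```

With the real data, the call would presumably be `find_duplicate_positions(loader.dataset.attrs["T_OBS"])`. Knowing the indices makes it possible to compare the images behind each repeated timestamp.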

Please check whether these observations need to be removed.
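
If removal turns out to be the right fix, one option would be to keep only the first occurrence of each timestamp. The sketch below is an assumption about how that could look, not something taken from the dataset tooling; `first_occurrence_indices` is a hypothetical name. It only looks at timestamps, so it does not confirm that the image data behind repeated timestamps is actually identical.

```python
def first_occurrence_indices(t_obs):
    """Return the indices of the first occurrence of each distinct timestamp.

    Hypothetical helper: preserves the original order of t_obs.
    """
    seen = set()
    keep = []
    for idx, t in enumerate(t_obs):
        if t not in seen:
            seen.add(t)
            keep.append(idx)
    return keep


# Illustrative input only: "b" and "a" each appear twice.
print(first_occurrence_indices(["b", "a", "b", "c", "a"]))
```

The returned indices could then be used to select the rows to keep from both the image array and the metadata attributes, so the two stay aligned.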

Cheers,
Marius
