Hi,
I recently stumbled across a potential data quality issue: there appear to be duplicate observations in the SDO ML v2 dataset.
For example, duplicates occur for the following times in the 171A channel (fdl-sdoml-v2/sdomlv2.zarr/2020/171A):
{
"171A": [
"2019-06-21T00:00:10.35Z",
"2019-06-21T00:06:10.35Z",
"2019-06-21T00:12:10.35Z",
"2019-06-21T00:18:10.35Z",
"2019-06-21T00:24:10.35Z",
"2019-06-21T00:30:10.35Z",
"2019-06-21T00:36:10.35Z",
"2019-06-21T00:42:10.35Z",
"2019-06-21T00:48:10.34Z",
....
]
}
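For reference, the duplicates can also be confirmed directly against the zarr store, without going through a DataLoader. The following is an untested sketch; it assumes a local copy of the store and a year/channel group layout matching the path above (the in-store path "2019/171A" is my assumption):

# Untested sketch: count duplicate T_OBS values straight from the zarr metadata.
# Assumes the store is grouped by year and channel, as the path above suggests.
from collections import Counter
import zarr

root = zarr.open_group("/data/sdomlv2_full/sdomlv2.zarr", mode="r")
t_obs = root["2019"]["171A"].attrs["T_OBS"]  # hypothetical in-store path
counts = Counter(t_obs)
duplicates = {t: n for t, n in counts.items() if n > 1}
print(f"{len(duplicates)} duplicated timestamps out of {len(t_obs)}")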
Given a PyTorch DataLoader (an example can be found here), the issue can be reproduced as follows:
from sdo.sood.data.sdo_ml_v2_dataset import SDOMLv2NumpyDataset, get_default_transforms
from torch.utils.data import DataLoader

storage_root = "/data/sdomlv2_full/sdomlv2.zarr"
storage_driver = "fs"
year = None
channel = "171A"
cache_max_size = 2 * 1024 * 1024 * 1024  # 2 GiB in-memory cache
target_size = 512

transforms = get_default_transforms(target_size=target_size, channel=channel)
dataset = SDOMLv2NumpyDataset(
    storage_root=storage_root,
    storage_driver=storage_driver,
    cache_max_size=cache_max_size,
    year=year,
    channel=channel,
    transforms=transforms,
    start=None,
    end=None,
    freq=None,
    irradiance=None,
    irradiance_channel=None,
    goes_cache_dir=None,
    reduce_memory=True,
    obs_times=None,
)
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=False,
    num_workers=16,
    prefetch_factor=2,
)

# collect every T_OBS value that occurs more than once
# (seen.add(x) returns None, so x is only kept if it was already in the set)
seen = set()
duplicates = [x for x in loader.dataset.attrs["T_OBS"] if x in seen or seen.add(x)]
duplicates
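To see where the repeats sit in the index (useful for checking whether the underlying images are identical), something like the following pandas sketch should work; pandas is my addition here, not part of the dataset API:

# Group repeated timestamps to see their index positions (pandas assumed).
import pandas as pd

t_obs = pd.Series(loader.dataset.attrs["T_OBS"])
dups = t_obs[t_obs.duplicated(keep=False)]  # every occurrence of a repeated value
for ts, grp in dups.groupby(dups):
    print(ts, list(grp.index))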
Please check whether these observations need to be removed.
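In the meantime, a possible client-side workaround (untested sketch, assuming the dataset supports integer indexing like a standard map-style Dataset) would be to keep only the first occurrence of each timestamp:

# Untested workaround sketch: drop repeated timestamps on the consumer side.
from torch.utils.data import Subset

seen, keep = set(), []
for i, t in enumerate(dataset.attrs["T_OBS"]):
    if t not in seen:
        seen.add(t)
        keep.append(i)
dedup_dataset = Subset(dataset, keep)  # indexes only the first occurrences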
Cheers,
Marius