Duplicate observations

Hi,

I recently came across a potential data quality issue: there appear to be duplicate observations in the SDO ML v2 dataset.

For example, the following times are affected in the 171A channel (fdl-sdoml-v2/sdomlv2.zarr/2020/171A):

```
{
  "171A": [
    "2019-06-21T00:00:10.35Z",
    "2019-06-21T00:06:10.35Z",
    "2019-06-21T00:12:10.35Z",
    "2019-06-21T00:18:10.35Z",
    "2019-06-21T00:24:10.35Z",
    "2019-06-21T00:30:10.35Z",
    "2019-06-21T00:36:10.35Z",
    "2019-06-21T00:42:10.35Z",
    "2019-06-21T00:48:10.34Z",
....
]
}
```

Given a PyTorch DataLoader (an example dataset implementation can be found [here](https://github.com/i4Ds/sdo-cli/blob/main/src/sdo/sood/data/sdo_ml_v2_dataset.py)), the issue can be reproduced as follows:

```
from torch.utils.data import DataLoader

from sdo.sood.data.sdo_ml_v2_dataset import SDOMLv2NumpyDataset, get_default_transforms

storage_root = "/data/sdomlv2_full/sdomlv2.zarr"
storage_driver = "fs"
year = None
channel = "171A"
cache_max_size = 2 * 1024 * 1024 * 1024  # 2 GiB
target_size = 512
transforms = get_default_transforms(target_size=target_size, channel=channel)

dataset = SDOMLv2NumpyDataset(
    storage_root=storage_root,
    storage_driver=storage_driver,
    cache_max_size=cache_max_size,
    year=year,
    channel=channel,
    transforms=transforms,
    start=None,
    end=None,
    freq=None,
    irradiance=None,
    irradiance_channel=None,
    goes_cache_dir=None,
    reduce_memory=True,
    obs_times=None,
)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=False,
    num_workers=16,
    prefetch_factor=2,
)

# Collect every T_OBS value that already appeared earlier in the list.
seen = set()
duplicates = []
for t_obs in loader.dataset.attrs["T_OBS"]:
    if t_obs in seen:
        duplicates.append(t_obs)
    seen.add(t_obs)
duplicates
```
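
To see how often each timestamp repeats and at which positions, the check could be extended roughly as follows. This is only a minimal sketch: `find_duplicates` is a hypothetical helper, and the short `t_obs` list is a made-up stand-in for `loader.dataset.attrs["T_OBS"]`, not real output.

```python
from collections import defaultdict


def find_duplicates(values):
    """Map each value that occurs more than once to the list of its positions."""
    positions = defaultdict(list)
    for index, value in enumerate(values):
        positions[value].append(index)
    return {value: idx for value, idx in positions.items() if len(idx) > 1}


# Made-up sample standing in for loader.dataset.attrs["T_OBS"].
t_obs = [
    "2019-06-21T00:00:10.35Z",
    "2019-06-21T00:06:10.35Z",
    "2019-06-21T00:00:10.35Z",
]
print(find_duplicates(t_obs))
```

Knowing the positions should make it easier to compare the image data of the repeated entries and check whether they are identical.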

Please check whether these observations need to be removed.
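
If the duplicates turn out to be redundant, one possible workaround on the reader side would be to keep only the first index of each timestamp. The following is a sketch under that assumption; `first_occurrence_indices` is a hypothetical helper and not part of sdo-cli, and keeping the first occurrence is only appropriate if the repeated images are actually identical.

```python
import numpy as np


def first_occurrence_indices(values):
    """Return the sorted indices of the first occurrence of each distinct value."""
    # np.unique with return_index=True yields the index of each value's first occurrence.
    _, first_idx = np.unique(np.asarray(values), return_index=True)
    # Sort so the original (chronological) order of the observations is preserved.
    return np.sort(first_idx)


# Hypothetical usage: keep = first_occurrence_indices(dataset.attrs["T_OBS"])
```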

Cheers,
Marius
