Identify and remove duplicated marine in-situ observation #211

@apchoiCMD

The issue is not just concatenation itself; it is recursive concatenation.

For example, if the 2026042200 file is made by concatenating:

* 2026042100
* 2026042106
* 2026042112
* 2026042118
* 2026042200

then that can be OK only if those are raw, non-concatenated source files.

But if each of those input files was already produced by concatenating the previous window, then we are re-ingesting observations that were already included in earlier cycles.

So the duplication pattern looks like this:

cycle 00 = raw 00 + raw 18 + raw 12 + raw 06 + raw 00_prev
cycle 06 = cycle 00 + raw 06 + raw 00 + raw 18 + raw 12

At that point, cycle 06 contains observations from cycle 00, and cycle 00 already contains several earlier files. The overlap compounds every cycle.
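To make the compounding concrete, here is a toy model of the pattern above. It is only a sketch under assumptions: each cycle has one raw file, the window is five files, and in the recursive case every earlier input is the stored concatenated product rather than the raw file. The function name `build_windows` and its parameters are hypothetical and not part of any existing tool.

```python
from collections import Counter


def build_windows(n_cycles, window=5, recursive=True):
    """Return one Counter per cycle: raw-cycle index -> number of copies."""
    raw = [Counter({i: 1}) for i in range(n_cycles)]
    products = []
    for i in range(n_cycles):
        merged = Counter()
        for j in range(max(0, i - window + 1), i):
            # Recursive case re-ingests the earlier product;
            # clean case reads the earlier raw file instead.
            merged += products[j] if recursive else raw[j]
        merged += raw[i]  # the current cycle's own raw observations
        products.append(merged)
    return products


if __name__ == "__main__":
    for label, flag in (("recursive", True), ("raw only", False)):
        last = build_windows(6, recursive=flag)[-1]
        print(label, dict(sorted(last.items())))
```

In this model, concatenating raw files keeps every observation at one copy, while the recursive case roughly doubles the copies of the oldest observations with each cycle.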

The global attribute confirms this:

obs_source_files =
  gdas.t00z.insitu_profile_argo.2026042100.nc,
  gdas.t06z.insitu_profile_argo.2026042106.nc,
  gdas.t12z.insitu_profile_argo.2026042112.nc,
  gdas.t18z.insitu_profile_argo.2026042118.nc,
  gdas.t00z.insitu_profile_argo.2026042200.nc

If any of those source files are themselves concatenated products, then we are not building a 24-hour window from unique raw inputs; we are recursively accumulating prior windows.
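One way to test that condition is to look at the `obs_source_files` attribute of each input as well. The sketch below assumes the attribute is a comma-separated string, as shown above, and that a raw file either lacks the attribute or leaves it empty. The lookup is a plain dict standing in for reading the attribute from each file; `parse_sources` and `already_concatenated` are hypothetical helper names.

```python
def parse_sources(attr_value):
    """Split a comma-separated obs_source_files value into file names."""
    return [name.strip() for name in attr_value.split(",") if name.strip()]


def already_concatenated(product, source_attrs):
    """Return the inputs of `product` that list more than one source themselves.

    source_attrs maps file name -> its obs_source_files string; raw files
    are assumed to have no entry (or an empty string).
    """
    flagged = []
    for src in parse_sources(source_attrs.get(product, "")):
        if src == product:
            # Same name as the product: cannot be told apart in this toy lookup.
            continue
        if len(parse_sources(source_attrs.get(src, ""))) > 1:
            flagged.append(src)
    return flagged
```

Any file name returned by `already_concatenated` is an input that was itself built from several files, so concatenating it again would re-ingest earlier windows.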

There are two possible fixes: concatenate only from the original per-cycle source files, or explicitly de-duplicate after concatenation using a stable observation key. A candidate key is platform/profile ID + dateTime + lat/lon + depth, depending on what metadata are available in the IODA file.
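The de-duplication option could look roughly like the following. This is a minimal sketch, not tied to the IODA API: observations are plain dicts, and the field names (`platform_id`, `date_time`, `lat`, `lon`, `depth`) and the rounding precision are assumptions that would need to match whatever metadata the files actually carry.

```python
def obs_key(ob, ndigits=4):
    """Build a hashable key from the (assumed) identifying fields."""
    return (
        ob["platform_id"],
        ob["date_time"],
        round(ob["lat"], ndigits),
        round(ob["lon"], ndigits),
        round(ob["depth"], ndigits),
    )


def deduplicate(observations):
    """Keep the first occurrence of each key, preserving input order."""
    seen = set()
    unique = []
    for ob in observations:
        key = obs_key(ob)
        if key not in seen:
            seen.add(key)
            unique.append(ob)
    return unique
```

Rounding the coordinates is a design choice to tolerate small floating-point differences between copies; if the copies are bit-identical, the raw values could be used directly.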
