Possible new check: duplicate entity #10
Hmm, but this isn’t happening frequently, and that is why Duane implemented the option to reuse the same tables when the MD5 hasn’t changed. Why are you even looking at the older version?
|
The reason I'm seeing both of these is that they turned up side-by-side in
a search, but I'm betting that they shouldn't have.
This situation would not have been caught by the md5 check, but really, it
reflects a flaw in a local system (keeping track of packageIds), so it is
probably out of scope.
|
This is a housekeeping problem for the site, and this is not the only one of their datasets affected. But since they know about it, it's not a good use case for a check.
|
A housekeeping problem for the site, for sure, but I wonder whether a check that would help sites/data-set submitters avoid such issues might be worth considering. Once submitted, it is no longer just a local problem but a community problem: data packages cannot be deleted, so now these duplicates exist, which is not the end of the world but also not a good situation.
|
There may be a check of some form, e.g., a warn that "this exact entity is associated with another EML record!" So, OK, we can leave it. Thanks @srearl. BTW, the solution (to the duplicate) is still technically called a "delete", although datasets are only deleted from the index (they are not actually deleted, only archived). The process needs to get written up; it falls into the BPs-for-working-with-PASTA, so I guess it's mine, and belongs here: https://github.com/EDIorg/dm-best-practices
|
Here is a legitimate reason for one table to show up in multiple datasets: the table is a species list, and the submitter has multiple datasets that use the same species table. There are alternatives (e.g., create a unique dataset for the species-list table), but packaging them together can be more convenient. So a check that looks for duplicate entities should not return a warn, because we should not imply it's not a legitimate thing to do.
|
Yeah, but @mobb, is that not a different issue? I do not think having the same data entities in different data packages is a problem. In fact, I do that often and purposefully, for example when the same spatial data apply to data in different data packages. This issue, I thought, concerned duplicate data packages (not data entities across packages).
|
This issue describes a check for duplicate entities using entity checksums. Duplicate packages would be harder. And yes, the example that started this is actually a duplicate dataset.
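The checksum-based check described above could be sketched roughly as follows. This is a minimal illustration, not PASTA's actual API: the `(package_id, entity_name, md5)` tuple format and the `find_duplicate_entities` name are hypothetical, and, per the discussion, a shared checksum should be surfaced for review rather than treated as an error.

```python
from collections import defaultdict

def find_duplicate_entities(entities):
    """Group entities by MD5 checksum and report any checksum that
    appears in more than one data package.

    `entities` is an iterable of (package_id, entity_name, md5) tuples --
    a hypothetical input format, not an actual repository structure.
    """
    by_checksum = defaultdict(set)
    for package_id, _entity_name, md5 in entities:
        by_checksum[md5].add(package_id)
    # Sharing an entity across packages can be legitimate (e.g., a shared
    # species list), so a check like this should report, not reject.
    return {md5: sorted(pkgs)
            for md5, pkgs in by_checksum.items()
            if len(pkgs) > 1}

entities = [
    ("knb-lter-jrn.2100351001.45", "table1", "d41d8cd9..."),
    ("knb-lter-jrn.210351001.47", "table1", "d41d8cd9..."),
    ("knb-lter-xyz.1.1", "soil", "aabbccdd..."),
]
print(find_duplicate_entities(entities))
# -> {'d41d8cd9...': ['knb-lter-jrn.2100351001.45', 'knb-lter-jrn.210351001.47']}
```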
I think the duplicate-data-set case (not duplicate data entities in different packages) is worth considering. Maybe as a new issue, but either way.
I bumped into 2 datasets that appear identical; it seems that only the packageIds are different.
https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-jrn.2100351001.45
https://portal.edirepository.org/nis/metadataviewer?packageid=knb-lter-jrn.210351001.47
same entity, same md5 hash.
Possibly, rev. 47 was intended to be a metadata-only update of rev. 45. The diff is below. I don't know if there is any way to trap this; possibly by checking the md5 hash regardless, and alerting the user that "hey, this entity is already associated with a packageId, do you really want to upload it again?" But maybe that is too much hand-holding. I contacted the site IM and let him know.
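Computing the entity checksum for a comparison like the one above is straightforward with the standard library. This is a generic sketch (the function name is mine), shown only to illustrate the md5 comparison, not how the repository actually computes its stored checksums.

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, streaming in chunks so that
    large data entities do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Two entities with identical bytes yield identical digests, which is
# exactly the "same entity, same md5 hash" situation seen here.
```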
Here is the diff: