Correct formatting of field 'sizes' in DataCite public data file 2025 #227
Replies: 2 comments
Thanks for the report, Bianca. I will investigate what is causing this issue. I've opened a ticket in the repo for the data file generation code here: datacite/alopekis#28. Once the cause is identified and fixed, the issue will not be present in the next yearly data file or in monthly data files from that point on, but I can't say whether we'll be able to retroactively fix the already issued 2025 yearly data file.
Hi @bmkramer, thanks again for reporting this. We've verified that "sizes" is formatted correctly in the 2025 data file. For example, in 10.4230/lipics.time.2024.20, it appears as
Could you share more about how you are processing the data? We have some basic instructions for this in the README file which may be helpful. If you have any further questions, please don't hesitate to reach out to us at [email protected].
What is the problem that your suggestion solves?
In the recently released DataCite public data file 2025 (thanks for providing this ❤️ !) the field attributes.sizes is formatted as a string value, rather than an array of strings.
Examples:
"sizes": "['18 pages', '788398 bytes']" (doi: 10.4230/lipics.time.2024.20)
"sizes": "['206784']" (doi: 10.15468/dl.jsd232)
"sizes": "[]" (most dois)
This causes an error when ingesting the jsonl into Google Big Query when this variable is defined as an array of strings.
In the 2024 datafile, the formatting was correct and did not cause an error on ingest:
Examples:
"sizes":["170000kB","a:170000kB"] (doi: 10.57451/lhd.langmuir6.165790.4)
"sizes":[] (most dois)
What solution might meet your needs?
It would be great if this formatting could be fixed (i.e. the variable formatted as an array of strings without surrounding double quotation marks) in the public datafile (for this year or at least next year) and, if applicable, in the monthly data files as well.
Your name
Bianca Kramer
Your organization
Sesame Open Science
What alternatives have you tried or considered?
Potential workarounds include:
However, all of these solutions are ad hoc fixes which are not ideal and break backwards and forwards compatibility across data snapshots.
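As an illustration of what such an ad hoc fix could look like, here is a minimal sketch of a preprocessing step that normalizes the field before ingest. It is not taken from the DataCite README or from the data file generation code. The function names normalize_sizes and normalize_record are hypothetical, and the sketch assumes each jsonl line is a JSON object with the field at attributes.sizes, as described above.

```python
import ast
import json


def normalize_sizes(value):
    """Return `sizes` as a list of strings, whichever form it arrived in."""
    if isinstance(value, list):
        # Already an array (the 2024-style formatting): just coerce items.
        return [str(item) for item in value]
    if value is None:
        return []
    if isinstance(value, str):
        # Stringified list such as "['18 pages', '788398 bytes']" or "[]".
        # ast.literal_eval only parses literals; it does not execute code.
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return [value]  # plain string, keep it as a single entry
        if isinstance(parsed, (list, tuple)):
            return [str(item) for item in parsed]
        return [value]
    return [str(value)]


def normalize_record(line):
    """Rewrite one jsonl line so that attributes.sizes is a JSON array.

    Assumption: the field lives at record["attributes"]["sizes"].
    """
    record = json.loads(line)
    attributes = record.get("attributes")
    if isinstance(attributes, dict) and "sizes" in attributes:
        attributes["sizes"] = normalize_sizes(attributes["sizes"])
    return json.dumps(record)
```

Applying normalize_record to every line before loading would let the same array-of-strings schema accept both formats, at the cost of an extra pass over the data.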
Is there anything else you would like to share?
I have not tested the formatting of this field in DataCite XML or the Rest API output.
What group(s) would benefit from your suggestion?
If other group(s), please describe.
No response