Using CSVW to annotate Soil Observation Data #181

pvgenuchten · 2026-01-29T11:20:40Z

Hi team, let me pitch an idea we're investigating for sharing, and later harmonizing, soil observation data ("we" being a cluster of ISRIC and Wageningen University, working within the Soilwise HE project).

Background

Many researchers share their data in tabular formats (Excel, CSV, DBF), usually together with a report or readme that explains the individual fields (feature of interest, observed property, unit of measure, procedure). Our thought was: if we can encourage researchers to use a machine-readable format for the readme, machines would be able to combine the data and its description into a rich data structure.
The W3C CSVW (CSV on the Web) specification seems a standardized way to cover this scenario.

Approach

We identified csvwlib as a lean solution for working with CSVW-annotated data. We used the tool to set up a number of data experiments in the Soilwise repository. Example 3 shows nicely how a CSV is converted to SOSA triples using the CSVW annotations. We also have an approach using the schema.org ontology. Another experimental script will be able to transform the graph into a relational database following the ISO 28258 structure. Also interesting is a SHACL validation that can validate any SOSA graph.
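To make the idea concrete, here is a minimal sketch of a CSVW annotation and how a machine could combine it with the data. It is not the csvwlib implementation and not the Soilwise example: the table, the column names, and the example.org URIs are made up for illustration, and only the Python standard library is used. The metadata keys (@context, url, tableSchema, aboutUrl, columns, name, titles, datatype, propertyUrl) come from the W3C CSVW vocabulary.

```python
import csv
import io
import json

# Hypothetical soil observation table; column names are illustrative only.
CSV_TEXT = """sample_id,depth_cm,ph_h2o
S1,10,6.4
S2,30,5.9
"""

# CSVW metadata describing the table. All example.org URIs are placeholders.
METADATA = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "soil_observations.csv",
    "tableSchema": {
        "aboutUrl": "http://example.org/sample/{sample_id}",
        "columns": [
            {"name": "sample_id", "titles": "sample_id", "datatype": "string"},
            {"name": "depth_cm", "titles": "depth_cm", "datatype": "decimal",
             "propertyUrl": "http://example.org/property/depth"},
            {"name": "ph_h2o", "titles": "ph_h2o", "datatype": "decimal",
             "propertyUrl": "http://example.org/property/pH"},
        ],
    },
}


def annotate(csv_text, metadata):
    """Return (subject, property, value) records for every annotated cell."""
    schema = metadata["tableSchema"]
    columns = {col["name"]: col for col in schema["columns"]}
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # CSVW uses URI templates; for this simple template str.format suffices.
        subject = schema["aboutUrl"].format(**row)
        for name, value in row.items():
            column = columns[name]
            if "propertyUrl" not in column:
                continue  # column only identifies the row, no property attached
            if column["datatype"] == "decimal":
                value = float(value)
            records.append((subject, column["propertyUrl"], value))
    return records


records = annotate(CSV_TEXT, METADATA)
print(json.dumps(records, indent=2))
```

A real converter would emit proper RDF (for example SOSA observations) rather than plain tuples; the point here is only that the annotation file carries everything needed to interpret each cell.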

Findings

Although the approach seems fully valid and capable of handling the case, we noticed that it is challenging for soil scientists to compile a CSVW annotation file. So we are designing tooling to support that activity: a VBA tool within Excel, a web tool, and an LLM-based tool. We are also designing an intermediate step, in which scientists add the required information (feature of interest, observed property, unit of measure, procedure) in a basic CSV format.
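As a rough sketch of that intermediate step, the snippet below turns a simple annotation table (one row per data column) into a CSVW metadata document. The annotation table's headers, the sample values, and the choice to attach SOSA and QUDT property URLs directly to each column are assumptions for illustration, not the format used in the Soilwise experiments. Values are kept as plain text here, where a real workflow would likely point to vocabulary URIs.

```python
import csv
import io
import json

# Hypothetical annotation table a scientist could maintain in a spreadsheet.
ANNOTATION_CSV = """column,feature_of_interest,observed_property,unit,procedure
ph_h2o,soil sample,pH,pH units,1:2.5 soil-water suspension
clay,soil sample,clay content,percent,pipette method
"""

# Assumed mapping from the annotation headers to property URLs.
FIELD_TO_PROPERTY = {
    "feature_of_interest": "http://www.w3.org/ns/sosa/hasFeatureOfInterest",
    "observed_property": "http://www.w3.org/ns/sosa/observedProperty",
    "procedure": "http://www.w3.org/ns/sosa/usedProcedure",
    "unit": "http://qudt.org/schema/qudt/unit",
}


def annotations_to_csvw(annotation_csv, data_url):
    """Build a CSVW metadata dict from a one-row-per-column annotation table."""
    columns = []
    for row in csv.DictReader(io.StringIO(annotation_csv)):
        column = {"name": row["column"], "titles": row["column"]}
        for field, prop in FIELD_TO_PROPERTY.items():
            if row[field]:  # skip cells the scientist left empty
                column[prop] = row[field]
        columns.append(column)
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": data_url,
        "tableSchema": {"columns": columns},
    }


metadata = annotations_to_csvw(ANNOTATION_CSV, "soil_observations.csv")
print(json.dumps(metadata, indent=2))
```

The scientist only ever edits a table; the JSON is generated and can be shipped next to the data file.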

Welcoming your thoughts/ideas, bye Paul.

ktoddbrown · 2026-01-29T19:42:01Z

Maintainer

Hi Paul,

CSVW

CSVW does look like a nice lightweight solution for table annotation. It seems very similar to YAML and JSON in many respects, and comparing the three could be really interesting. Why did you pick CSVW rather than YAML or JSON here? It seems like an obscure choice.

This triggered a lot of other thoughts for me about metadata and what we ask of data providers that I'll go into in other comments to try to keep the reply threads cleaner.

-Kathe

1 reply

pvgenuchten Jan 30, 2026
Author

In my search for conventions to implement this case, I've noticed a bunch of initiatives, such as:

OKFN Table Schema
ISO 19110, as part of the ISO/TC 211 ISO 19115 suite of standards
STAT-DCAT-AP
pygeometa, which is actually a metadata abstraction in YAML that I use a lot (YAML behaves well in git)
Some platforms (figshare) have a semi-structured template for readme.txt
RML.io (and YARRRML) have a similar goal, but are more extensive

Many of the above initiatives are unfortunately a bit too limited to describe the richness of observation data. For every observed value we want to properly capture its context (observation data is more metadata than data). On the other side, there are initiatives that are too complex for the intended audience. Others do not offer a configuration file that can easily be shared along with the data file. CSVW actually uses JSON for its metadata (or maybe your question relates to our experimental CSV format, which replaces the JSON with CSV; it was a wish from our Excel-oriented scientists, who prefer everything in tables).

CSVW approach
data -> in csv format
metadata -> in json

Why JSON? I love YAML, but JSON is the format selected by that W3C group. On the other hand, JSON and YAML are compatible, so I could indeed write as YAML and save as JSON...

I certainly do appreciate your efforts here. However, I think the community can benefit from an abstraction between data and code, so one doesn't need to be an FME expert, a Java programmer, or an R statistician to contribute to a harmonisation. And as always, there are R packages for CSVW.

ktoddbrown · 2026-01-29T19:42:52Z

Maintainer

No metadata, only data

I'm increasingly coming to the view that there is no metadata, only data, when considering researcher-provided information. This has led to us adopting a generalized data tuple: <id> - of_variable - is_type - with_entry - from_source. of_variable is similar to a traditional 'observed property'; is_type includes things like value, variance, uncertainty, unit, and method; and with_entry is the data itself, from a declared source in from_source.

Most of the metadata in this case then becomes the vocabulary that supports these fields rather than information from the researcher. I'll admit that we are just starting to really test this generalized format in production, but it's prototyped well so far. This puts the burden of stitching data together from multiple sources on the re-use team, not on the original data providers.
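One possible reading of the tuple described above, as a small sketch: a measured value and the unit or method that describe it all become entries of the same shape, each pointing at the source it came from. This is an interpretation for illustration, not the production format; the Entry class, the to_entries helper, the file names, and the sample values are all hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One generalized data tuple; field names follow the description above."""
    id: str
    of_variable: str
    is_type: str
    with_entry: str
    from_source: str


def to_entries(row_id, row, described, data_source, doc_source):
    """Flatten one table row plus its documentation into Entry tuples."""
    entries = []
    for variable, value in row.items():
        # The measured value itself, as found in the data table.
        entries.append(Entry(row_id, variable, "value", value, data_source))
        # Unit, method, etc. are stored the same way, from the documentation.
        for is_type, text in described.get(variable, {}).items():
            entries.append(Entry(row_id, variable, is_type, text, doc_source))
    return entries


# Hypothetical inputs: one row of a data table and what the methods text says.
row = {"ph_h2o": "6.4", "clay": "23"}
described = {
    "ph_h2o": {"unit": "pH units", "method": "1:2.5 soil-water suspension"},
}

entries = to_entries("S1", row, described, "table_1.csv", "methods.txt")
for entry in entries:
    print(entry)
```

Note that the clay value has no unit entry: nothing is invented for the provider, and the gap stays visible for the re-use team to resolve.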

1 reply

pvgenuchten Jan 30, 2026
Author

I'll put it on my list to dive into this a bit.

I like the idea of shifting the effort from the producers to the users. It was one of my thoughts behind CSVW: anyone in the community can generate the CSVW annotation file, even for legacy data.

ktoddbrown · 2026-01-29T19:47:55Z

Maintainer

Unstandardized standards

Data providers shouldn't be using templates or standardized data models, especially for research results. From a philosophical perspective, research is all about novelty, so this often translates into a new non-standard measurement, sample prep, or treatment. Using something like drop-down terms/codes then becomes problematic when the exact variant isn't there. In addition, the level of detail that needs to be captured varies for each synthesis end use. All of this leads to very frustrated data providers, in my experience.

Instead we should honor how researchers already share data: tables, figures, methods, and protocols. Researchers organize their data according to their mental model of the system, are already trained to write reproducible documentation (methods sections), and many even create protocols for lab technicians or colleagues to follow when collecting the data. Digitizing and linking these data into coherent collections is then fit for purpose for the reanalysis.

All that being said, I do have one ask of our data-providing colleagues: please use a flat text file rather than some proprietary spreadsheet/database! So I guess I do have standards here :)

1 reply

pvgenuchten Jan 29, 2026
Author

I love your review here. Adopting a standard should not limit the creativity/novelty of science, and indeed, current practices in experimental design and reviewing systems already facilitate reproducibility.

However, I do feel academia can benefit from writing down its results in a more structured way, so that machines can parse the information more accurately, instead of depending on LLMs to parse reports (even though LLMs probably help a lot).

The procedure or approach in a piece of research may be novel, but many other aspects are not, such as time, unit, location, chemistry... The standard aspects are best captured using standard vocabularies. But indeed, for the novel aspects (procedures/uncertainty), any data archiving approach should be extensible enough to capture the unknown.

Endorsing CSV/YAML/JSON over Excel/SPSS/MATLAB formats is part of that puzzle
text based vs binary (but standardised) is a different (interesting) dimension
