
Annotations

- Lead/prep: Read through the documentation.
- Do: Annotate the dataset with id-variable-type-entry tuples.
- Measure: Do the annotations match the data? Is the level of method description clear?

Definitions and examples

Annotations, or metadata, are information about the data. This can include long names or descriptions of columns, units associated with those columns, control vocabularies, or methods. Sometimes this also includes context that applies to the entire dataset: information that would be known within the original publication but might be lost when moving the data out of that context. Sample time, location, and soil type are examples of information that often sits in the methods section or introduction of a report or publication rather than inside a data table, and that may be lost when the data are integrated into a larger collection.

While an annotation can take many forms, we have settled on a 4-part tuple: (id, of_variable, is_type, with_entry). The id identifies the column and often consists of a table_id and a column_id. The table_id is generally the base name of the csv/tsv file, and the column_id is an exact match for the name of a column in that table. Every column_id must be associated with a table_id, but information may be associated with a table_id alone and not a specific column.

Entries in the of_variable column describe the variable associated with the id. These are generally observable properties like bulk density, organic carbon, ecotype, or soil color. Often they reflect how the researchers see their data; for example, do they distinguish between an organic carbon fraction and a carbon fraction? In creating these annotations you should strive to reflect the original purpose or intent of the data providers as much as possible. In contrast to the highly dataset-dependent of_variable entries, is_type should draw from a more restricted set of entries.

The is_type entry commonly falls into one of the following: {value, definition, unit, method, control_vocabulary}. This describes the kind of information, associated with the of_variable, that is found in the with_entry. Note that we have chosen not to distinguish between numerical and text values, and instead just use the term 'value' for simplicity. Currently, control_vocabulary entries are key-definition pairs where | separates the key from the definition and ; separates the different pairs.
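As a sketch of how such an entry might be unpacked, the R snippet below splits a control_vocabulary string on the separators described above into a named vector of definitions; the string itself is just an illustrative fragment.

```r
# Unpack a control_vocabulary entry, assuming ';' separates pairs
# and '|' separates each key from its definition.
cv <- "value|the observation or measurement; definition|a human readable definition of the variable"

pairs <- trimws(strsplit(cv, ";", fixed = TRUE)[[1]])
keys <- sapply(strsplit(pairs, "|", fixed = TRUE), `[`, 1)
defs <- sapply(strsplit(pairs, "|", fixed = TRUE), `[`, 2)

setNames(defs, keys)  # named vector: keys -> definitions
```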

The with_entry column holds either the entry associated with the variable and type described above or a special notation indicating that you should refer to the data set being described. We currently use -- for this data set reference.
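For a concrete (and entirely hypothetical) illustration, the R sketch below builds a few tuples for an invented table soil_core.csv with a bulk density column bd; all names and entries are made up for the example. Note that the first row attaches site information to the table as a whole, with no column_id.

```r
# A few illustrative annotation tuples; 'soil_core' and 'bd' are
# hypothetical. '--' points back to the data table itself.
annotations <- data.frame(
  table_id    = "soil_core",
  column_id   = c(NA, "bd", "bd", "bd"),
  of_variable = c("site", "bulk density", "bulk density", "bulk density"),
  is_type     = c("value", "value", "unit", "method"),
  with_entry  = c("Example Site", "--", "g cm-3", "oven-dried intact core"),
  stringsAsFactors = FALSE
)
```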

Annotation of an annotation table

To give an example, let's consider the annotation table itself as described above.

| column_id | of_variable | is_type | with_entry |
| --- | --- | --- | --- |
| column_id | column_header | identifier | -- |
| column_id | column_header | description | The column name or header or identifier that links to the table being annotated. |
| of_variable | variable | value | -- |
| of_variable | variable | description | The property or variable that the information in this column is associated with. |
| is_type | type | value | -- |
| is_type | type | description | The dimension or type of information associated with a specific variable and column. |
| is_type | type | control_vocabulary | value\|the observation or measurement; definition\|a human readable definition of the variable; unit\|unit of measure for the observation; control_vocabulary\|paired term definitions used to describe the variable |
| with_entry | entry | value | -- |
| with_entry | entry | description | The entry associated with the id-variable-type tuple, or a '--' to denote reference to the data table being annotated. |

Read the documentation

Gathering the information to create an annotation requires reading the contextual information provided with the data. This can include reading associated publications or reports, reviewing metadata on archives, or talking with a point of contact familiar with the data. Documentation resources generally fall into structured machine-readable formats (e.g. EML metadata on an archive), unstructured machine-readable formats (e.g. text tables or structured lists in pdf documents), and human-driven formats (e.g. paragraph-embedded information or correspondence with other researchers). While it is possible to create an annotation entirely by hand, text processing scripts can often help reduce errors when working with machine-readable text. If you do use text processing, documenting it at the end of an integration report is good practice to ensure transparency. In all cases, running checks on your annotation structure (described below) will reduce frustration down the line.
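As one illustration of such a script, the R sketch below drafts a skeleton annotation table from the headers of a hypothetical soil_core.csv, leaving the of_variable and with_entry columns to be filled in by hand from the documentation.

```r
# Draft annotation rows from a data table's headers instead of typing
# column ids by hand; 'soil_core.csv' is a hypothetical file name.
headers <- names(read.csv("soil_core.csv", nrows = 1))

skeleton <- expand.grid(
  column_id = headers,
  is_type   = c("definition", "unit", "method"),
  stringsAsFactors = FALSE
)
skeleton$table_id    <- "soil_core"
skeleton$of_variable <- NA_character_  # fill in from the documentation
skeleton$with_entry  <- NA_character_  # fill in from the documentation

write.csv(
  skeleton[, c("table_id", "column_id", "of_variable", "is_type", "with_entry")],
  "soil_core_annotation_skeleton.csv",
  row.names = FALSE
)
```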

Generally you are looking to identify human-readable descriptions of what the column names refer to, associated units or methods, location and collection time information, and control vocabularies.

Checking the annotations

As a final step, annotations should be checked for completeness. Column and table ids should be cross-referenced with the data set itself to ensure that all columns are described as desired (for large data sets, a partial annotation covering only the columns of interest may be appropriate). Variables and types should be checked for spelling errors (for example, the unique function in R can be used to catch spelling variations).
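A minimal sketch of these checks in R, assuming the annotation and data tables live in hypothetical csv files, might look like:

```r
# Hypothetical file names; adjust to your own data set.
annotations <- read.csv("soil_core_annotation.csv", stringsAsFactors = FALSE)
dataTable   <- read.csv("soil_core.csv", stringsAsFactors = FALSE)

# Cross-reference ids: columns with no annotation, and vice versa
setdiff(names(dataTable), annotations$column_id)
setdiff(na.omit(annotations$column_id), names(dataTable))

# Scan the variable and type entries for spelling variations
sort(unique(annotations$of_variable))
sort(unique(annotations$is_type))
```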

Finally, reviewing the annotation table with an expert in the data set or a second reviewer is recommended at this stage.
