Phenobase data loading prepares source observation releases for indexing in the Phenobase Elasticsearch datastore. The workflow begins by rebuilding the ontology-derived trait hierarchy, then uses that trait mapping during dataset normalization, validation, and loading so incoming observations can be indexed with consistent trait_urn, trait, and mappedTraits values. For the current published trait outputs, use the traits viewer and the published traits.csv.
This repository supports the Phenobase data pipeline around three core jobs:
- Rebuild the ontology-derived trait hierarchy in data/traits.csv.
- Load source CSV datasets into the Elasticsearch index used by Phenobase.
- Maintain and inspect the live datastore with export and backfill helpers.
The trait reasoning step comes first. The loader depends on data/traits.csv to resolve each incoming trait_urn into the current canonical trait label and the derived mappedTraits hierarchy used later for indexing and querying.
Reasoning is the first step because it generates the trait lookup consumed during ingestion.
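To make that dependency concrete, here is a minimal sketch of how a lookup could be built from a file in the traits.csv shape. The column names (`id`, `trait`, `mappedTraits`), the `|` separator, and the sample identifiers are assumptions for illustration only; check the real header of data/traits.csv and the loader code before relying on any of them.

```python
import csv
import io

# Hypothetical sample in a traits.csv-like shape. The real column names,
# separator, and identifiers in data/traits.csv may differ.
SAMPLE = """id,trait,mappedTraits
EX:0001,flower present,flower present|reproductive structure present
EX:0002,flower absent,flower absent
"""


def build_trait_lookup(handle, sep="|"):
    """Map each trait ID to its label and its list of mapped trait labels."""
    lookup = {}
    for row in csv.DictReader(handle):
        mapped = [m for m in row["mappedTraits"].split(sep) if m]
        lookup[row["id"]] = {"trait": row["trait"], "mappedTraits": mapped}
    return lookup


lookup = build_trait_lookup(io.StringIO(SAMPLE))
print(lookup["EX:0001"]["trait"], lookup["EX:0001"]["mappedTraits"])
```

An incoming trait_urn that is missing from the lookup is the kind of row the loader reports as an unmapped trait.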
Source of truth for the current workflow:
- PPO GitHub main: https://raw.githubusercontent.com/PlantPhenoOntology/ppo/refs/heads/main/ppo.owl
For repeatability, this repo supports two ways to regenerate the trait mapping:
- the default code method, which writes the canonical data/traits.csv
- a separate ROBOT/SPARQL method, which writes reasoning/robot/traits.csv and a comparison report against the canonical file
Run the canonical rebuild from the repo root:
python3 reasoning/refresh_traits.py
Compatibility wrapper:
./reasoning/get_traits.sh
This method:
- downloads the current PPO ontology from GitHub main
- snapshots the exact ontology used under reasoning/<version>/ppo.owl
- regenerates data/traits.csv
- publishes docs/traits.csv for GitHub Pages
- publishes docs/traits-data.json for the static viewer
- writes reasoning/traits_build_metadata.json
The SPARQL query is kept in its own file, reasoning/robot/traits_pairs.sparql, for repeatability.
The direct ROBOT query command is:
../robot/robot query \
--input reasoning/2026-05-06/ppo.owl \
--query reasoning/robot/traits_pairs.sparql reasoning/robot/traits_pairs.csv
That query emits flat trait-to-mapped-trait pairs. To group those pairs back into the Phenobase traits.csv shape and compare them with the canonical file, run:
python3 reasoning/refresh_traits_robot.py --skip-query \
--input-owl reasoning/2026-05-06/ppo.owl \
--pairs-output reasoning/robot/traits_pairs.csv
If you want the helper to run both the ROBOT query and the post-processing for you, use:
python3 reasoning/refresh_traits_robot.py
The ROBOT/SPARQL path writes:
- reasoning/robot/traits_pairs.csv: raw ROBOT query output
- reasoning/robot/traits.csv: grouped alternative traits output
- reasoning/robot/comparison.json: comparison to data/traits.csv
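The grouping step can be pictured with a short sketch. This is not the code in reasoning/refresh_traits_robot.py; the column names `trait` and `mappedTrait`, the sample rows, and the `|` join are assumptions chosen only to show how flat pairs collapse into one row per trait.

```python
import csv
import io
from collections import defaultdict

# Hypothetical flat pairs; the real header of traits_pairs.csv may differ.
PAIRS = """trait,mappedTrait
flower present,flower present
flower present,reproductive structure present
flower absent,flower absent
"""


def group_pairs(handle):
    """Collapse (trait, mappedTrait) pairs into one ordered list per trait."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        # Skip repeated pairs so each mapped trait appears once per trait.
        if row["mappedTrait"] not in grouped[row["trait"]]:
            grouped[row["trait"]].append(row["mappedTrait"])
    return dict(grouped)


grouped = group_pairs(io.StringIO(PAIRS))
for trait, mapped in grouped.items():
    print(trait, "->", "|".join(mapped))
```

Because the grouped order follows the order of the query output, two methods can agree on the members of each group while still differing in how they are ordered.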
Current comparison summary for the local 2026-05-06 snapshot:
- 189/189 traits matched by mapped-ID set semantics
- 170/189 rows matched exactly
- the remaining differences are ordering differences in mappedTraitIDs and mappedTraits
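The difference between the two match counts comes down to whether order is considered. The sketch below illustrates that distinction only; the `|` separator and the placeholder IDs are assumptions, and the actual comparison logic lives in the helper script.

```python
def compare_rows(canonical, alternative, sep="|"):
    """Return (exact_match, set_match) for two mapped-ID strings."""
    exact = canonical == alternative
    same_set = set(canonical.split(sep)) == set(alternative.split(sep))
    return exact, same_set


# Placeholder IDs, not real PPO terms.
canonical = "EX:A|EX:B|EX:C"
reordered = "EX:A|EX:C|EX:B"

result = compare_rows(canonical, reordered)
print(result)  # (False, True)
```

A row like this one counts toward the set-semantics total but not toward the exact-match total.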
Current reasoning artifacts:
- reasoning/refresh_traits.py
- reasoning/refresh_traits_robot.py
- reasoning/traits_build_metadata.json
- reasoning/2025-05-05/ppo.owl
- reasoning/2026-05-06/ppo.owl
- reasoning/robot/traits_pairs.sparql
- reasoning/robot/traits_pairs.csv
- reasoning/robot/traits.csv
- reasoning/robot/comparison.json
- data/traits.csv
- docs/traits.csv
- docs/traits-data.json
Quick verification after a rebuild:
python3 -m json.tool reasoning/traits_build_metadata.json
git diff -- data/traits.csv docs/traits.csv docs/traits-data.json
python3 -m json.tool reasoning/robot/comparison.json
Trait mapping rules used by the rebuild:
- labels ending in present are included
- labels ending in absent are included
- present traits map to themselves plus transitive named PPO superclass traits that also end in present
- absent traits map only to themselves
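The rules above can be sketched on a toy hierarchy. The labels and the `PARENTS` dictionary below are invented for illustration and do not come from PPO; the real rebuild reads superclass relations from the ontology file.

```python
# Toy hierarchy: label -> direct named superclass labels (hypothetical).
PARENTS = {
    "open flower present": ["flower present"],
    "flower present": ["reproductive structure present"],
    "reproductive structure present": ["plant structure"],
    "flower absent": ["reproductive structure absent"],
}


def mapped_traits(label, parents):
    """Apply the present/absent mapping rules to a single trait label."""
    if label.endswith("absent"):
        return [label]  # absent traits map only to themselves
    if not label.endswith("present"):
        return []  # labels with neither suffix are not included
    result, seen = [label], {label}
    stack = list(parents.get(label, []))
    while stack:  # walk superclasses transitively
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current.endswith("present"):
            result.append(current)
        stack.extend(parents.get(current, []))
    return result


present_result = mapped_traits("open flower present", PARENTS)
absent_result = mapped_traits("flower absent", PARENTS)
print(present_result)
print(absent_result)
```

Note that the absent trait ignores its superclass entirely, while the present trait collects every ancestor whose label ends in present and skips the one that does not.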
Use this sequence for a new Phenobase data release:
1. Refresh the ontology-driven trait mapping with python3 reasoning/refresh_traits.py.
2. Review reasoning/traits_build_metadata.json and diff data/traits.csv.
3. Place source .csv files in the release directory you want to ingest.
4. Add dataset-local mappings.csv and/or transform.yaml if source-specific cleanup, case normalization, regex mapping, null handling, or trait remapping is needed. Use mappings.csv for source-to-PPO ID lookup and transform.yaml for row-level cleanup.
5. Confirm the loader inputs are in place:
   - data/columns.csv defines field datatypes, requiredness, and system fields
   - data/traits.csv provides the PPO ID-to-trait-and-mappedTraits expansion generated by the reasoning step
   - transform.yaml is optional and only applies to the dataset directory being loaded
6. Run a dry ingestion pass:
   python3 loader.py --mode=<machine|in_situ|herbarium> --test --no-drop-existing <data_dir>
7. Review loading_errors.csv. Common failures are:
   - missing annotationID
   - duplicate annotationID within an input file
   - empty trait and trait_urn
   - trait_urn values not found in data/traits.csv
   - legacy trait values not found in data/traits.csv when no trait_urn is provided
   - strict-mode coercion failures for typed fields
8. Fix the source files or the dataset-local transform.yaml, then rerun test mode until the release is acceptable.
9. Run the real ingestion:
   python3 loader.py --mode=<machine|in_situ|herbarium> --no-drop-existing <data_dir>
10. If you are rebuilding the index from scratch rather than appending or updating, use:
   python3 loader.py --mode=<machine|in_situ|herbarium> --drop-existing <data_dir>
11. If the live index needs decadeStart, run:
   python3 update_decade_start.py --wait
The main loader is loader.py.
What the loader does during a run:
- recursively finds .csv files under the provided dataset directory
- applies optional row-level transforms from dataset-local transform.yaml
- coerces values to the datatypes defined in data/columns.csv
- computes system fields such as mappedTraits and decadeStart
- rejects rows with missing annotationID, duplicate annotationID within a file, or unmapped traits
- writes row-level failures to loading_errors.csv
- indexes to phenobase2 using annotationID as the Elasticsearch document _id
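The rejection rules in that list can be sketched as a small validation pass. This is an illustration under stated assumptions, not the code in loader.py: the function name, the error strings, and the sample IDs are hypothetical, and the sketch ignores the legacy trait-label fallback and all datatype coercion.

```python
def validate_rows(rows, known_trait_ids):
    """Split rows into accepted rows and (row_number, reason) failures."""
    accepted, errors, seen_ids = [], [], set()
    for number, row in enumerate(rows, start=1):
        annotation_id = (row.get("annotationID") or "").strip()
        trait_urn = (row.get("trait_urn") or "").strip()
        if not annotation_id:
            errors.append((number, "missing annotationID"))
            continue
        if annotation_id in seen_ids:
            errors.append((number, "duplicate annotationID"))
            continue
        seen_ids.add(annotation_id)  # remember every ID seen in this file
        if trait_urn not in known_trait_ids:
            errors.append((number, "unmapped trait"))
            continue
        accepted.append(row)
    return accepted, errors


rows = [
    {"annotationID": "a1", "trait_urn": "EX:0001"},
    {"annotationID": "", "trait_urn": "EX:0001"},
    {"annotationID": "a1", "trait_urn": "EX:0001"},
    {"annotationID": "a2", "trait_urn": "EX:9999"},
]
accepted, errors = validate_rows(rows, known_trait_ids={"EX:0001"})
print(len(accepted), errors)
```

Using annotationID as the document _id means a second load of the same annotationID updates the existing document, which is why duplicates inside one file are rejected rather than silently overwritten.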
Supported loading modes:
- machine
- in_situ
- herbarium
Example commands:
python3 loader.py --mode=machine data/annotations.07.25.2025/ --no-drop-existing --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ data/npn.1956.01.01-2025.08.31/ --no-drop-existing --batch-size 5000 --progress-every 50000
For a clean rebuild, generate each dataset-specific loader CSV first, dry-run each directory, then run the real loads. The first real load should use --drop-existing so the Elasticsearch index starts fresh; every later dataset load should use --no-drop-existing so records are appended or updated into the same index.
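The drop-once rule is easy to get wrong when loads are scripted, so here is a small sketch that builds the command list for a sequence of dataset directories. The helper name `build_load_commands` is hypothetical; it only prints commands using flags documented in this README and does not run them.

```python
import shlex


def build_load_commands(dataset_dirs, mode="in_situ"):
    """Build loader commands, dropping the index only for the first dataset."""
    commands = []
    for position, data_dir in enumerate(dataset_dirs):
        drop_flag = "--drop-existing" if position == 0 else "--no-drop-existing"
        commands.append(["python3", "loader.py", f"--mode={mode}", drop_flag, data_dir])
    return commands


commands = build_load_commands(
    [
        "downloads/npn/ingest",
        "downloads/phenoObs/ingest",
        "downloads/budburst/ingest",
    ]
)
for command in commands:
    print(shlex.join(command))
```

Keeping the flag choice tied to list position avoids accidentally recreating the index partway through a multi-dataset reload.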
Refresh the source CSVs:
# NPN / NEON, from downloads/npn
cd downloads/npn
python3 fetchAndTransformNPNData.py 1956-01-01 2026-05-17 ./mappings.csv
mkdir -p ingest
mv npn_observations_1956-01-01_to_2026-05-17.csv ingest/
cd ../..
# PhenoObs, from repo root
python3 downloads/phenoObs/prepare_phenoobs.py \
--raw-root downloads/phenoObs \
--mappings downloads/phenoObs/mappings.csv \
--output downloads/phenoObs/ingest/phenoObs_observations.csv
# Budburst, from repo root
python3 downloads/budburst/fetch_budburst.py \
--output downloads/budburst/ingest/budburst_observations.csv \
--workers 3 \
--timeout 300 \
--retries 8
# SeasonWatch India, from repo root. Downloads the DwC-A only if missing.
python3 downloads/seasonwatchindia/fetch_seasonwatchindia.py \
--output downloads/seasonwatchindia/ingest/seasonwatchindia_observations.csv
# iNaturalist
# Note that we have another set of procedures to generate the iNaturalist archive not covered here
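# If inat.csv.gz is a plain gzip file rather than a zip archive, use instead: gunzip -k downloads/iNaturalist/ingest/inat.csv.gz
unzip -j downloads/iNaturalist/ingest/inat.csv.gz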
# Herbarium
# Note that we have another set of procedures to generate the herbarium archive not covered here
mkdir -p downloads/herbarium/ingest
unzip -j downloads/herbarium/herb_flower_inference_9.8.25.csv.zip 'flower_inference_formatted_edit_9.8.25.csv' -d downloads/herbarium/ingest
Run validation first:
python3 loader.py --mode=in_situ --test --no-drop-existing downloads/npn/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ --test --no-drop-existing downloads/phenoObs/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ --test --no-drop-existing downloads/budburst/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ --test --no-drop-existing downloads/seasonwatchindia/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ --test --no-drop-existing downloads/iNaturalist/ingest --batch-size 5000 --progress-every 50000
Run the real in-situ reload. Drop the index only on the first dataset:
python3 loader.py --mode=in_situ --drop-existing downloads/npn/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ --no-drop-existing downloads/phenoObs/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ --no-drop-existing downloads/budburst/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ --no-drop-existing downloads/seasonwatchindia/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ --no-drop-existing downloads/iNaturalist/ingest --batch-size 5000 --progress-every 50000
Load herbarium or machine-derived datasets after the in-situ sources. Keep using --no-drop-existing:
python3 loader.py --mode=herbarium --test --no-drop-existing downloads/herbarium/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=herbarium --no-drop-existing downloads/herbarium/ingest --batch-size 5000 --progress-every 50000
PhenoObs source preparation writes one loader-ready CSV, similar to the NPN transformer. The script auto-detects raw rawdata_PhenObs_*.csv files under downloads/phenoObs first, then falls back to data/phenoObs. The output should go into a clean loader directory that also contains transform.yaml:
cd downloads/phenoObs
python3 prepare_phenoobs.py ./mappings.csv --output ingest/phenoObs_observations.csv
cd ../..
python3 loader.py --mode=in_situ --test --no-drop-existing downloads/phenoObs/ingest --batch-size 5000 --progress-every 50000
python3 loader.py --mode=in_situ --no-drop-existing downloads/phenoObs/ingest --batch-size 5000 --progress-every 50000
From the repo root, the equivalent explicit form is:
python3 downloads/phenoObs/prepare_phenoobs.py --raw-root downloads/phenoObs --mappings downloads/phenoObs/mappings.csv --output downloads/phenoObs/ingest/phenoObs_observations.csv
Main options:
- --test: validate and simulate without writing to Elasticsearch
- --strict: reject rows with coercion or validation problems
- --drop-existing: recreate the target index before loading
- --batch-size: bulk size for indexing
- --progress-every: progress logging interval
- --no-drop-existing: keep the current index and upsert documents into it
What the loader relies on:
- data/columns.csv for schema and required fields
- data/traits.csv for trait_urn to trait and mappedTraits expansion
- optional dataset-local transform.yaml for normalization rules
Place transform.yaml inside the dataset directory being loaded. The loader checks for
<data_dir>/transform.yaml automatically. If present, the file is applied row-by-row before
datatype coercion, required-field checks, and mappedTraits derivation.
Use transform.yaml for:
- trimming and case-normalizing raw source fields
- regex-based cleanup of source strings
- mapping raw trait labels to canonical Phenobase trait labels during transition
- turning source-specific null tokens into empty values
- adjusting parsing behavior for dates, booleans, and typed fields
Supported top-level keys:
- fields: per-field transform steps and field-level overrides
- trait_mappings: exact mapping from lowercase raw trait text to canonical trait labels. Prefer trait_urn in source prep, and treat this as a compatibility tool.
- null_values: string tokens treated as null across all fields
- coercions: global parsing and fallback rules for booleans, dates, integers, floats, and text
Minimal example:
fields:
  scientificName:
    transforms:
      - op: strip
      - op: case
        rule: scientific_name_standard
  trait:
    transforms:
      - op: strip
      - op: case
        rule: lower
  year:
    datatype: integer
trait_mappings:
  breaking leaf buds: breaking leaf bud present
  flowers: flower present
null_values:
  - ""
  - na
  - n/a
  - "-"
coercions:
  text:
    case: lower
  date:
    input_formats: ["%Y-%m-%d", "%m/%d/%Y"]
    output_format: "%Y-%m-%d"
    drop_invalid: true
  boolean:
    true_values: ["true", "t", "yes", "1"]
    false_values: ["false", "f", "no", "0"]
    drop_invalid: true
  integer:
    drop_invalid: true
  float:
    drop_invalid: true
Each entry under fields may define:
- transforms: ordered transform steps applied to that field
- datatype: override datatype for coercion
- case: field-level case rule applied during text coercion
- min: numeric minimum for integer or float fields
- max: numeric maximum for integer or float fields
- input_formats: date parse formats for that field
- output_format: normalized output date format for that field
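As a rough illustration of how input_formats, output_format, min, and max could drive coercion, consider the sketch below. The function names are hypothetical and this is not the loader's implementation. In particular, treating an out-of-range number the same way as an unparseable one is an assumption, as is modelling --strict with a `strict` argument that raises instead of returning None.

```python
from datetime import datetime


def coerce_date(value, input_formats, output_format="%Y-%m-%d", strict=False):
    """Try each input format in order and return a normalized date string."""
    for fmt in input_formats:
        try:
            return datetime.strptime(value, fmt).strftime(output_format)
        except ValueError:
            continue
    if strict:
        raise ValueError(f"unparseable date: {value!r}")
    return None  # non-strict: the value is nulled


def coerce_float(value, minimum=None, maximum=None, strict=False):
    """Parse a float and apply optional bounds (assumed to null on violation)."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        number = None
    out_of_range = number is not None and (
        (minimum is not None and number < minimum)
        or (maximum is not None and number > maximum)
    )
    if number is None or out_of_range:
        if strict:
            raise ValueError(f"invalid float: {value!r}")
        return None
    return number


formats = ["%Y-%m-%d", "%m/%d/%Y"]
print(coerce_date("05/17/2026", formats))
print(coerce_date("not a date", formats))
print(coerce_float("45.25", minimum=-90, maximum=90))
print(coerce_float("91.5", minimum=-90, maximum=90))
```

Listing the most common source format first keeps the happy path cheap, since formats are tried in order.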
Supported transform steps under fields.<field>.transforms:
- op: strip
- op: case with rule: lower|upper|title|capitalize_first|scientific_name_standard
- op: regex_sub with pattern, optional replacement, and optional flags
- op: regex_map with pattern, to, and optional flags
- op: null_if_in with values
- op: map with exact values replacements
Notes on behavior:
- transform steps run in the order written
- trait_mappings is only applied to the trait field after field transforms run; it does not replace trait_urn
- trait_mappings keys should be lowercase because the loader lowercases the incoming trait before lookup
- regex_map uses regex match, so patterns should usually be anchored if you want full-value matching
- supported regex flags are IGNORECASE, MULTILINE, and DOTALL
- null_values is global, while null_if_in applies only to one field
- text.case under coercions is a global default for text fields, but field-level case overrides it
- invalid integers, floats, booleans, and dates are nulled by default unless --strict is used, in which case those coercion problems reject the row
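The ordering and anchoring notes are easier to see in code. The sketch below handles only a subset of the ops (strip, case with rule lower, regex_sub, regex_map) and accepts a single flag name per step; how the loader itself parses flags is not shown here, so treat `apply_transforms` and the `FLAGS` table as hypothetical.

```python
import re

# Flag names from the notes above mapped to Python's re constants.
FLAGS = {"IGNORECASE": re.IGNORECASE, "MULTILINE": re.MULTILINE, "DOTALL": re.DOTALL}


def apply_transforms(value, steps):
    """Apply a subset of transform steps to one string, in the order written."""
    for step in steps:
        op = step["op"]
        flags = FLAGS.get(step.get("flags", ""), 0)
        if op == "strip":
            value = value.strip()
        elif op == "case" and step.get("rule") == "lower":
            value = value.lower()
        elif op == "regex_sub":
            value = re.sub(step["pattern"], step.get("replacement", ""), value, flags=flags)
        elif op == "regex_map":
            # re.match only anchors at the start, so an unanchored pattern
            # also matches longer values that merely begin with it.
            if re.match(step["pattern"], value, flags):
                value = step["to"]
    return value


anchored = [
    {"op": "strip"},
    {"op": "regex_map", "pattern": "^open flowers?$", "to": "flower present", "flags": "IGNORECASE"},
]
unanchored = [{"op": "regex_map", "pattern": "open flower", "to": "flower present"}]

print(apply_transforms("  Open Flowers ", anchored))
print(apply_transforms("open flower buds", anchored))
print(apply_transforms("open flower buds", unanchored))
```

The last call shows the pitfall: without the trailing `$`, a value such as "open flower buds" is remapped even though it is a different observation.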
Example with more complete field transforms:
fields:
  recordedBy:
    transforms:
      - op: strip
      - op: regex_sub
        pattern: "\\s+"
        replacement: " "
  basisOfRecord:
    transforms:
      - op: strip
      - op: map
        values:
          specimen: PreservedSpecimen
          photo: HumanObservation
  date:
    input_formats: ["%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d"]
    output_format: "%Y-%m-%d"
  latitude:
    datatype: float
    min: -90
    max: 90
  longitude:
    datatype: float
    min: -180
    max: 180
  trait:
    transforms:
      - op: strip
      - op: regex_map
        pattern: "^open flowers?$"
        to: flower present
        flags: IGNORECASE
trait_mappings:
  flowering: flower present
  no flowers: flower absent
Use download_csv_dump.py to scroll through the public Phenobase query API and write a local CSV.
python3 download_csv_dump.py
Useful variations:
python3 download_csv_dump.py --query 'genus:Quercus AND year:[2000 TO 2025]' --output downloads/quercus.csv
python3 download_csv_dump.py --limit 100000
python3 download_csv_dump.py --batch-size 10000 --scroll 1m
python3 download_csv_dump.py --request-timeout 60
Use update_decade_start.py to add the mapping and backfill decadeStart on an existing live index without reloading source files.
python3 update_decade_start.py
python3 update_decade_start.py --wait
python3 update_decade_start.py --requests-per-second 200
The docs/ folder is intended for GitHub Pages publication and for quick sharing with collaborators.
Published outputs:
- docs/index.html: GitHub Pages entry point redirecting to the traits viewer
- docs/traits.html: rendered trait explorer
- docs/traits.csv: published CSV copy
- docs/traits-data.json: viewer payload
Key files:
- data/traits.csv: ontology-derived trait mapping
- data/columns.csv: field definitions and schema metadata
- trait_lookup.py: local helper for resolving PPO IDs to current labels
- loader.py: ingestion driver
- reasoning/refresh_traits.py: reasoning rebuild driver
- download_csv_dump.py: API export helper
- update_decade_start.py: live index backfill helper
Requirements:
- Python 3.8+
- Elasticsearch reachable for ingestion or maintenance commands
This repository is licensed under the MIT License. Bundled ontology snapshots and other third-party source data may remain subject to their own upstream terms.
PhenoBase Project | Biocode, LLC