Skip to content

SuLab/OKN-WOBD

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OKN-WOBD

Extract data from the NIAID Data Ecosystem Discovery Portal and convert for loading into ProtoOKN.

Python Environment Setup

Use pyenv to install the Python version needed, then create an isolated venv in the repository:

pyenv install 3.12.6
pyenv local 3.12.6
python -m venv .venv
source .venv/bin/activate     # on Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -e .

Fetching NDE Data

Run the Click CLI (installed as the okn-wobd console script) to download dataset records for one or more resources. By default, results are written to data/raw as JSON Lines (*.jsonl) with matching checkpoint files (*_state.json) to support restarts.

okn-wobd fetch --resource ImmPort
# or, equivalently:
python -m okn_wobd.cli fetch --resource ImmPort

Options

  • --all: Fetch all available resources from the NDE API (automatically discovers all Dataset Repositories). Resources listed in src/okn_wobd/excluded_resources.py are excluded.
  • --resource: Repeatable; defaults to ImmPort when omitted. Examples: ImmPort, "VDJ Server", Vivli, RADx, PDB, "Project TYCHO".
  • --output-dir: Directory for saved data and checkpoints (default: data/raw).
  • --page-size: Batch size for API pagination (default: 100, maximum: 1000).
  • --facet-size: Passed through to the API facet parameter (default: 10).
  • --restart: Ignore previous checkpoints and start from the first page.
  • --verbose: Emit detailed logging.
  • --max-window: Maximum result window before automatic segmentation (default: 10,000).
  • --segment-field, --segment-charset, --segment-max-length: Controls for prefix-based segmentation when a catalog exceeds the result window.

Restarting After Failures

The CLI records progress for each resource in <output-dir>/<resource>_state.json. Rerun the command without --restart to resume where it left off. Supply --restart to discard prior results and fetch everything again from the beginning.

Example: Fetch All Available Dataset Repository Resources

okn-wobd fetch --all

This queries the NDE API to discover all available dataset repository resources and fetches data for each one (excluding resources configured in src/okn_wobd/excluded_resources.py, such as "Protein Data Bank").

Example: Fetch Multiple Specific Resources

python -m okn_wobd.cli fetch \
  --resource ImmPort \
  --resource "VDJ Server" \
  --resource Vivli

This will create separate JSONL and checkpoint files for each resource under data/raw/.

Handling Resources with >10k Records

Elasticsearch-backed endpoints limit from + size <= 10,000. When a catalog (for example, Protein Data Bank) exceeds that window, the CLI automatically partitions requests by prefix on the identifier field. You can tune the behavior:

okn-wobd fetch \
  --resource "Protein Data Bank" \
  --max-window 10000 \
  --segment-field identifier \
  --segment-charset 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ \
  --segment-max-length 8

The state file tracks both the current segment and offset so interrupted runs can resume without re-downloading data.

Summarizing Downloaded JSONL Files

After fetching data, generate a quick overview of record counts and coverage of disease/species/infectious-agent metadata:

python scripts/summarize_jsonl.py

The script scans data/raw/*.jsonl, writes the results to reports/jsonl_summary.md, and prints the table to stdout.

Listing Top-Level Fields in JSONL Files

python scripts/list_jsonl_fields.py

This writes a field inventory to reports/jsonl_fields.md.

Converting JSONL to RDF N-Triples

Convert dataset records from JSONL format to RDF N-Triples (.nt) format for loading into FRINK:

okn-wobd convert

Options

  • --input-dir: Directory containing JSONL files (default: data/raw).
  • --output-dir: Directory to write N-Triples files (default: data/rdf).
  • --resource: Repeatable; convert specific resources. If omitted, converts all JSONL files found in input directory.

Examples

# Convert all resources
okn-wobd convert

# Convert specific resources
okn-wobd convert --resource ImmPort --resource "VDJ Server"

# Specify custom input/output directories
okn-wobd convert --input-dir data/raw --output-dir data/rdf

The converter generates one .nt file per resource in the output directory. Each dataset is assigned a URI in the https://okn.wobd.org/ namespace using the pattern https://okn.wobd.org/dataset/{resource}/{_id}.

The converter uses external URIs for shared entities:

  • Diseases: MONDO ontology URIs (e.g., http://purl.obolibrary.org/obo/MONDO_*)
  • Species: UniProt taxonomy URIs (e.g., https://www.uniprot.org/taxonomy/*)
  • Infectious Agents: UniProt taxonomy URIs
  • Organizations: ROR identifiers when available (e.g., https://ror.org/*)
  • DOIs: Converted to https://doi.org/* URIs

The converter follows Proto-OKN Best Practice Guidelines:

  • ✅ Includes RDFS axioms for Schema.org classes (rdfs:Class declarations)
  • ✅ Includes RDFS domain and range assertions for properties
  • ✅ Adds owl:sameAs mappings alongside schema:sameAs for external identifiers

Elasticsearch metadata fields (_score, _ignored, @version) are excluded from the RDF output.

Testing Competency Queries

After converting data to RDF, you can test the competency question SPARQL queries against your local data:

# Test all queries
python scripts/test_competency_queries.py

# Test a specific query
python scripts/test_competency_queries.py --query CQ2

# Show detailed results including sample data
python scripts/test_competency_queries.py --query CQ2 --verbose

Options

  • --rdf-dir: Directory containing RDF .nt files (default: data/rdf).
  • --queries-file: Markdown file with competency questions (default: docs/competency_questions.md).
  • --query: Test only a specific query (e.g., CQ2 or CQ10).
  • --verbose: Show query results and detailed error messages.

The script extracts SPARQL queries from the markdown file, loads all RDF files from the specified directory, and executes each query to verify it works correctly. This is useful for:

  • Validating queries before using them in Protege or FRINK
  • Testing query syntax and compatibility
  • Verifying queries return expected results against your data

See the documentation for more details, including competency questions and SPARQL query examples.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages