94 changes: 94 additions & 0 deletions src/data/README_seed_ontology.md
@@ -0,0 +1,94 @@
# SEED Ontology Data

This directory contains the SEED role ontology used for mapping RAST annotations to seed.role identifiers.

## Current Files

- **`seed.owl`** - SEED role ontology in OWL format with correct pubseed.theseed.org URLs
- **`seed.json`** - SEED role ontology in OBO Graphs JSON format (converted from `seed.owl` using ROBOT)

## File Generation

The `seed.json` file was generated from `seed.owl` using the following process:

1. **Source**: The `seed.owl` file contains the SEED ontology with correct pubseed.theseed.org URLs
2. **Conversion**: The OWL file was converted to JSON using ROBOT (http://robot.obolibrary.org/):
```bash
robot convert --input seed.owl --output seed.json
```

The OBO serialization of the ontology declares the `seed.role` idspace like this:
```
idspace: seed.role https://pubseed.theseed.org/RoleEditor.cgi?page=ShowRole&Role=
```

This results in JSON nodes with URL-based IDs that the mapper handles automatically.

## Future Plans

### Official Source
We are still identifying the official, versioned source for the SEED ontology files. Once it is established, the files in this directory will be updated from:
- Official SEED OWL file location (TBD)
- Official SEED OBO file location (TBD)

### Automated Updates
Future versions will implement:
1. Automated fetching from the official source
2. Version tracking and changelog
3. Conversion pipeline from OWL/OBO to JSON as needed
4. Regular updates synchronized with SEED releases

### Versioning
When the official source is established, we will:
- Track the SEED ontology version in the filename (e.g., `seed_ontology_v2024.1.json.gz`)
- Maintain a version history file
- Document any custom modifications or additions

## File Format

The JSON file follows the OBO Graphs JSON layout that ROBOT emits, with this structure:
```json
{
  "graphs": [{
    "nodes": [
      {
        "id": "https://pubseed.theseed.org/RoleEditor.cgi?page=ShowRole&Role=0000000010501",
        "lbl": "Alpha-ketoglutarate permease",
        "type": "CLASS"
      }
    ]
  }]
}
```

The mapper automatically extracts the role number from the `Role=` parameter of the URL to create clean `seed.role:XXXXXXXXXXXXX` identifiers (13 digits, e.g. `seed.role:0000000010501`).
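
As a rough illustration of that extraction step, the sketch below walks the `graphs`/`nodes` structure shown above and builds a label-to-identifier lookup. The names `role_id_from_url` and `build_label_index` are hypothetical and not taken from the mapper, whose actual implementation may differ.

```python
from __future__ import annotations

import json
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def role_id_from_url(node_id: str) -> Optional[str]:
    """Return a seed.role identifier for a URL-based node ID.

    Returns None when the ID carries no `Role` query parameter.
    (Hypothetical helper, for illustration only.)
    """
    query = parse_qs(urlsplit(node_id).query)
    roles = query.get("Role")
    return f"seed.role:{roles[0]}" if roles else None


def build_label_index(ontology: dict) -> dict[str, str]:
    """Map each class label to its seed.role identifier."""
    index: dict[str, str] = {}
    for graph in ontology.get("graphs", []):
        for node in graph.get("nodes", []):
            # Skip non-class nodes and nodes without a label.
            if node.get("type") != "CLASS" or "lbl" not in node:
                continue
            role_id = role_id_from_url(node["id"])
            if role_id is not None:
                index[node["lbl"]] = role_id
    return index


# Sample input copied from the structure documented in this README.
sample = json.loads("""
{
  "graphs": [{
    "nodes": [
      {
        "id": "https://pubseed.theseed.org/RoleEditor.cgi?page=ShowRole&Role=0000000010501",
        "lbl": "Alpha-ketoglutarate permease",
        "type": "CLASS"
      }
    ]
  }]
}
""")

print(build_label_index(sample))
# {'Alpha-ketoglutarate permease': 'seed.role:0000000010501'}
```

Parsing the query string rather than slicing off a fixed URL prefix keeps the sketch working even if the order of the `page` and `Role` parameters changes.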

## Updating the Ontology

To update the ontology file when an official source becomes available:

1. Download the latest OWL or OBO file from the official source
2. Convert to JSON using ROBOT:
```bash
# From OBO
robot convert --input seed.obo --output seed_ontology.json

# From OWL
robot convert --input seed.owl --output seed_ontology.json
```
3. Compress the file:
```bash
gzip -9 seed_ontology.json
```
4. Replace the existing file in this directory
5. Update this README with the new version information
6. Run tests to ensure compatibility:
```bash
pytest tests/test_rast_seed_mapper.py
```
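
Before running the full test suite, a quick format check of the expected identifiers can catch malformed entries early. The sketch below is an illustration rather than part of the repository's tests: it assumes rows shaped like `example_rast_annotations.csv` (columns `annotation`, `category`, `expected_seed_role_id`), and the 13-digit width is inferred from the example identifiers in this directory. `malformed_ids` is a hypothetical helper.

```python
import csv
import io
import re

# Assumption: a valid identifier is "seed.role:" followed by exactly 13 digits.
SEED_ROLE_PATTERN = re.compile(r"seed\.role:\d{13}")


def malformed_ids(csv_text):
    """Return (annotation, expected_id) pairs whose non-empty ID is malformed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = []
    for row in reader:
        expected = row["expected_seed_role_id"]
        # Empty values mean "no mapping expected" and are allowed.
        if expected and not SEED_ROLE_PATTERN.fullmatch(expected):
            bad.append((row["annotation"], expected))
    return bad


# The last row is deliberately shortened to show what a failure looks like.
sample_csv = '''annotation,category,expected_seed_role_id
"Alpha-ketoglutarate permease","simple","seed.role:0000000010501"
"ATP-dependent RNA helicase SrmB","simple",""
"Thioredoxin 2","simple","seed.role:49856"
'''

print(malformed_ids(sample_csv))
# [('Thioredoxin 2', 'seed.role:49856')]
```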

## Notes

- The compressed file is automatically decompressed on first use by the mapper
- The JSON format was chosen for fast parsing and broad compatibility
- The mapper supports both URL-based and clean ID formats for flexibility
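
The decompression behaviour in the first note can be approximated as follows. This is a minimal sketch, not the mapper's actual code: `load_ontology` is a hypothetical helper, and it reads the gzip stream directly instead of writing a decompressed copy to disk, which may differ from what the mapper does.

```python
import gzip
import json
import tempfile
from pathlib import Path


def load_ontology(path):
    """Load an ontology JSON file, transparently handling a .gz suffix."""
    path = Path(path)
    # gzip.open and the built-in open both accept text mode with an encoding.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Round-trip demonstration using a throwaway compressed file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "seed_ontology.json.gz"
    with gzip.open(target, "wt", encoding="utf-8") as handle:
        json.dump({"graphs": []}, handle)
    loaded = load_ontology(target)

print(loaded)
# {'graphs': []}
```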
21 changes: 21 additions & 0 deletions src/data/example_rast_annotations.csv
@@ -0,0 +1,21 @@
annotation,category,expected_seed_role_id
"Alpha-ketoglutarate permease","simple","seed.role:0000000010501"
"Thioredoxin 2","simple","seed.role:0000000049856"
"ATP-dependent RNA helicase SrmB","simple",""
"L-aspartate oxidase (EC 1.4.3.16)","simple",""
"Unknown function","simple","seed.role:0000000031207"
"Protein of unknown function DUF1537","simple",""
"Putative cytoplasmic protein","simple",""
"Phosphoribosylformylglycinamidine synthase, synthetase subunit (EC 6.3.5.3) / Phosphoribosylformylglycinamidine synthase, glutamine amidotransferase subunit (EC 6.3.5.3)","multi_slash",""
"GMP synthase [glutamine-hydrolyzing], amidotransferase subunit (EC 6.3.5.2) / GMP synthase [glutamine-hydrolyzing], ATP pyrophosphatase subunit (EC 6.3.5.2)","multi_slash","seed.role:0000000002981"
"Flavohemoglobin / Nitric oxide dioxygenase (EC 1.14.12.17)","multi_slash",""
"23S rRNA (adenine(2503)-C(2))-methyltransferase @ tRNA (adenine(37)-C(2))-methyltransferase (EC 2.1.1.192)","multi_at","seed.role:0000000034022"
"Uracil permease @ Uracil:proton symporter UraA","multi_at","seed.role:0000000002527"
"ATP:Cob(I)alamin adenosyltransferase (EC 2.5.1.17) @ ATP:Cob(I)alamin adenosyltransferase (EC 2.5.1.17), ethanolamine utilization","multi_at",""
"Lead, cadmium, zinc and mercury transporting ATPase (EC 3.6.3.3) (EC 3.6.3.5); Copper-translocating P-type ATPase (EC 3.6.3.4)","multi_semicolon","seed.role:0000000010083"
"UDP-sugar hydrolase (EC 3.6.1.45); 5'-nucleotidase (EC 3.1.3.5)","multi_semicolon",""
"orf; Unknown function","multi_semicolon",""
"","edge_case",""
" ","edge_case",""
"Hypothetical protein","edge_case",""
"Mobile element protein","edge_case",""
84 changes: 84 additions & 0 deletions src/data/example_rast_annotations.json
@@ -0,0 +1,84 @@
{
"description": "Example RAST annotations demonstrating various formats and edge cases",
"annotations": [
{
"category": "simple_annotations",
"examples": [
"Alpha-ketoglutarate permease",
"Thioredoxin 2",
"ATP-dependent RNA helicase SrmB",
"L-aspartate oxidase (EC 1.4.3.16)",
"Unknown function",
"Protein of unknown function DUF1537",
"Putative cytoplasmic protein"
]
},
{
"category": "multi_function_slash",
"separator": " / ",
"examples": [
"Phosphoribosylformylglycinamidine synthase, synthetase subunit (EC 6.3.5.3) / Phosphoribosylformylglycinamidine synthase, glutamine amidotransferase subunit (EC 6.3.5.3)",
"GMP synthase [glutamine-hydrolyzing], amidotransferase subunit (EC 6.3.5.2) / GMP synthase [glutamine-hydrolyzing], ATP pyrophosphatase subunit (EC 6.3.5.2)",
"Flavohemoglobin / Nitric oxide dioxygenase (EC 1.14.12.17)",
"Inosine-5'-monophosphate dehydrogenase (EC 1.1.1.205) / CBS domain",
"PTS system, N-acetylmuramic acid-specific IIB component (EC 2.7.1.192) / PTS system, N-acetylmuramic acid-specific IIC component"
]
},
{
"category": "multi_function_at",
"separator": " @ ",
"examples": [
"23S rRNA (adenine(2503)-C(2))-methyltransferase @ tRNA (adenine(37)-C(2))-methyltransferase (EC 2.1.1.192)",
"Uracil permease @ Uracil:proton symporter UraA",
"ATP:Cob(I)alamin adenosyltransferase (EC 2.5.1.17) @ ATP:Cob(I)alamin adenosyltransferase (EC 2.5.1.17), ethanolamine utilization",
"Acetaldehyde dehydrogenase (EC 1.2.1.10) @ Acetaldehyde dehydrogenase (EC 1.2.1.10), ethanolamine utilization cluster",
"2-deoxyglucose-6-phosphate hydrolase (EC 3.1.3.68) @ Mannitol-1-phosphatase (EC 3.1.3.22) @ Sorbitol-6-phosphatase (EC 3.1.3.50)"
]
},
{
"category": "multi_function_semicolon",
"separator": "; ",
"examples": [
"Competence protein F homolog, phosphoribosyltransferase domain; protein YhgH required for utilization of DNA as sole source of carbon and energy",
"Lead, cadmium, zinc and mercury transporting ATPase (EC 3.6.3.3) (EC 3.6.3.5); Copper-translocating P-type ATPase (EC 3.6.3.4)",
"UDP-sugar hydrolase (EC 3.6.1.45); 5'-nucleotidase (EC 3.1.3.5)",
"orf; Unknown function",
"Putative Dihydrolipoamide dehydrogenase (EC 1.8.1.4); Mercuric ion reductase (EC 1.16.1.1); PF00070 family, FAD-dependent NAD(P)-disulphide oxidoreductase"
]
},
{
"category": "mixed_separators",
"description": "Annotations with multiple types of separators",
"examples": [
"Function A / Function B @ Function C",
"Enzyme (EC 1.2.3.4) @ Variant 1; Enzyme (EC 1.2.3.4) @ Variant 2",
"Complex I / Complex II; Associated protein @ Regulatory subunit"
]
},
{
"category": "edge_cases",
"description": "Special cases that might be challenging",
"examples": [
"",
" ",
"Function with trailing space ",
" Function with leading space",
"Function (with) [special] {characters} <test>",
"Very long annotation string that goes on and on with multiple enzyme specifications (EC 1.2.3.4) and various subunits and domains that might exceed typical length expectations in database systems",
"Hypothetical protein",
"Mobile element protein",
"FIG00012345: hypothetical protein",
"orf, hypothetical protein"
]
}
],
"expected_mappings": {
"Alpha-ketoglutarate permease": "seed.role:0000000010501",
"Thioredoxin 2": "seed.role:0000000049856",
"Unknown function": "seed.role:0000000031207",
"23S rRNA (adenine(2503)-C(2))-methyltransferase": "seed.role:0000000034022",
"Uracil permease": "seed.role:0000000008848",
"GMP synthase [glutamine-hydrolyzing], amidotransferase subunit (EC 6.3.5.2)": "seed.role:0000000002981",
"Lead, cadmium, zinc and mercury transporting ATPase (EC 3.6.3.3) (EC 3.6.3.5)": "seed.role:0000000052456"
}
}