Add files via upload #15
Open
RefSeq API → CDM Pipeline
txt_api.py fetches genome assembly reports from the NCBI Datasets v2 API, parses metadata, and writes normalized results into Delta Lake CDM tables.
Workflow
Initialize Spark + Delta Lake
Creates target Delta database if missing.
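The initialization step could look roughly like the sketch below. It assumes the Delta Lake jars are already available to Spark; the function names `build_spark` and `ensure_database` are illustrative and are not taken from txt_api.py.

```python
def build_spark(app_name="refseq_api_to_cdm"):
    """Create a SparkSession with the Delta Lake extensions enabled."""
    # Imported lazily so ensure_database() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName(app_name)
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config(
            "spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        )
        .getOrCreate()
    )


def ensure_database(spark, database):
    """Create the target database only if it does not exist yet."""
    # Reject names that are not plain identifiers before building the SQL string.
    if not database.replace("_", "").isalnum():
        raise ValueError(f"unsafe database name: {database!r}")
    statement = f"CREATE DATABASE IF NOT EXISTS {database}"
    spark.sql(statement)
    return statement
```

Because the statement uses IF NOT EXISTS, calling `ensure_database` repeatedly is harmless.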
Fetch Reports
• Endpoint: https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/{taxid}/dataset_report
• Defaults: RefSeq only, current assemblies only
• Handles pagination (next_page_token)
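A minimal sketch of the paginated fetch, using only the standard library. The query parameter names (`filters.assembly_source`, `filters.assembly_version`, `page_size`, `page_token`) and the `reports` response key are assumptions about the Datasets v2 API and should be checked against its documentation; `fetch_reports` and `http_get_json` are hypothetical names.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/{taxid}/dataset_report"


def http_get_json(url):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def fetch_reports(taxid, get_json=http_get_json, page_size=500):
    """Yield every assembly report for one taxon, following next_page_token."""
    # Assumed filters for "RefSeq only, current assemblies only".
    params = {
        "filters.assembly_source": "refseq",
        "filters.assembly_version": "current",
        "page_size": page_size,
    }
    token = None
    while True:
        query = dict(params)
        if token:
            query["page_token"] = token
        url = BASE_URL.format(taxid=taxid) + "?" + urllib.parse.urlencode(query)
        payload = get_json(url)
        yield from payload.get("reports", [])
        # Stop when the API no longer returns a continuation token.
        token = payload.get("next_page_token")
        if not token:
            break
```

Passing `get_json` as a parameter keeps the pagination loop testable without network access.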
Extract Metadata
• Dates: first available of release > assembly > submission (with optional GenBank fallback)
• Names: assembly + organism
• Taxonomy: NCBI TaxID
• Identifiers: BioSample, BioProject, GCF (RefSeq), GCA (GenBank)
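The extraction rules above can be sketched as follows. The JSON field names (`assembly_info`, `release_date`, `assembly_date`, `submission_date`, `genbank_release_date`, `paired_accession`, and so on) are assumptions about the report layout, not confirmed by this description, and the function names are hypothetical.

```python
def pick_date(report, allow_genbank_fallback=False):
    """Return the first non-empty date in the order release > assembly > submission."""
    info = report.get("assembly_info") or {}
    # Hypothetical field names; adjust them to the actual report schema.
    candidates = [
        info.get("release_date"),
        info.get("assembly_date"),
        info.get("submission_date"),
    ]
    if allow_genbank_fallback:
        candidates.append(report.get("genbank_release_date"))
    for value in candidates:
        if value:
            return value
    return None


def extract_metadata(report, allow_genbank_fallback=False):
    """Flatten one assembly report into the fields the CDM tables need."""
    info = report.get("assembly_info") or {}
    organism = report.get("organism") or {}
    accessions = [report.get("accession") or "", report.get("paired_accession") or ""]
    return {
        "date": pick_date(report, allow_genbank_fallback),
        "assembly_name": info.get("assembly_name"),
        "organism_name": organism.get("organism_name"),
        "tax_id": organism.get("tax_id"),
        "biosample": (info.get("biosample") or {}).get("accession"),
        "bioproject": info.get("bioproject_accession"),
        # GCF_ marks the RefSeq accession, GCA_ the GenBank one.
        "gcf": next((a for a in accessions if a.startswith("GCF_")), None),
        "gca": next((a for a in accessions if a.startswith("GCA_")), None),
    }
```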
Build CDM Tables
• datasource: provenance info
• entity: CDM entity with deterministic UUIDv5
• contig_collection: links entity to TaxID
• name: organism + assembly names
• identifier: BioSample, BioProject, Taxon, GCF/GCA
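As an illustration of the deterministic UUIDv5 and the row layout, here is a sketch built on the standard `uuid` module. The namespace, the choice of the GCF accession as the UUID seed, reusing the entity ID as `collection_id`, and the shape of the `meta` dict are all assumptions made for this example.

```python
import uuid

# Assumption: any fixed namespace works; the pipeline's real namespace is not shown.
CDM_NAMESPACE = uuid.NAMESPACE_URL


def make_entity_id(accession):
    """Derive a stable CDM ID: the same accession always maps to the same UUIDv5."""
    return f"CDM:{uuid.uuid5(CDM_NAMESPACE, accession)}"


def build_rows(meta, source="RefSeq"):
    """Turn one metadata dict into contig_collection, name and identifier rows."""
    entity_id = make_entity_id(meta["gcf"])
    contig_collection = {
        "collection_id": entity_id,
        "contig_collection_type": "isolate",
        "ncbi_taxon_id": f"NCBITaxon:{meta['tax_id']}",
        "gtdb_taxon_id": None,
    }
    name_specs = [
        (meta.get("organism_name"), "RefSeq organism name"),
        (meta.get("assembly_name"), "RefSeq assembly name"),
    ]
    identifier_specs = [
        ("Biosample", meta.get("biosample"), "Biosample ID"),
        ("BioProject", meta.get("bioproject"), "BioProject ID"),
        ("NCBITaxon", meta.get("tax_id"), "NCBITaxon ID"),
        ("ncbi.assembly", meta.get("gcf"), "NCBI Assembly ID"),
        ("insdc.gca", meta.get("gca"), "GenBank Assembly ID"),
    ]
    # Rows with a missing value are skipped rather than written as nulls.
    names = [
        {"entity_id": entity_id, "name": value, "description": desc, "source": source}
        for value, desc in name_specs
        if value
    ]
    identifiers = [
        {
            "entity_id": entity_id,
            "identifier": f"{prefix}:{value}",
            "source": source,
            "description": desc,
        }
        for prefix, value, desc in identifier_specs
        if value
    ]
    return contig_collection, names, identifiers
```

Because the ID is derived from the accession alone, re-running the pipeline in append mode produces the same entity_id for the same assembly.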
Write to Delta
Tables are written under the schema named by --database; --mode selects append or overwrite.
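A sketch of the write step, assuming `df` is a PySpark DataFrame and the session has Delta Lake configured. `write_table` is a hypothetical helper name.

```python
VALID_MODES = ("append", "overwrite")


def write_table(df, database, table, mode="append"):
    """Save a DataFrame as a Delta table named <database>.<table>."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    full_name = f"{database}.{table}"
    # DataFrameWriter chain: storage format, save mode, then the managed table name.
    df.write.format("delta").mode(mode).saveAsTable(full_name)
    return full_name
```

Validating the mode up front avoids passing other Spark save modes (such as "ignore") that this description does not mention.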
Preview Results
First N rows printed if table exists.
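The preview could be implemented along these lines. The sketch assumes PySpark 3.3 or later, where `spark.catalog.tableExists` is available in the Python API; `preview_table` is a hypothetical name.

```python
def preview_table(spark, database, table, limit=10):
    """Print the first `limit` rows of a table, or skip it if the table is missing."""
    full_name = f"{database}.{table}"
    if not spark.catalog.tableExists(full_name):
        print(f"Table {full_name} not found; skipping preview")
        return False
    spark.table(full_name).limit(limit).show(truncate=False)
    return True
```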
Output Tables
CDM table
datasource
CDM table
entity
CDM table
contig_collection
collection_id: CDM:00000000-0000-0000-0000-000000000000
contig_collection_type: isolate
ncbi_taxon_id: NCBITaxon:224325
gtdb_taxon_id: null
CDM table
name
entity_id: CDM:00000000-0000-0000-1234567
name: Archaeoglobus fulgidus DSM 4304
description: RefSeq organism name
source: RefSeq
CDM table
name
entity_id: CDM:00000000-0000-0000-1234567
name: ASM866v1
description: RefSeq assembly name
source: RefSeq
CDM table
identifier
entity_id: CDM:00000000-0000-0000-1234567
identifier: Biosample:SAMN02603985 (note: no space after the colon)
reference regular expression: https://bioregistry.io/registry/biosample (not kept in the table)
source: RefSeq
description: Biosample ID
CDM table
identifier
entity_id: CDM:00000000-0000-0000-1234567
identifier: BioProject:PRJNA104 (note: no space after the colon)
reference regular expression: https://bioregistry.io/registry/bioproject (not kept in the table)
source: RefSeq
description: BioProject ID
CDM table
identifier
entity_id: CDM:00000000-0000-0000-1234567
identifier: NCBITaxon:224325 (note: no space after the colon)
reference regular expression: https://bioregistry.io/registry/ncbitaxon (not kept in the table)
source: RefSeq
description: NCBITaxon ID
CDM table
identifier
entity_id: CDM:00000000-0000-0000-1234567
identifier: ncbi.assembly:GCF_000008665.1 (note: no space after the colon)
reference regular expression: https://bioregistry.io/registry/ncbi.assembly (not kept in the table)
source: RefSeq
description: NCBI Assembly ID
CDM table
identifier
entity_id: CDM:00000000-0000-0000-1234567
identifier: insdc.gca:GCA_000008665.1 (note: no space after the colon)
reference regular expression: https://bioregistry.io/registry/insdc.gca (not kept in the table)
source: RefSeq
description: GenBank Assembly ID
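The identifier format shown in the rows above (prefix, colon, local ID, no whitespace) can be checked with a small validator. The regular expressions below are illustrative approximations written for this sketch; they are assumptions and should be replaced with the patterns published on the Bioregistry pages linked above.

```python
import re

# Assumed patterns, keyed by lowercase prefix; verify against Bioregistry before use.
PATTERNS = {
    "biosample": re.compile(r"^SAM[NED]\w?\d+$"),
    "bioproject": re.compile(r"^PRJ[DEN][A-Z]\d+$"),
    "ncbitaxon": re.compile(r"^\d+$"),
    "ncbi.assembly": re.compile(r"^GC[AF]_\d{9}\.\d+$"),
    "insdc.gca": re.compile(r"^GCA_\d{9}(\.\d+)?$"),
}


def is_valid_identifier(curie):
    """Check 'prefix:local_id' with no whitespace and a local ID matching its pattern."""
    if ":" not in curie or any(ch.isspace() for ch in curie):
        return False
    prefix, local_id = curie.split(":", 1)
    # Prefix lookup is case-insensitive so "Biosample" and "biosample" both resolve.
    pattern = PATTERNS.get(prefix.lower())
    return bool(pattern and pattern.match(local_id))
```

For example, `is_valid_identifier("Biosample: SAMN02603985")` is rejected because of the space after the colon.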
Examples
--taxid "224325,2741724,193567" --database refseq_api --mode overwrite
--taxid "224325" --database refseq_api --mode append --debug
--taxid "224325,2741724" --database refseq_api --unique-per-taxon
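The flags used in the examples above could be declared with `argparse` roughly as follows. The defaults and help texts are assumptions, and the meaning given for `--unique-per-taxon` is a guess from its name rather than something this description states.

```python
import argparse


def parse_taxids(value):
    """Split a comma-separated string such as '224325,2741724' into a list of TaxIDs."""
    taxids = [part.strip() for part in value.split(",") if part.strip()]
    if not taxids or not all(t.isdigit() for t in taxids):
        raise argparse.ArgumentTypeError(f"invalid taxid list: {value!r}")
    return taxids


def build_parser():
    parser = argparse.ArgumentParser(description="RefSeq API to CDM pipeline (sketch)")
    parser.add_argument("--taxid", required=True, type=parse_taxids,
                        help="comma-separated NCBI TaxIDs")
    parser.add_argument("--database", default="refseq_api",
                        help="target Delta database (assumed default)")
    parser.add_argument("--mode", choices=["append", "overwrite"], default="append",
                        help="write mode (assumed default: append)")
    parser.add_argument("--debug", action="store_true",
                        help="enable verbose output")
    parser.add_argument("--unique-per-taxon", action="store_true",
                        help="assumed meaning: keep a single assembly per taxon")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```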
Note