alinakbase
Collaborator

RefSeq API → CDM Pipeline

txt_api.py fetches genome assembly reports from the NCBI Datasets v2 API, parses their metadata, and writes normalized results into Delta Lake CDM tables.

Workflow

Initialize Spark + Delta Lake
Creates the target Delta database if it is missing.

Fetch Reports
• Endpoint: https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/{taxid}/dataset_report
• Defaults: RefSeq only, current assemblies only
• Handles pagination (next_page_token)
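
The pagination step could be sketched as below. This is a minimal illustration, not the code in this PR: the query parameter names (`filters.assembly_source`, `filters.assembly_version`, `page_size`, `page_token`), the `reports` response key, and the helper names `http_get_json` and `iter_reports` are assumptions. The fetch function is injectable so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/{taxid}/dataset_report"


def http_get_json(url):
    """Fetch a URL and decode its JSON body (performs a real network call)."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.load(response)


def iter_reports(taxid, get_json=http_get_json, page_size=500):
    """Yield every report for one taxon, following next_page_token."""
    # Assumed parameter names for the defaults described above
    # (RefSeq only, current assemblies only).
    base_params = {
        "filters.assembly_source": "refseq",
        "filters.assembly_version": "current",
        "page_size": page_size,
    }
    token = None
    while True:
        params = dict(base_params)
        if token:
            params["page_token"] = token  # assumed name of the request token
        url = BASE_URL.format(taxid=taxid) + "?" + urllib.parse.urlencode(params)
        payload = get_json(url)
        yield from payload.get("reports", [])
        token = payload.get("next_page_token")
        if not token:
            break  # no further pages
```

Calling `list(iter_reports("224325"))` would collect all pages for one TaxID; passing a stub as `get_json` keeps tests offline.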

Extract Metadata
• Dates: chosen in priority order release > assembly > submission (with optional GenBank fallback)
• Names: assembly + organism
• Taxonomy: NCBI TaxID
• Identifiers: BioSample, BioProject, GCF (RefSeq), GCA (GenBank)
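
A rough sketch of the date priority rule follows. The key names (`release_date`, `assembly_date`, `submission_date`) and the function `pick_date` are hypothetical stand-ins, and the sketch assumes the GenBank record is consulted only when the RefSeq record has no usable date; the actual parser may differ.

```python
# Hypothetical key names, listed from highest to lowest priority.
DATE_PRIORITY = ("release_date", "assembly_date", "submission_date")


def pick_date(assembly_info, genbank_info=None):
    """Return the first non-empty date by priority, or None if none is found."""
    # Check the primary (RefSeq) record first, then the optional GenBank fallback.
    for source in (assembly_info, genbank_info):
        if not source:
            continue
        for key in DATE_PRIORITY:
            value = source.get(key)
            if value:
                return value
    return None
```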

Build CDM Tables
• datasource: provenance info
• entity: CDM entity with deterministic UUIDv5
• contig_collection: links entity to TaxID
• name: organism + assembly names
• identifier: BioSample, BioProject, Taxon, GCF/GCA
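
To illustrate how a deterministic UUIDv5 gives stable entity IDs, here is a minimal sketch. The namespace URL and the choice of the assembly accession as the hashed name are assumptions made for the example; the PR may derive its IDs from different inputs.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
CDM_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/cdm")


def entity_id_for(accession):
    """Map an accession to a stable CDM entity ID of the form CDM:<uuid5>."""
    # uuid5 hashes (namespace, name), so equal inputs always give the same UUID.
    return f"CDM:{uuid.uuid5(CDM_NAMESPACE, accession)}"
```

Because the ID depends only on the namespace and the input string, re-running the pipeline on the same assembly yields the same `entity_id`, which is what makes append runs and deduplication predictable.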

Write to Delta
Tables are written under the schema named by --database; append and overwrite modes are supported.
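
A sketch of the database creation and write steps, assuming a Spark session already configured for Delta Lake. The helper names `ensure_database` and `write_delta` are hypothetical; the calls they make (`spark.sql`, `df.write.format("delta").mode(...).saveAsTable(...)`) are standard PySpark.

```python
VALID_MODES = {"append", "overwrite"}


def ensure_database(spark, database):
    """Create the target database if it does not exist yet."""
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {database}")


def write_delta(df, database, table, mode="append"):
    """Write a DataFrame to <database>.<table> as a Delta table."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    # Validate the mode up front so a typo fails before any data is written.
    df.write.format("delta").mode(mode).saveAsTable(f"{database}.{table}")
```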

Preview Results
The first N rows are printed if the table exists.

Output Tables

CDM table datasource

name: RefSeq
source: NCBI RefSeq
url: https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/
accessed: <today’s date>
version: 231 

CDM table entity

entity_id: CDM:00000000-0000-0000-0000-000000000000
entity_type: contig_collection
data_source: RefSeq
created: 2000-12-01
updated: 2025-09-03T12:34:56

CDM table contig_collection

collection_id: CDM:00000000-0000-0000-0000-000000000000
contig_collection_type: isolate
ncbi_taxon_id: NCBITaxon:224325
gtdb_taxon_id: null

CDM table name

entity_id: CDM:00000000-0000-0000-1234567
name: Archaeoglobus fulgidus DSM 4304
description: RefSeq organism name
source: RefSeq

CDM table name

entity_id: CDM:00000000-0000-0000-1234567
name: ASM866v1
description: RefSeq assembly name
source: RefSeq

CDM table identifier

entity_id: CDM:00000000-0000-0000-1234567
identifier: Biosample:SAMN02603985 (note: no space after the colon)

Regular expression reference: https://bioregistry.io/registry/biosample (reference only; not kept in the table)

source: RefSeq
description: Biosample ID

CDM table identifier

entity_id: CDM:00000000-0000-0000-1234567
identifier: BioProject:PRJNA104 (note: no space after the colon)

Regular expression reference: https://bioregistry.io/registry/bioproject (reference only; not kept in the table)

source: RefSeq
description: BioProject ID

CDM table identifier

entity_id: CDM:00000000-0000-0000-1234567
identifier: NCBITaxon:224325 (note: no space after the colon)

Regular expression reference: https://bioregistry.io/registry/ncbitaxon (reference only; not kept in the table)

source: RefSeq
description: NCBITaxon ID

CDM table identifier

entity_id: CDM:00000000-0000-0000-1234567
identifier: ncbi.assembly:GCF_000008665.1 (note: no space after the colon)

Regular expression reference: https://bioregistry.io/registry/ncbi.assembly (reference only; not kept in the table)

source: RefSeq
description: NCBI Assembly ID

CDM table identifier

entity_id: CDM:00000000-0000-0000-1234567
identifier: insdc.gca:GCA_000008665.1 (note: no space after the colon)

Regular expression reference: https://bioregistry.io/registry/insdc.gca (reference only; not kept in the table)

source: RefSeq
description: GenBank Assembly ID
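
The prefix:ID convention shown in the identifier rows above could be enforced with a small helper like the one below. The regular expressions here are simplified approximations written for this example, not copies of the Bioregistry patterns linked above, and `make_curie` is a hypothetical name.

```python
import re

# Illustrative patterns only; consult the Bioregistry pages for the real ones.
PATTERNS = {
    "Biosample": re.compile(r"^SAM[NED]\w?\d+$"),
    "BioProject": re.compile(r"^PRJ[DEN][A-Z]\d+$"),
    "NCBITaxon": re.compile(r"^\d+$"),
    "ncbi.assembly": re.compile(r"^GC[AF]_\d{9}(\.\d+)?$"),
    "insdc.gca": re.compile(r"^GCA_\d{9}(\.\d+)?$"),
}


def make_curie(prefix, local_id):
    """Return 'prefix:local_id' with no space, after validating the local ID."""
    local_id = local_id.strip()
    pattern = PATTERNS.get(prefix)
    if pattern is None:
        raise KeyError(f"unknown prefix: {prefix!r}")
    if not pattern.match(local_id):
        raise ValueError(f"{local_id!r} does not look like a {prefix} ID")
    return f"{prefix}:{local_id}"
```

Validating before joining means a GCF accession cannot end up under the `insdc.gca` prefix by accident.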

Examples

•	Multiple taxa, overwrite

python txt_api.py \
  --taxid "224325,2741724,193567" \
  --database refseq_api \
  --mode overwrite

•	Single taxon, append with debug
python txt_api.py \
  --taxid "224325" \
  --database refseq_api \
  --mode append \
  --debug

•	Unique assembly per taxon
python txt_api.py \
  --taxid "224325,2741724" \
  --database refseq_api \
  --unique-per-taxon
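
For readers who want to see how these flags might be declared, here is a hedged argparse sketch. The flag names come from the examples above; the defaults, the `required` settings, the help text, and the helper `parse_taxids` are assumptions and may not match the script.

```python
import argparse


def build_parser():
    """Build a parser for the flags used in the examples above."""
    parser = argparse.ArgumentParser(prog="txt_api.py")
    parser.add_argument("--taxid", required=True,
                        help="comma-separated NCBI TaxIDs, e.g. '224325,2741724'")
    parser.add_argument("--database", required=True,
                        help="target Delta database (schema) name")
    # Assumed default: the third example omits --mode, so some default must exist.
    parser.add_argument("--mode", choices=["append", "overwrite"], default="append")
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--unique-per-taxon", action="store_true",
                        help="keep a single assembly per taxon")
    return parser


def parse_taxids(raw):
    """Split the --taxid value into a clean list of TaxID strings."""
    return [part.strip() for part in raw.split(",") if part.strip()]
```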

Notes

•	Defaults: RefSeq only, current assemblies only
•	Deterministic UUIDv5 for stable entity IDs across runs
•	Deduplication at entity, name, and identifier levels
•	Tables produced: datasource, entity, contig_collection, name, identifier
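
The deduplication note can be pictured with a plain-Python sketch that keeps the first row seen for each key. The function `dedupe` and the choice of key columns are illustrative assumptions; on Spark DataFrames the equivalent built-in would be `dropDuplicates` on the same columns.

```python
def dedupe(rows, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Example: identifier rows are unique per (entity_id, identifier) pair.
rows = [
    {"entity_id": "CDM:a", "identifier": "NCBITaxon:224325"},
    {"entity_id": "CDM:a", "identifier": "NCBITaxon:224325"},
    {"entity_id": "CDM:a", "identifier": "BioProject:PRJNA104"},
]
unique_rows = dedupe(rows, ["entity_id", "identifier"])
```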

codecov bot commented Sep 3, 2025

Codecov Report

❌ Patch coverage is 0% with 382 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.21%. Comparing base (dd1b528) to head (239250a).

Files with missing lines | Patch % | Lines
src/parsers/refseq_api.py | 0.00% | 382 Missing ⚠️
Additional details and impacted files

@@             Coverage Diff             @@
##             main      #15       +/-   ##
===========================================
- Coverage   87.26%   53.21%   -34.06%     
===========================================
  Files           9       10        +1     
  Lines         597      979      +382     
===========================================
  Hits          521      521               
- Misses         76      458      +382     
Files with missing lines | Coverage Δ
src/parsers/refseq_api.py | 0.00% <0.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

# -------- Special schema for problematic table --------
if table == "contig_collection":
    schema = StructType([
        StructField("collection_id", StringType(), True),
Collaborator

this should be contig_collection_id
