Add files via upload #15

alinakbase · 2025-09-03T18:01:43Z

RefSeq API → CDM Pipeline

txt_api.py fetches genome assembly reports from the NCBI Datasets v2 API, parses their metadata, and writes the normalized results to Delta Lake CDM tables.

Workflow

Initialize Spark + Delta Lake
Creates target Delta database if missing.

Fetch Reports
• Endpoint: https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/{taxid}/dataset_report
• Defaults: RefSeq only, current assemblies only
• Handles pagination (next_page_token)
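
The pagination step can be sketched as below. This is a minimal illustration, not the code in txt_api.py: the query parameter names (`filters.assembly_source`, `filters.assembly_version`, `page_size`, `page_token`), the `reports` response key, and the helper names are assumptions to check against the NCBI Datasets documentation. Only the endpoint and `next_page_token` come from the description above.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional

BASE_URL = "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/{taxid}/dataset_report"


def build_url(taxid: str, page_token: Optional[str] = None, page_size: int = 100) -> str:
    """Build the URL for one page of reports (query parameter names are assumptions)."""
    params = {
        "filters.assembly_source": "refseq",    # assumed spelling of "RefSeq only"
        "filters.assembly_version": "current",  # assumed spelling of "current assemblies only"
        "page_size": str(page_size),
    }
    if page_token:
        params["page_token"] = page_token
    path = BASE_URL.format(taxid=urllib.parse.quote(str(taxid)))
    return path + "?" + urllib.parse.urlencode(params)


def http_get_json(url: str) -> dict:
    """Default fetcher: GET the URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def iter_reports(taxid: str, get_json: Callable[[str], dict] = http_get_json) -> Iterator[dict]:
    """Yield every report for a taxon, following next_page_token until it is absent."""
    token = None
    while True:
        page = get_json(build_url(taxid, token))
        yield from page.get("reports", [])  # "reports" key is an assumption
        token = page.get("next_page_token")
        if not token:
            break
```

Passing the fetcher in as `get_json` keeps the loop testable without network access, which may help with the patch coverage reported below.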

Extract Metadata
• Dates: release > assembly > submission (with optional GenBank fallback)
• Names: assembly + organism
• Taxonomy: NCBI TaxID
• Identifiers: BioSample, BioProject, GCF (RefSeq), GCA (GenBank)
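
The date priority (release > assembly > submission, then the optional GenBank fallback) could look like the sketch below. The field names and the idea that the fallback is a second dictionary with the same keys are hypothetical; the real report structure may differ.

```python
from typing import Optional

# Hypothetical field names, listed in priority order: release > assembly > submission.
DATE_FIELDS = ("release_date", "assembly_date", "submission_date")


def pick_date(info: dict, genbank_info: Optional[dict] = None) -> Optional[str]:
    """Return the first non-empty date by priority, then try the GenBank fallback."""
    for source in (info, genbank_info or {}):
        for field in DATE_FIELDS:
            value = source.get(field)
            if value:
                return value
    return None
```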

Build CDM Tables
• datasource: provenance info
• entity: CDM entity with deterministic UUIDv5
• contig_collection: links entity to TaxID
• name: organism + assembly names
• identifier: BioSample, BioProject, Taxon, GCF/GCA
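
A deterministic UUIDv5 entity ID can be derived with the standard library, as in this sketch. The namespace and the choice of the assembly accession as the hashed key are assumptions; the script may use different inputs.

```python
import uuid

# Assumption: a fixed namespace derived from the API URL; any constant UUID would work.
CDM_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL, "https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/"
)


def entity_id_for(accession: str) -> str:
    """Return a stable CDM entity ID for an accession (same input, same ID on every run)."""
    return f"CDM:{uuid.uuid5(CDM_NAMESPACE, accession.strip())}"
```

Because UUIDv5 is a hash of namespace plus name, re-running the pipeline in append mode yields the same entity_id for the same assembly, which is what makes deduplication by ID possible.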

Write to Delta
Tables are written under the schema named by --database; both append and overwrite modes are supported.

Preview Results
First N rows printed if table exists.

Output Tables

CDM table `datasource`

name: RefSeq
source: NCBI RefSeq
url: https://api.ncbi.nlm.nih.gov/datasets/v2/genome/taxon/
accessed: <today’s date>
version: 231

CDM table `entity`

entity_id: CDM:00000000-0000-0000-0000-000000000000
entity_type: contig_collection
data_source: RefSeq
created: 2000-12-01
updated: 2025-09-03T12:34:56

CDM table `contig_collection`

collection_id: CDM:00000000-0000-0000-0000-000000000000
contig_collection_type: isolate
ncbi_taxon_id: NCBITaxon:224325
gtdb_taxon_id: null

CDM table `name`

entity_id: CDM:00000000-0000-0000-1234567
name: Archaeoglobus fulgidus DSM 4304
description: RefSeq organism name
source: RefSeq

CDM table `name`

entity_id: CDM:00000000-0000-0000-1234567
name: ASM866v1
description: RefSeq assembly name
source: RefSeq

CDM table `identifier`

entity_id: CDM:00000000-0000-0000-1234567
identifier: Biosample:SAMN02603985 (note: No space)

Reference regular expression: https://bioregistry.io/registry/biosample (for reference only; not kept)

source: RefSeq
description: Biosample ID

CDM table `identifier`

entity_id: CDM:00000000-0000-0000-1234567
identifier: BioProject:PRJNA104 (note: No space)

Reference regular expression: https://bioregistry.io/registry/bioproject (for reference only; not kept)

source: RefSeq
description: BioProject ID

CDM table `identifier`

entity_id: CDM:00000000-0000-0000-1234567
identifier: NCBITaxon:224325 (note: No space)

Reference regular expression: https://bioregistry.io/registry/ncbitaxon (for reference only; not kept)

source: RefSeq
description: NCBITaxon ID

CDM table `identifier`

entity_id: CDM:00000000-0000-0000-1234567
identifier: ncbi.assembly:GCF_000008665.1 (note: No space)

Reference regular expression: https://bioregistry.io/registry/ncbi.assembly (for reference only; not kept)

source: RefSeq
description: NCBI Assembly ID

CDM table `identifier`

entity_id: CDM:00000000-0000-0000-1234567
identifier: insdc.gca:GCA_000008665.1 (note: No space)

Reference regular expression: https://bioregistry.io/registry/insdc.gca (for reference only; not kept)

source: RefSeq
description: GenBank Assembly ID
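
The "no space" rule for identifiers can be enforced when the CURIE is built. The sketch below is illustrative only: the regular expressions are approximations written for this example, and the authoritative patterns are the ones on the Bioregistry pages linked above. The function name `make_curie` is hypothetical.

```python
import re

# Illustrative patterns only; consult the Bioregistry entries for the authoritative ones.
PATTERNS = {
    "Biosample": re.compile(r"^SAM[NED]\w?\d+$"),
    "BioProject": re.compile(r"^PRJ[DEN][A-Z]\d+$"),
    "NCBITaxon": re.compile(r"^\d+$"),
    "ncbi.assembly": re.compile(r"^GC[AF]_\d{9}(\.\d+)?$"),
    "insdc.gca": re.compile(r"^GCA_\d{9}(\.\d+)?$"),
}


def make_curie(prefix: str, local_id: str) -> str:
    """Join prefix and ID as 'prefix:id' with no space, after validating the ID."""
    local_id = local_id.strip()
    if not PATTERNS[prefix].match(local_id):
        raise ValueError(f"{local_id!r} does not look like a {prefix} identifier")
    return f"{prefix}:{local_id}"
```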

Examples

• Multiple taxa, overwrite

python txt_api.py \
  --taxid "224325,2741724,193567" \
  --database refseq_api \
  --mode overwrite

• Single taxon, append with debug

python txt_api.py \
  --taxid "224325" \
  --database refseq_api \
  --mode append \
  --debug

• Unique assembly per taxon

python txt_api.py \
  --taxid "224325,2741724" \
  --database refseq_api \
  --unique-per-taxon
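
For reference, the flags used in these examples could be parsed as in the sketch below. The flag names come from the commands above; the defaults, the help text, and the helper names are assumptions (in particular, the default for --mode is a guess, since the third example omits it).

```python
import argparse
from typing import List


def parse_taxids(value: str) -> List[str]:
    """Split a comma-separated TaxID string and check that each ID is numeric."""
    ids = [t.strip() for t in value.split(",") if t.strip()]
    if not ids:
        raise argparse.ArgumentTypeError("at least one TaxID is required")
    for t in ids:
        if not t.isdigit():
            raise argparse.ArgumentTypeError(f"not a numeric TaxID: {t!r}")
    return ids


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="txt_api.py")
    parser.add_argument("--taxid", required=True, type=parse_taxids,
                        help="comma-separated NCBI TaxIDs")
    parser.add_argument("--database", required=True,
                        help="target Delta database (schema)")
    parser.add_argument("--mode", choices=("append", "overwrite"),
                        default="append")  # assumed default
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--unique-per-taxon", action="store_true",
                        help="keep one assembly per taxon")
    return parser
```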

Note

•	Defaults: RefSeq only, current assemblies only
•	Deterministic UUIDv5 for stable entity IDs across runs
•	Deduplication at entity, name, and identifier levels
•	Tables produced: datasource, entity, contig_collection, name, identifier
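
Deduplication at each level amounts to keeping the first row per key. The plain-Python sketch below shows the idea; the key columns in the usage example are assumptions, and in a Spark pipeline `DataFrame.dropDuplicates` with a column subset could serve the same purpose.

```python
from typing import Iterable, List, Sequence


def dedupe(rows: Iterable[dict], key_fields: Sequence[str]) -> List[dict]:
    """Keep the first row seen for each combination of key_fields, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Example: identifier rows are treated as duplicates when entity_id and identifier match.
rows = [
    {"entity_id": "CDM:1", "identifier": "NCBITaxon:224325"},
    {"entity_id": "CDM:1", "identifier": "NCBITaxon:224325"},
    {"entity_id": "CDM:1", "identifier": "BioProject:PRJNA104"},
]
unique_rows = dedupe(rows, ("entity_id", "identifier"))
```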

codecov · 2025-09-03T18:02:19Z

Codecov Report

❌ Patch coverage is 0% with 382 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.21%. Comparing base (dd1b528) to head (239250a).

Files with missing lines	Patch %	Lines
src/parsers/refseq_api.py	0.00%	382 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #15       +/-   ##
===========================================
- Coverage   87.26%   53.21%   -34.06%     
===========================================
  Files           9       10        +1     
  Lines         597      979      +382     
===========================================
  Hits          521      521               
- Misses         76      458      +382

Files with missing lines	Coverage Δ
src/parsers/refseq_api.py	`0.00% <0.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

ialarmedalien · 2025-09-05T17:01:34Z

src/parsers/refseq_api.py

+    # -------- Special schema for problematic table --------
+    if table == "contig_collection":
+        schema = StructType([
+            StructField("collection_id", StringType(), True),


this should be contig_collection_id

Add files via upload (239250a)

alinakbase and others added 2 commits September 3, 2025 11:48

Add files via upload (13e7df3)
Merge branch 'main' into alinakbase-patch-1 (6791039)
