Changes from all commits
50 commits
881e3eb
Merge pull request #4 from NPLinker/dev
liannette Mar 18, 2025
2430868
Refactor: Simplify Path handling
liannette Mar 18, 2025
1f07246
Refactor: Improve extract path handling to ensure that non-empty dirs…
liannette Mar 18, 2025
8d7238c
Refactor: Extract file cleanup logic into a separate function for bet…
liannette Mar 18, 2025
d5e64eb
Refactor: Move extract_path preparation logic into a seperate functio…
liannette Mar 18, 2025
66f13d6
Refactor: Separate genome assembly resolution and antiSMASH data retr…
liannette Mar 18, 2025
706f841
Refactor: Improve genome ID handling and logging
liannette Mar 18, 2025
b25506e
Refactor: Move logging for antiSMASH data retrieval errors and succes…
liannette Mar 18, 2025
2ed8ac6
simplify comment
liannette Mar 18, 2025
a1e4962
Enhance logging messages in antiSMASH data retrieval
liannette Mar 18, 2025
e8dd391
test: adapt test to changed logging info message
liannette Mar 18, 2025
ea159ca
Feat: Add antiSMASH API functionality
liannette Mar 19, 2025
a972a70
Feat: Add antiSMASH API functionality
liannette Mar 19, 2025
0569976
Merge branch 'feature/antismash-jobs' of https://github.com/liannette…
liannette Mar 19, 2025
f3a159f
fix: improve logging message for start of antiSMASH API process
liannette Mar 19, 2025
da4da3c
docs: improve docstring for download_and_extract_ncbi_genome function
liannette Mar 19, 2025
af094e4
fix: update logging messages for antiSMASH data retrieval failures
liannette Mar 19, 2025
99d99a2
add logging after antiSMASH job submission
liannette Mar 19, 2025
1c83934
refactor: rename refseq_id to genome_assembly_acc
liannette Mar 19, 2025
496db38
improve genome download process with validation and retry logic
liannette Mar 19, 2025
cc6573c
test: add unit tests for download_and_extract_ncbi_genome function
liannette Mar 19, 2025
51e4817
refactor: rename verify_ncbi_dataset_md5_sums to _verify_ncbi_dataset…
liannette Mar 19, 2025
6b564ea
refactor: move _verify_ncbi_dataset_md5_sums function to a new locati…
liannette Mar 19, 2025
7570852
feat: handle already download antiSMASH results
liannette Mar 19, 2025
9881254
fix mistake in docstring
liannette Mar 19, 2025
0e1ec54
fix: update return type of submit_antismash_job to str
liannette Mar 19, 2025
f53f1a7
fix: update return type of download_and_extract_ncbi_genome to Path
liannette Mar 19, 2025
5e5b2bd
update submit_antismash_job to return job ID as string and improve er…
liannette Mar 19, 2025
909ca63
chore: add types-requests to development dependencies
liannette Mar 19, 2025
252a177
fix: update return type of _verify_ncbi_dataset_md5_sums to None and …
liannette Mar 19, 2025
81a0102
fix: update _verify_ncbi_dataset_md5_sums to accept str or PathLike f…
liannette Mar 19, 2025
50890a7
fix: convert extract_path to Path in _prepare_extract_path for consis…
liannette Mar 19, 2025
b784476
chore: update typing dependencies in format-typing-check workflow
liannette Mar 19, 2025
bdebd96
fix: clarify return value documentation for submit_antismash_job func…
liannette Mar 19, 2025
7334c90
fix: enable postponed evaluation of type annotations in ncbi_download…
liannette Mar 19, 2025
2f0d074
feat: add genome accession resolver for NCBI assembly accessions
liannette Mar 20, 2025
5d13fea
test: add unit tests for genome accession resolver functions
liannette Mar 20, 2025
5a201cc
fix: ensure no mypy typing errors
liannette Mar 20, 2025
a8ed146
feat: use the new genome ID resolver
liannette Mar 20, 2025
4ba85d0
refactor: change resolved_refseq_id to resolved_id
liannette Mar 20, 2025
398d700
fix: skip antiSMASH DB retrieval for non refseq ids
liannette Mar 20, 2025
a2e6071
refactor: change resolve_attempted to failed_previously in GenomeStatus
liannette Mar 20, 2025
8387650
feat: save updated genome status to json after each genome
liannette Mar 20, 2025
3779815
fix: assert failed_previously is False in caching test
liannette Mar 20, 2025
a579945
refactor: remove unneccessary if statement
liannette Mar 20, 2025
092c94a
update type hint for genome_id_data to comply with mypy
liannette Mar 21, 2025
47f5618
check if BGC data already downloaded
liannette Mar 21, 2025
91020c9
fix: correct spelling of antiSMASH in logging and comments
liannette Mar 21, 2025
bf6b6c5
fix: add bgc_path to genome status to ensure correct extraction path
liannette Mar 21, 2025
f197fef
refactor: use original genome ID from the genome status object for co…
liannette Mar 21, 2025
4 changes: 2 additions & 2 deletions .github/workflows/format-typing-check.yml
@@ -37,8 +37,8 @@ jobs:
       - name: Install ruff and mypy
         run: |
           pip install ruff mypy typing_extensions \
-            types-Deprecated types-beautifulsoup4 types-jsonschema \
-            types-networkx types-tabulate types-PyYAML pandas-stubs
+            types-Deprecated types-beautifulsoup4 types-jsonschema types-requests \
+            types-networkx types-tabulate types-PyYAML pandas-stubs
       - name: Get all changed python files
         id: changed-python-files
         uses: tj-actions/changed-files@v44
1 change: 1 addition & 0 deletions pyproject.toml
@@ -58,6 +58,7 @@ dev = [
     "mypy",
     "typing_extensions",
     # stub packages. Update the `format-typing-check.yml` too if you add more.
+    "types-requests",
     "types-beautifulsoup4",
     "types-jsonschema",
     "types-networkx",
16 changes: 14 additions & 2 deletions src/nplinker/genomics/antismash/__init__.py
@@ -1,16 +1,28 @@
-from .antismash_downloader import download_and_extract_antismash_data
+from .antismash_api_client import antismash_job_is_done
+from .antismash_api_client import submit_antismash_job
+from .antismash_downloader import download_and_extract_from_antismash_api
+from .antismash_downloader import download_and_extract_from_antismash_db
+from .antismash_downloader import extract_antismash_data
 from .antismash_loader import AntismashBGCLoader
 from .antismash_loader import parse_bgc_genbank
+from .genome_accession_resolver import resolve_genome_accession
+from .ncbi_downloader import download_and_extract_ncbi_genome
 from .podp_antismash_downloader import GenomeStatus
 from .podp_antismash_downloader import get_best_available_genome_id
 from .podp_antismash_downloader import podp_download_and_extract_antismash_data


 __all__ = [
-    "download_and_extract_antismash_data",
+    "extract_antismash_data",
+    "resolve_genome_accession",
+    "download_and_extract_from_antismash_api",
+    "download_and_extract_from_antismash_db",
     "AntismashBGCLoader",
     "parse_bgc_genbank",
     "GenomeStatus",
     "get_best_available_genome_id",
     "podp_download_and_extract_antismash_data",
+    "download_and_extract_ncbi_genome",
+    "submit_antismash_job",
+    "antismash_job_is_done",
 ]
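To make the expanded public API concrete, here is a sketch of a DB-first, API-fallback flow built only from the names exported above. The real orchestration lives in podp_antismash_downloader.py, which this diff does not show, so the wiring, file paths, accession, and polling interval below are illustrative assumptions rather than the PR's actual logic.

```python
# Hypothetical wiring of the exported helpers; the actual orchestration is in
# podp_antismash_downloader.py, which is not part of this diff. The paths and
# accession are placeholders.
import time

from nplinker.genomics.antismash import (
    antismash_job_is_done,
    download_and_extract_from_antismash_api,
    download_and_extract_from_antismash_db,
    submit_antismash_job,
)

accession = "GCF_004339725.1"  # placeholder RefSeq assembly accession
genbank_file = "/data/genomes/GCF_004339725.1.gbk"  # placeholder GenBank input
download_root, extract_root = "/data/download", "/data/extracted"

try:
    # Fast path: precomputed results from the antiSMASH database.
    download_and_extract_from_antismash_db(accession, download_root, extract_root)
except RuntimeError:
    # Fallback: run antiSMASH via the public API and fetch the finished job.
    job_id = submit_antismash_job(genbank_file)
    while not antismash_job_is_done(job_id):
        time.sleep(60)  # arbitrary polling interval
    download_and_extract_from_antismash_api(job_id, accession, download_root, extract_root)
```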
81 changes: 81 additions & 0 deletions src/nplinker/genomics/antismash/antismash_api_client.py
@@ -0,0 +1,81 @@
from __future__ import annotations
import logging
from os import PathLike
from pathlib import Path
import requests


logger = logging.getLogger(__name__)


def submit_antismash_job(genbank_filepath: str | PathLike) -> str:
"""Submits an antiSMASH job using the provided GenBank file.

This function sends a GenBank file to the antiSMASH API
and retrieves the job ID if the submission is successful.

Args:
genbank_filepath (str | PathLike): The path to the GenBank file to be submitted.

Returns:
str: The job ID of the submitted antiSMASH job.

Raises:
requests.exceptions.RequestException: If there is an issue with the HTTP request.
RuntimeError: If the API response does not contain a job ID.
"""
url = "https://antismash.secondarymetabolites.org/api/v1.0/submit"
genbank_filepath = Path(genbank_filepath)

with open(genbank_filepath, "rb") as file:
files = {"seq": file}
response = requests.post(url, files=files)
response.raise_for_status() # Raise an exception for HTTP errors

data = response.json()
if "id" not in data:
raise RuntimeError("No antiSMASH job ID returned")
return str(data["id"])


def antismash_job_is_done(job_id: str) -> bool:
"""Determines if the antiSMASH job has completed by checking its status.

This function queries the antiSMASH API to retrieve the current state
of the job and determines whether it has finished successfully, is still
in progress, or has encountered an error.

Args:
job_id (str): The unique identifier of the antiSMASH job.

Returns:
bool: True if the job is completed successfully, False if it is still
running or queued.

Raises:
RuntimeError: If the job has failed or if the API response indicates an error.
ValueError: If the job state is missing or an unexpected state is encountered
in the API response.
requests.exceptions.HTTPError: If an HTTP error occurs during the API request.
"""
url = f"https://antismash.secondarymetabolites.org/api/v1.0/status/{job_id}"

response = requests.get(url, timeout=10)
response.raise_for_status() # Raise exception for HTTP errors
respose_data = response.json()

if "state" not in respose_data:
raise ValueError(f"Job state missing in response for job_id: {job_id}")

job_state = respose_data["state"]
Comment on lines +65 to +70

Copilot AI (Apr 1, 2025):

The variable 'respose_data' seems to be misspelled; consider renaming it to 'response_data' for clarity.

Suggested change:
-    respose_data = response.json()
-    if "state" not in respose_data:
-        raise ValueError(f"Job state missing in response for job_id: {job_id}")
-    job_state = respose_data["state"]
+    response_data = response.json()
+    if "state" not in response_data:
+        raise ValueError(f"Job state missing in response for job_id: {job_id}")
+    job_state = response_data["state"]
    if job_state in ("running", "queued"):
        return False
    if job_state == "done":
        return True
    if job_state == "failed":
        job_status = respose_data.get("status", "No error message provided")
        raise RuntimeError(f"AntiSMASH job {job_id} failed with an error: {job_status}")
    else:
        raise ValueError(
            f"Unexpected job state for antismash job ID {job_id}. Job state: {job_state}"
        )
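As a usage note for the two functions above, the sketch below shows the submit-and-poll cycle and how the documented error behaviour surfaces at the call site. Only what the docstrings state is assumed; the input file name and the 30-second interval are placeholders.

```python
# Minimal sketch of the submit/poll semantics documented above.
# "example.gbk" and the sleep interval are placeholders, not values from this PR.
import time

from nplinker.genomics.antismash import antismash_job_is_done, submit_antismash_job

job_id = submit_antismash_job("example.gbk")  # raises RuntimeError if no job ID is returned

try:
    while not antismash_job_is_done(job_id):  # False while the job is "running" or "queued"
        time.sleep(30)
except RuntimeError as err:
    # Raised once the job reaches the "failed" state; the message includes the
    # job's status field when the API provides one.
    print(f"antiSMASH job {job_id} failed: {err}")
else:
    print(f"antiSMASH job {job_id} is done and its results can be downloaded.")
```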
151 changes: 125 additions & 26 deletions src/nplinker/genomics/antismash/antismash_downloader.py
@@ -4,7 +4,9 @@
import shutil
from os import PathLike
from pathlib import Path
import requests
from nplinker.utils import download_and_extract_archive
from nplinker.utils import extract_archive
from nplinker.utils import list_dirs
from nplinker.utils import list_files

@@ -15,10 +17,75 @@
ANTISMASH_DB_DOWNLOAD_URL = "https://antismash-db.secondarymetabolites.org/output/{}/{}"
# The antiSMASH DBV2 is for the availability of the old version, better to keep it.
ANTISMASH_DBV2_DOWNLOAD_URL = "https://antismash-dbv2.secondarymetabolites.org/output/{}/{}"
# antismash api to download results from submitted jobs
ANTISMASH_API_DOWNLOAD_URL = "https://antismash.secondarymetabolites.org/upload/{}/{}"


def download_and_extract_antismash_data(
antismash_id: str, download_root: str | PathLike, extract_root: str | PathLike
url: str, antismash_id: str, download_root: str | PathLike, extract_root: str | PathLike
) -> None:
"""Download and extract antiSMASH BGC archive for a specified genome.

This function downloads a BGC archive from the specified URL, extracts its contents,
and organizes the extracted files into a structured directory under the given `extract_root`.

Args:
url (str): The URL to download the BGC archive from.
antismash_id (str): The identifier for the antiSMASH genome, used to name the extraction directory.
download_root: Path to the directory where the downloaded archive will be stored.
extract_root: Path to the directory where the data files will be extracted.
Note that an `antismash` directory will be created in the specified `extract_root` if
it doesn't exist. The files will be extracted to `<extract_root>/antismash/<antismash_id>` directory.

Raises:
ValueError: if `<extract_root>/antismash/<antismash_id>` dir is not empty.
Exception: If any error occurs during the download or extraction process, the partially extracted
directory will be cleaned up, and the exception will be re-raised.

Examples:
>>> download_and_extract_antismash_data(
"https://antismash-db.secondarymetabolites.org/output/GCF_001.1/GCF_001.1.zip",
"GCF_001.1",
"/data/download",
"/data/extracted"
)
"""
extract_path = Path(extract_root) / "antismash" / antismash_id

_prepare_extract_path(extract_path)
try:
download_and_extract_archive(url, download_root, extract_path, f"{antismash_id}.zip")
_cleanup_extracted_files(extract_path)
except Exception as e:
shutil.rmtree(extract_path)
raise e


def download_and_extract_from_antismash_api(
job_id: str, antismash_id: str, download_root: str | PathLike, extract_root: str | PathLike
) -> None:
"""Downloads and extracts results from an antiSMASH API job.

This function constructs the download URL using the provided job ID then
downloads the results as a ZIP file and extracts its contents to the specified directories.

Args:
job_id (str): The job ID for the antiSMASH API job.
antismash_id (str): The unique identifier for the antiSMASH dataset.
download_root (str or PathLike): The root directory where the ZIP file will be downloaded.
extract_root (str or PathLike): The root directory where the contents of the ZIP file will be extracted.

Raises:
requests.exceptions.RequestException: If there is an issue with the HTTP request.
zipfile.BadZipFile: If the downloaded file is not a valid ZIP file.
OSError: If there is an issue with file operations such as writing or extracting.
"""
url = ANTISMASH_API_DOWNLOAD_URL.format(job_id, antismash_id + ".zip")
download_and_extract_antismash_data(url, antismash_id, download_root, extract_root)


def download_and_extract_from_antismash_db(
refseq_acc: str, download_root: str | PathLike, extract_root: str | PathLike
) -> None:
"""Download and extract antiSMASH BGC archive for a specified genome.

@@ -27,7 +94,7 @@ def download_and_extract_antismash_data(
of a genome as the id of the archive.

Args:
antismash_id: The id used to download BGC archive from antiSMASH database.
refseq_acc: The id used to download BGC archive from antiSMASH database.
If the id is versioned (e.g., "GCF_004339725.1") please be sure to
specify the version as well.
download_root: Path to the directory to place downloaded archive in.
@@ -36,45 +103,77 @@
it doesn't exist. The files will be extracted to `<extract_root>/antismash/<antismash_id>` directory.

Raises:
ValueError: if `<extract_root>/antismash/<refseq_assembly_id>` dir is not empty.
ValueError: if `<extract_root>/antismash/<refseq_acc>` dir is not empty.

Examples:
>>> download_and_extract_antismash_metadata("GCF_004339725.1", "/data/download", "/data/extracted")
>>> download_and_extract_from_antismash_db("GCF_004339725.1", "/data/download", "/data/extracted")
"""
download_root = Path(download_root)
extract_root = Path(extract_root)
extract_path = extract_root / "antismash" / antismash_id
for base_url in [ANTISMASH_DB_DOWNLOAD_URL, ANTISMASH_DBV2_DOWNLOAD_URL]:
url = base_url.format(refseq_acc, f"{refseq_acc}.zip")
if requests.head(url).status_code == 404: # not found
continue
download_and_extract_antismash_data(url, refseq_acc, download_root, extract_root)
return # Exit the loop once a valid URL is processed

try:
if extract_path.exists():
_check_extract_path(extract_path)
else:
extract_path.mkdir(parents=True, exist_ok=True)
# if both urls give 404 not found
raise RuntimeError(f"No results in antiSMASH DB for {refseq_acc}")

for base_url in [ANTISMASH_DB_DOWNLOAD_URL, ANTISMASH_DBV2_DOWNLOAD_URL]:
url = base_url.format(antismash_id, antismash_id + ".zip")
download_and_extract_archive(url, download_root, extract_path, antismash_id + ".zip")
break

# delete subdirs
for subdir_path in list_dirs(extract_path):
shutil.rmtree(subdir_path)
def extract_antismash_data(
archive: str | PathLike, extract_root: str | PathLike, antimash_id: str
) -> None:
"""Extracts antiSMASH results from a given archive into a specified directory.

# delete unnecessary files
files_to_keep = list_files(extract_path, suffix=(".json", ".gbk"))
for file in list_files(extract_path):
if file not in files_to_keep:
os.remove(file)
This function handles the extraction of antiSMASH results by preparing the
extraction path, extracting the archive, and performing cleanup of
unnecessary files. If an error occurs during the process, the partially
extracted files are removed, and the exception is re-raised.

logger.info("antiSMASH BGC data of %s is downloaded and extracted.", antismash_id)
Args:
archive (str | PathLike): The path to the archive file containing antiSMASH results.
extract_root (str | PathLike): The root directory where the data should
be extracted.
antimash_id (str): A unique identifier for the antiSMASH data, used to
create a subdirectory for the extracted files.

Raises:
Exception: If any error occurs during the extraction process, the
exception is re-raised after cleaning up the extraction directory.
"""
extract_path = Path(extract_root) / "antismash" / antimash_id

_prepare_extract_path(extract_path)

try:
extract_archive(archive, extract_path, remove_finished=False)
_cleanup_extracted_files(extract_path)

except Exception as e:
shutil.rmtree(extract_path)
logger.warning(e)
raise e


def _check_extract_path(extract_path: Path):
# check if extract_path is empty
if any(extract_path.iterdir()):
raise ValueError(f'Nonempty directory: "{extract_path}"')


def _cleanup_extracted_files(extract_path: str | PathLike) -> None:
# delete subdirs
for subdir_path in list_dirs(extract_path):
shutil.rmtree(subdir_path)

# delete unnecessary files
files_to_keep = list_files(extract_path, suffix=(".json", ".gbk"))
for file in list_files(extract_path):
if file not in files_to_keep:
os.remove(file)


def _prepare_extract_path(extract_path: str | PathLike) -> None:
extract_path = Path(extract_path)
if extract_path.exists():
_check_extract_path(extract_path)
else:
extract_path.mkdir(parents=True, exist_ok=True)
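To round off the downloader changes, a short sketch of the two extraction entry points added in this file. The paths and job ID are invented; the only behaviour assumed is what the code above documents: results land in `<extract_root>/antismash/<antismash_id>/`, the target directory must be empty, subdirectories are deleted, and only `.json` and `.gbk` files are kept.

```python
# Placeholder paths and job ID throughout; behaviour follows the functions defined above.
from nplinker.genomics.antismash import (
    download_and_extract_from_antismash_api,
    extract_antismash_data,
)

# Results of a finished antiSMASH API job, fetched by job ID:
download_and_extract_from_antismash_api(
    "example-job-id",      # placeholder job ID returned by submit_antismash_job
    "GCF_000514775.1",     # antismash_id, used to name the extraction directory
    "/data/download",
    "/data/extracted",
)

# An archive that is already present locally, e.g. from an earlier run:
extract_antismash_data(
    "/data/download/GCF_000514775.1.zip",
    "/data/extracted",
    "GCF_000514775.1",
)

# Both calls leave only .json and .gbk files in /data/extracted/antismash/<antismash_id>/
# and raise ValueError if that directory already exists and is not empty.
```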