This project is a complete bioinformatics pipeline to identify and interpret genetic mutations in Staphylococcus aureus samples (PRJNA325350) that evolved resistance to Daptomycin (DAP) and Vancomycin (VAN).
The pipeline starts from raw FASTQ reads, performs QC, alignment (BWA), variant calling (GATK), and annotation (SnpEff), and concludes with a deep downstream analysis (Pandas) and visualization (Seaborn) to identify the "prime suspect" genes responsible for antibiotic resistance.
The primary goal was to find de novo mutations (variants present in treated samples but absent from the Control). The analysis successfully identified 48 such mutations.
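The "present in treated, absent from Control" rule is a set difference. The following is a minimal sketch in plain Python rather than the pandas code used in the notebooks. It assumes each variant can be keyed by (chromosome, position, reference allele, alternate allele); the sample contents are invented toy values, not results from this project.

```python
# Minimal sketch of the de novo filter as a set difference.
# Assumption: a variant is uniquely identified by (chrom, pos, ref, alt).

def de_novo_variants(treated, control):
    """Return variants seen in a treated sample but not in the control."""
    return set(treated) - set(control)

# Hypothetical toy data, for illustration only.
control = {("chr", 100, "A", "G"), ("chr", 250, "C", "T")}
dap_sample = {("chr", 100, "A", "G"), ("chr", 512, "G", "A"), ("chr", 900, "T", "TA")}

de_novo = de_novo_variants(dap_sample, control)
for variant in sorted(de_novo):
    print(variant)
```

Variants shared with the control (position 100 in the toy data) drop out, leaving only those that arose in the treated lineage.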
The visualization below summarizes the key evolutionary pathways taken by the DAP and VAN lineages:
The analysis revealed distinct and clinically relevant resistance strategies:
- Daptomycin (DAP) Resistance (Membrane Adaptation):
  - The DAP lineage developed significant `missense` mutations in genes critical for cell membrane homeostasis, including `mprF` (a well-known DAP resistance gene) and `agrA` (a master regulator of virulence).
- Vancomycin (VAN) Resistance (Cell Wall Stress Response):
  - The VAN lineage developed `missense` mutations in envelope stress-sensing systems, including `vraG` (part of the VraFG transporter associated with the GraRS sensor) and `vraT` (part of the VraTSR system that senses cell wall damage).
- The "Hypermutator" Engine (`mutL`):
  - The most significant finding was the emergence of high-impact `frameshift` mutations in the DNA mismatch-repair gene `mutL` in the DAP lineage. This "knock-out" likely created a "hypermutator" state, accelerating the rate of mutation and allowing genes like `mprF` to acquire resistance mutations rapidly.
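Effect types such as missense or frameshift, and impact levels such as HIGH, come from the `ANN` field that SnpEff writes into each VCF record. The sketch below shows one way to pull the first four sub-fields (allele, effect, impact, gene name) out of an INFO string. The INFO string and the gene names `geneA` and `geneB` are hypothetical, and the notebooks may read the SnpEff reports differently.

```python
# Minimal sketch: extract effect and impact from a SnpEff ANN field.
# ANN sub-fields are pipe-separated; the first four are
# Allele | Annotation (effect) | Annotation_Impact | Gene_Name.
ANN_FIELDS = ("allele", "effect", "impact", "gene_name")

def parse_ann(info):
    """Yield one dict per SnpEff annotation found in a VCF INFO string."""
    for entry in info.split(";"):
        if not entry.startswith("ANN="):
            continue
        # Multiple annotations for one variant are comma-separated.
        for annotation in entry[len("ANN="):].split(","):
            parts = annotation.split("|")
            # Note: the effect may hold several terms joined by "&".
            yield dict(zip(ANN_FIELDS, parts[:4]))

# Hypothetical INFO string, for illustration only.
info = (
    "DP=58;ANN=TA|frameshift_variant|HIGH|geneA|GENE_0001|transcript|"
    "GENE_0001|protein_coding|1/1|c.10dupA|p.Thr4fs|10/900|10/900|4/299||,"
    "TA|upstream_gene_variant|MODIFIER|geneB|GENE_0002|transcript|"
    "GENE_0002|protein_coding||c.-50_-49insA|||||50|"
)

annotations = list(parse_ann(info))
high_impact = [a for a in annotations if a["impact"] == "HIGH"]
print(high_impact)
```

Filtering on `impact == "HIGH"` is how a frameshift in a repair gene would stand out from the more numerous MODERATE and MODIFIER annotations.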
The project follows a clean, step-by-step notebook workflow. Each notebook is self-contained and completes one major stage of the pipeline.
- `00_Setup_and_Download.ipynb`: Installs all tools (GATK, BWA, SnpEff, etc.) and downloads the 7 SRA samples.
- `01_QC_and_Trimming.ipynb`: Runs FastQC, identifies adapters, and cleans reads using `fastp`.
- `02_Mapping_and_BAM_Processing.ipynb`: Indexes the reference and maps reads using `bwa mem` (with Read Groups) and `samtools`.
- `03_Variant_Calling_GATK.ipynb`: Creates sequence dictionaries and calls variants using `gatk HaplotypeCaller` to produce 7 VCF files.
- `04_Variant_Annotation.ipynb`: Builds a custom SnpEff database and annotates the VCF files, producing 7 `.ann.vcf.gz` files and 7 `.genes.txt` reports.
- `05_Downstream_Analysis_and_Interpretation.ipynb`: Programmatically aggregates the 7 reports using `pandas`, performs a set-difference analysis to find de novo mutations, and saves the final 48 "prime suspects" to a CSV.
- `06_Final_Visualization.ipynb`: Loads the final CSV and generates the publication-quality heatmap using `matplotlib`/`seaborn`.
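The mapping and variant-calling stages come down to two command lines per sample. The sketch below only builds those commands as argument lists and does not run them. The file names, the paired-end layout, the `PL:ILLUMINA` platform tag and the haploid `--sample-ploidy 1` setting are assumptions made for illustration; the notebooks may use different options.

```python
# Minimal sketch: build the per-sample mapping and calling commands.
# All file names and option values below are hypothetical.
import shlex

def bwa_mem_cmd(reference, fastq_1, fastq_2, sample, threads=4):
    """Build a `bwa mem` command with a read group (needed by GATK)."""
    # bwa expects the literal two characters "\t" between read-group tags.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return ["bwa", "mem", "-t", str(threads), "-R", read_group,
            reference, fastq_1, fastq_2]

def haplotype_caller_cmd(reference, bam, out_vcf, ploidy=1):
    """Build a `gatk HaplotypeCaller` command (ploidy 1 is an assumption)."""
    return ["gatk", "HaplotypeCaller", "-R", reference, "-I", bam,
            "-O", out_vcf, "--sample-ploidy", str(ploidy)]

map_cmd = bwa_mem_cmd("reference.fasta", "S1_1.fastq.gz", "S1_2.fastq.gz", "S1")
call_cmd = haplotype_caller_cmd("reference.fasta", "S1.sorted.bam", "S1.vcf.gz")

print(shlex.join(map_cmd))
print(shlex.join(call_cmd))
```

`bwa mem` writes SAM to standard output, so in practice its output would be piped into `samtools` for sorting and indexing before the BAM is handed to HaplotypeCaller.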
- Clone this repository:

      git clone https://github.com/refmyoussef-source/project_variant_calling.git
      cd project_variant_calling

- Create the Conda Environment:
  - This project's dependencies are managed in the `environment.yml` file.

        conda env create -f environment.yml
        conda activate variant_call_env

- Run the Pipeline:
  - Open the `notebooks/` directory and run the notebooks in numerical order (from `00` to `06`).
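For an unattended run, the numerical order can also be enforced from a script. This is only a sketch: it assumes `jupyter nbconvert` is available in the environment, which the project does not state, and it builds the commands without executing them.

```python
# Minimal sketch: order the notebooks by numeric prefix and build
# headless-execution commands. Running via nbconvert is an assumption.
import re

def ordered_notebooks(names):
    """Sort notebook file names by their two-digit numeric prefix."""
    numbered = [n for n in names if re.match(r"\d{2}_.*\.ipynb$", n)]
    return sorted(numbered, key=lambda n: int(n[:2]))

def nbconvert_cmd(notebook):
    """Build a command that executes a notebook in place."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", notebook]

names = [
    "06_Final_Visualization.ipynb",
    "00_Setup_and_Download.ipynb",
    "03_Variant_Calling_GATK.ipynb",
    "README.md",
]
run_order = ordered_notebooks(names)
commands = [nbconvert_cmd(n) for n in run_order]
print(run_order)
```

Each command could then be passed to `subprocess.run(..., check=True)` so that a failing stage stops the pipeline before later notebooks run on incomplete outputs.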
