This repository contains all relevant code, documentation, and results for our project in the course Biomedical Data Types at Hasso Plattner Institute.
The project focuses on performing a genome-wide association study (GWAS) using the synthetic HAPNEST dataset. We analyze genetic associations separately for two ancestry groups – European (EUR) and African (AFR) – and combine the results through meta-analysis. The goal is to explore ancestry-specific patterns in genetic data and assess how population structure impacts GWAS outcomes.
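To make the meta-analysis step concrete, here is a minimal sketch of inverse-variance weighted fixed-effect pooling for a single variant. It only illustrates the general technique: the project itself runs the meta-analysis with PLINK, and the README does not state which model the notebooks report. The function name `fixed_effect_meta` and the effect sizes below are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Pool per-ancestry effect estimates for one variant.

    betas: effect sizes (e.g. log odds ratios), one per ancestry group
    ses:   matching standard errors
    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)

    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)

    z = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


# Hypothetical estimates for one SNP in EUR and AFR (not project results).
beta, se, z, p = fixed_effect_meta(betas=[0.10, 0.20], ses=[0.05, 0.10])
print(f"beta={beta:.3f} se={se:.4f} z={z:.2f} p={p:.4f}")
```

The estimate with the smaller standard error dominates the pooled effect, which is why differences in sample size between ancestry groups matter when interpreting the combined results.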
HAPNEST_GWAS/
├── data/
│ ├── raw/ # Original input data (PLINK files per chromosome)
│ ├── processed/ # Filtered & QC'd data per ancestry group
│ ├── maps/ # rsID mapping files per chromosome
│ └── results/ # Results and visualizations
│
├── notebooks/
│ ├── 01_preprocessing.ipynb # Data filtering, phenotype assignment and quality control
│ ├── 02_gwas_analysis.ipynb # GWAS per ancestry and meta-analysis
│ └── 03_results_visualization.ipynb # Manhattan plots and summary visuals
│
├── requirements.txt # Python dependencies
└── README.md

- Clone this repository and set up a Python virtual environment.
- Install dependencies: pip install -r requirements.txt.
- Download PLINK into the project folder.
- Download the HAPNEST dataset manually into data/raw/.
- Run the preprocessing, QC and GWAS for both ancestries, AFR and EUR.
- Run the meta-analysis.
- Create plots & visualizations with 03_results_visualization.ipynb.
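The per-ancestry QC and GWAS steps can be sketched as PLINK command lines assembled in Python. This is an illustration rather than the notebooks' actual code: the file prefixes (`chr{N}`, `{ancestry}_samples.txt`), the QC thresholds, the `./plink` location, and the choice of `--assoc` are all assumptions. Only the PLINK 1.9 flags themselves (`--bfile`, `--keep`, `--maf`, `--geno`, `--hwe`, `--make-bed`, `--assoc`, `--out`) are standard.

```python
from pathlib import Path

ANCESTRIES = ("AFR", "EUR")
RAW_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")
RESULTS_DIR = Path("data/results")


def qc_command(ancestry, chrom, plink="./plink"):
    """Build a PLINK QC command for one ancestry group and chromosome."""
    raw_prefix = RAW_DIR / f"chr{chrom}"                        # assumed naming
    keep_file = PROCESSED_DIR / f"{ancestry}_samples.txt"       # assumed sample list
    out_prefix = PROCESSED_DIR / f"{ancestry}_chr{chrom}_qc"
    return [
        plink,
        "--bfile", raw_prefix.as_posix(),
        "--keep", keep_file.as_posix(),   # restrict to one ancestry group
        "--maf", "0.01",                  # assumed thresholds, adjust as needed
        "--geno", "0.05",
        "--hwe", "1e-6",
        "--make-bed",
        "--out", out_prefix.as_posix(),
    ]


def gwas_command(ancestry, chrom, plink="./plink"):
    """Build a PLINK association command on the QC'd files."""
    qc_prefix = PROCESSED_DIR / f"{ancestry}_chr{chrom}_qc"
    out_prefix = RESULTS_DIR / f"{ancestry}_chr{chrom}"
    return [
        plink,
        "--bfile", qc_prefix.as_posix(),
        "--assoc",                        # assumed test; the notebooks may differ
        "--out", out_prefix.as_posix(),
    ]


def all_commands(chroms):
    """List QC and GWAS commands for every ancestry and chromosome."""
    commands = []
    for ancestry in ANCESTRIES:
        for chrom in chroms:
            commands.append(qc_command(ancestry, chrom))
            commands.append(gwas_command(ancestry, chrom))
    return commands


# Print the commands instead of executing them; to run one for real,
# pass it to subprocess.run(cmd, check=True).
for cmd in all_commands(range(1, 3)):
    print(" ".join(cmd))
```

Building each command as a list keeps the arguments unambiguous and lets you inspect the full plan before any PLINK process is started.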
- PLINK v1.9 – for QC, GWAS, and meta-analysis (Slifer, 2018)
- Python 3.11 – data handling and preprocessing
- Libraries listed in requirements.txt
- Jupyter Notebook – reproducible workflow and step-by-step analysis
This project was developed as part of the course Biomedical Data Types at HPI. Dataset credit: HAPNEST project (S-BSST936, EBI). Code and content are for academic use only.
- Maximilian Kalff
- Emil Kobel