omicsedge/orchestra-paper

Paper

Orchestra is a pipeline for genetic ancestry inference using deep learning. This README provides setup and usage instructions.

Setup Instructions

1. Download and Prepare Reference Files

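Download the GRCh38 reference FASTA files (~600 MB) from NCBI and normalize them. The loop below replaces the ambiguity codes matched by the sed expression with A, converts lowercase a/c/t/g to uppercase, and writes one file per chromosome to data/fasta/:

# Create the output directory if it does not exist yet
mkdir -p data/fasta

for chr in {1..22}; do
    # Download the chromosome file
    wget --timestamping https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/GCA_000001405.15_GRCh38_assembly_structure/Primary_Assembly/assembled_chromosomes/FASTA/chr$chr.fna.gz

    # Decompress and keep a copy of the FASTA header line
    bgzip -d chr$chr.fna.gz
    head -n 1 chr$chr.fna > header

    # Replace ambiguity codes with A, drop the header line, uppercase a/c/t/g
    sed -e 's/[NRWYMKBSV]/A/g' -e '1d' -e 'y/actg/ACTG/' chr$chr.fna > chr$chr.fa

    # Reattach the original header and write the final file
    cat header chr$chr.fa > data/fasta/$chr.fa

    # Cleanup
    rm header chr$chr.fa chr$chr.fna
done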
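
To see what the sed step does before running it on full chromosomes, the same expression can be tried on a few toy lines. This is a minimal sketch: the helper name clean_fasta_body and the toy sequence are illustrative and are not part of the repository.

```shell
# Hypothetical helper wrapping the same sed expression used in the loop above.
clean_fasta_body() {
    sed -e 's/[NRWYMKBSV]/A/g' -e '1d' -e 'y/actg/ACTG/'
}

# Toy input: one header line followed by two sequence lines.
printf '>toy_header\nACGTNNacgt\nRYacgtnn\n' | clean_fasta_body
# prints:
# ACGTAAACGT
# AAACGTnn
```

The header line is dropped, uppercase ambiguity codes become A, and lowercase a/c/t/g are uppercased. Lowercase ambiguity codes (the trailing nn here) pass through unchanged, because neither sed command matches them.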

2. Reference Files Structure

The following files are required:

File/Directory                                            Description
data/fasta/*.fa                                           Chromosome FASTA files (chr1-22)
reference_files/QC_ancestry_associated_SNP_set.hg38.keep  QC ancestry-associated SNP set
reference_files/Ancestry_regions.hg38.txt                 Ancestry regions defined for our custom panel
data/toy_example/Source_panel.vcf.gz                      Source panel for simulation
data/toy_example/SampleTable.forTraining.txt              Population structure definitions
data/toy_example/Admixed_Mexicans.target_panel.vcf.gz     Test data for inference

File Format Specifications

FASTA Files (*.fa)
  • One file per chromosome, 1 to 22 (the download loop in step 1 writes them as data/fasta/1.fa through data/fasta/22.fa)
  • Contains reference genome sequence
  • Processed from NCBI reference files
Ancestry SNP Set (QC_ancestry_associated_SNP_set.hg38.keep)
  • Space-separated text file
  • Contains 1,202,443 ancestry-informative variants
  • Format: CHROM POS REF ALT
  • Based on hg38/GRCh38 assembly
Ancestry Regions (Ancestry_regions.hg38.txt)
  • Tab-separated text file
  • Defines genomic regions for ancestry analysis
  • Format: CHROM START_POS END_POS
  • Based on hg38/GRCh38 assembly
Source Panel (Source_panel.vcf.gz)
  • Compressed VCF format
  • Contains genetic variants for simulation
  • Required fields: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
Sample Map (SampleTable.forTraining.txt)
  • Tab-separated values
  • Defines population structure
  • Required columns: Sample ID, Population, Super Population
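
The column layouts listed above can be sanity-checked before starting the pipeline. The sketch below assumes only the separators and column counts stated in this section; the helper name check_columns and the toy rows are made up for illustration, and the real files may need further checks (for example on header lines).

```shell
# Hypothetical helper: succeed only if every line of FILE ($1) has exactly
# N ($3) fields when split on the separator regex SEP ($2).
check_columns() {
    awk -F "$2" -v n="$3" 'NF != n { bad++ } END { exit (bad > 0) }' "$1"
}

# Toy files that mimic the documented layouts (values are invented).
tmp=$(mktemp -d)
printf 'chr1 12345 A G\nchr2 67890 C T\n' > "$tmp/snps.keep"
printf 'chr1\t1000\t2000\n'               > "$tmp/regions.txt"
printf 'S1\tPOP_A\tSUPER_X\n'             > "$tmp/samples.txt"

check_columns "$tmp/snps.keep"   '[ ]' 4 && echo "SNP set: 4 space-separated columns"
check_columns "$tmp/regions.txt" '\t'  3 && echo "Regions: 3 tab-separated columns"
check_columns "$tmp/samples.txt" '\t'  3 && echo "Sample map: 3 tab-separated columns"
rm -r "$tmp"
```

To check the real inputs, point the same helper at reference_files/QC_ancestry_associated_SNP_set.hg38.keep, reference_files/Ancestry_regions.hg38.txt, and data/toy_example/SampleTable.forTraining.txt.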

3. Set Environment Variable

export EXPERIMENT_NAME="example-0.01"
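
Later steps pass $EXPERIMENT_NAME to the -v flag and use it in the model path, so an unset variable would expand to nothing there. A small guard can catch this early. This is a sketch only; the function name require_experiment_name is hypothetical.

```shell
# Hypothetical guard: fail with a message if EXPERIMENT_NAME is unset or empty.
require_experiment_name() {
    if [ -z "${EXPERIMENT_NAME:-}" ]; then
        echo "EXPERIMENT_NAME is not set" >&2
        return 1
    fi
    echo "Using experiment: $EXPERIMENT_NAME"
}

export EXPERIMENT_NAME="example-0.01"
require_experiment_name
# prints: Using experiment: example-0.01
```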

4. Build Docker Images

make build

5. Run Simulation Pipeline

docker run --rm \
    -v $(pwd)/data:/data \
    -v $(pwd)/results:/results \
    orchestra simulation \
    -sc 1 -ec 22 \
    -sp /data/toy_example/Source_panel.vcf.gz \
    -sm /data/toy_example/SampleTable.forTraining.txt \
    -v $EXPERIMENT_NAME \
    -t "random" \
    -nt 2 \
    -o /results/simulation

Simulation Parameters

Parameter                 Description
-sc, --start-chromosome   Start chromosome number
-ec, --end-chromosome     End chromosome number
-sp, --source-panel       Source panel VCF path
-sm, --sample-map         Sample map TSV path
-v, --version             Version identifier
-t, --type                Simulation type
-nt, --num-threads        Number of threads
-o, --output              Output directory
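
The -sc and -ec values can be validated before launching a container. The sketch below assumes the valid range is chromosomes 1 to 22, as in the reference files from step 1, and that the start must not exceed the end. The function name valid_chr_range is hypothetical, and the orchestra CLI may apply its own checks.

```shell
# Hypothetical check: both arguments are integers with 1 <= start <= end <= 22.
valid_chr_range() {
    case "$1$2" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
    esac
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" -ge 1 ] && [ "$2" -le 22 ] && [ "$1" -le "$2" ]
}

if valid_chr_range 1 22; then
    echo "range 1-22 is valid"
fi
if ! valid_chr_range 5 3; then
    echo "range 5-3 is rejected"
fi
```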

6. Train Models

Train models on groups of chromosomes: consecutive pairs for chromosomes 1-18, and the four smallest chromosomes (19-22) together in one group:

for chr in "1 2" "3 4" "5 6" "7 8" "9 10" "11 12" "13 14" "15 16" "17 18" "19 22"; do
    # Split the "start end" pair into its two chromosome numbers
    start_chr=$(echo $chr | cut -d' ' -f1)
    end_chr=$(echo $chr | cut -d' ' -f2)

    docker run --rm \
        -v $(pwd)/data:/data \
        -v $(pwd)/results:/results \
        orchestra training \
        -sd /results/simulation \
        -sc $start_chr \
        -ec $end_chr \
        -ws 600 \
        -l 3 \
        -v $EXPERIMENT_NAME \
        -e 100 \
        -o /results/training 

    echo "✓ Completed chromosomes $start_chr-$end_chr"
done

Training Parameters

Parameter                 Description
-sd, --simulation-dir     Simulation data directory
-sc, --start-chromosome   Start chromosome
-ec, --end-chromosome     End chromosome
-ws, --window-size        Processing window size
-l, --level               Model complexity level
-o, --output              Output directory
-v, --version             Version identifier
-e, --epochs              Training epochs
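
Before starting a long training run, it can help to confirm that the chromosome groups cover 1 to 22 without gaps and to preview the ranges that will be passed to -sc and -ec. The following is a dry-run sketch that does not call Docker; the function names chromosome_groups and covers_autosomes are hypothetical.

```shell
# The chromosome groups used by the training loop, one "start end" pair per line.
chromosome_groups() {
    printf '%s\n' "1 2" "3 4" "5 6" "7 8" "9 10" "11 12" "13 14" "15 16" "17 18" "19 22"
}

# Hypothetical check: succeed only if the ranges on stdin tile 1..22 in order.
covers_autosomes() {
    awk 'BEGIN { next_chr = 1 }
         $1 != next_chr || $2 < $1 { bad = 1 }
         { next_chr = $2 + 1 }
         END { exit (bad || next_chr != 23) }'
}

chromosome_groups | covers_autosomes && echo "groups cover chromosomes 1-22"

# Dry run: split each pair with read and print what would be trained.
chromosome_groups | while read -r start_chr end_chr; do
    echo "would train chromosomes ${start_chr}-${end_chr}"
done
```

Splitting with read avoids the two cut calls per iteration and handles both fields in one step.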

7. Run Inference Pipeline

docker run --rm \
    -v $(pwd)/data:/data \
    -v $(pwd)/results:/results \
    orchestra inference \
    -p /data/toy_example/Admixed_Mexicans.target_panel.vcf.gz \
    -m /results/training/$EXPERIMENT_NAME \
    -o /results/inference

Inference Parameters

Parameter                 Description
-p, --panel               Inference panel path
-o, --output              Output directory
-m, --model               Trained model directory
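
Because the container sees ./data as /data and ./results as /results, the inference inputs can be checked on the host first. The sketch below assumes the trained models end up under results/training/$EXPERIMENT_NAME on the host, as the -m path suggests. The function name preflight_inference is hypothetical, and the demonstration runs on a throwaway directory tree rather than real outputs.

```shell
# Hypothetical pre-flight check: the panel file and model directory must exist.
preflight_inference() {
    panel=$1
    model_dir=$2
    if [ ! -f "$panel" ]; then
        echo "missing panel: $panel" >&2
        return 1
    fi
    if [ ! -d "$model_dir" ]; then
        echo "missing model directory: $model_dir" >&2
        return 1
    fi
    echo "inputs found"
}

# Demonstration on a temporary tree that mirrors the assumed host-side layout.
tmp=$(mktemp -d)
mkdir -p "$tmp/data/toy_example" "$tmp/results/training/example-0.01"
: > "$tmp/data/toy_example/Admixed_Mexicans.target_panel.vcf.gz"

preflight_inference \
    "$tmp/data/toy_example/Admixed_Mexicans.target_panel.vcf.gz" \
    "$tmp/results/training/example-0.01"
rm -r "$tmp"
```

On a real checkout the equivalent call would be preflight_inference data/toy_example/Admixed_Mexicans.target_panel.vcf.gz "results/training/$EXPERIMENT_NAME".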

References

  • Lerga-Jaso, J., Novković, B., Unnikrishnan, D., Bamunusinghe, V., Hatorangan, M.R., Manson, C., Pedersen, H., Osama, A., Terpolovsky, A., Bohn, S., De Marino, A., Mahmoud, A.A., Bircan, K.O., Khan, U., Grabherr, M.G., Yazdi, P.G. Retracing Human Genetic Histories and Natural Selection Using Precise Local Ancestry Inference. bioRxiv 2023.09.11.557177; doi: https://doi.org/10.1101/2023.09.11.557177
  • Cuadros-Espinoza, S., Laval, G., Quintana-Murci, L., Patin, E. The genomic signatures of natural selection in admixed human populations. Am. J. Hum. Genet. 109, 710-726 (2022). doi: 10.1016/j.ajhg.2022.02.011; pmid: 35259336

Non-Commercial Use License

Version 1.0

NOTICE

This software is provided free of charge for academic research use only. Any use by commercial entities, for-profit organizations, or consultants is strictly prohibited without prior authorization. For inquiries about commercial licensing, contact [email protected].
