A high-performance CLI tool for automating the validation and comparison of biological datasets. This tool calculates accuracy metrics between Reference Alignments and Realignments using vectorized NumPy operations.
For each pair of alignments, the script:
- Reads the FASTA files and normalizes the sequences.
- Converts each alignment into a numeric “coordinate matrix,” where gaps are 0 and nucleotides are assigned their 1‑based ungapped position.
- Compares the two matrices to count how many positions differ.
- Calculates an overall structural accuracy score.
- Writes all results to a clean CSV file.
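
The encoding and comparison steps above can be sketched with NumPy as follows. This is an illustrative sketch, not the code in msa_scorer.py: the function names, the upper-casing used as "normalization", the gap characters `-` and `.`, and the requirement that both alignments have the same shape and sequence order are all assumptions.

```python
import numpy as np


def coordinate_matrix(sequences, gap_chars="-."):
    """Encode aligned sequences as integers: 0 for a gap, otherwise the
    residue's 1-based position within its own ungapped sequence.

    Assumes all sequences have equal length (a valid alignment)."""
    # Rows are sequences, columns are alignment columns.
    chars = np.array([list(s.upper()) for s in sequences])
    is_residue = ~np.isin(chars, list(gap_chars))
    # A running count of residues along each row gives the ungapped position.
    positions = np.cumsum(is_residue, axis=1)
    return np.where(is_residue, positions, 0)


def compare(ref_sequences, real_sequences):
    """Return (differences, total_positions, accuracy_percent).

    Assumption: both alignments have identical shape and row order."""
    ref = coordinate_matrix(ref_sequences)
    real = coordinate_matrix(real_sequences)
    if ref.shape != real.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {real.shape}")
    differences = int(np.count_nonzero(ref != real))
    total = int(ref.size)
    accuracy = 100.0 * (total - differences) / total
    return differences, total, accuracy


if __name__ == "__main__":
    reference = ["AC-GT", "A-CGT"]
    realigned = ["ACG-T", "A-CGT"]
    print(coordinate_matrix(reference))
    print(compare(reference, realigned))
```

In this toy pair, only the first sequence changes: the gap moves one column, so two cells of the matrix differ out of ten.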
Requires Python 3.10+.
Install dependencies:
    pip install biopython numpy natsort

    usage: msa_scorer.py [-h] -r REF -a REAL [-o OUTPUT] [-e EXTENSIONS [EXTENSIONS ...]] [-v] [--strict]

    options:
      -h, --help            show this help message and exit
      -r REF, --ref REF     Reference alignment directory
      -a REAL, --real REAL  Realignment directory
      -o OUTPUT, --output OUTPUT
                            Output CSV path (default: results.csv)
      -v, --verbose         Enable debug logging
      --strict              Exit pipeline on first error

Compare two directories of alignments:
    python msa_scorer.py --ref ref_dir --real real_dir

Save results under a custom filename:

    python msa_scorer.py -r ref -a real -o comparison_results.csv

Enable debugging output:

    python msa_scorer.py -r ref -a real -v

Stop immediately on error:

    python msa_scorer.py -r ref -a real --strict

Specify your own set of acceptable file extensions:

    python msa_scorer.py -r ref -a real -e .fa .fasta .aln

The script writes a CSV with one row per paired comparison. Each row includes:
- simulation_id
- reference_file
- realignment_file
- total_differences
- total_positions
- accuracy_percent
- sequences_count
- alignment_length
Example:
0, sample1.fasta, sample1.fa, 42, 10500, 99.60, 12, 880
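
The accuracy value in the example row is consistent with 100 × (total_positions − total_differences) / total_positions, since 100 × (10500 − 42) / 10500 = 99.60. The sketch below builds and writes such a row with the standard `csv` module. The formula is inferred from the example rather than confirmed by the source, and the helper names `make_row` and `write_results` and the two-decimal formatting are hypothetical.

```python
import csv
import io

# Column names as listed for the output CSV.
COLUMNS = [
    "simulation_id",
    "reference_file",
    "realignment_file",
    "total_differences",
    "total_positions",
    "accuracy_percent",
    "sequences_count",
    "alignment_length",
]


def make_row(simulation_id, reference_file, realignment_file,
             differences, total, sequences_count, alignment_length):
    """Build one result row; the accuracy formula is an assumption
    inferred from the example row."""
    accuracy = 100.0 * (total - differences) / total
    return {
        "simulation_id": simulation_id,
        "reference_file": reference_file,
        "realignment_file": realignment_file,
        "total_differences": differences,
        "total_positions": total,
        "accuracy_percent": f"{accuracy:.2f}",
        "sequences_count": sequences_count,
        "alignment_length": alignment_length,
    }


def write_results(rows, handle):
    """Write a header line followed by one CSV line per comparison."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    row = make_row(0, "sample1.fasta", "sample1.fa", 42, 10500, 12, 880)
    write_results([row], buffer)
    print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`, as the `csv` module documentation recommends.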