Feature parallelization by ozlemmuslu · Pull Request #53 · TRON-Bioinformatics/vafator

ozlemmuslu · 2026-03-20T13:41:44Z

vafator 3.1.0

Performance

~13x runtime improvement over v2 on a 292K variant VCF (SEQC2 WES WES_EA_1, 5.5 hours → 23 minutes with 4 cores).

Streaming pileup iterator — replaced per-variant bam.pileup() calls with a single pileup iterator per chromosome per BAM. This eliminates the dominant source of overhead (repeated pysam pileup object construction and destruction) and reduces runtime ~4.4x on its own.
Power calculation caching — _calculate_k and calculate_expected_vaf results are now cached per depth value and per (sample, chrom, pos) respectively, avoiding redundant scipy.stats.binom calls across variants with identical coverage.
Rank sum test early exit — calculate_rank_sum_test now returns nan immediately when either distribution is empty, skipping unnecessary scipy.stats.ranksums calls.
Chromosome-level parallelization — added --num-processes flag to distribute chromosomes across worker processes using ProcessPoolExecutor. Each worker opens its own BAM readers independently, avoiding shared state issues.
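The rank sum early exit described above can be sketched as follows. This is a minimal illustration, not the actual calculate_rank_sum_test: the name rank_sum_or_nan and the injected test callable are hypothetical, and the statistical test is passed in so the sketch runs without scipy.

```python
import math
from typing import Callable, Sequence


def rank_sum_or_nan(
    alt_values: Sequence[float],
    ref_values: Sequence[float],
    test: Callable[[Sequence[float], Sequence[float]], float],
) -> float:
    """Return nan when either distribution is empty, otherwise run the test.

    Hypothetical helper: `test` stands in for the real statistical call,
    e.g. lambda a, b: scipy.stats.ranksums(a, b).pvalue
    """
    # Early exit: there is nothing to compare, so skip the expensive call.
    if len(alt_values) == 0 or len(ref_values) == 0:
        return math.nan
    return test(alt_values, ref_values)
```

Checking the lengths first means the test function is never invoked for alleles without supporting reads.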

Python upgrade

Python upgraded to 3.11
pysam pinned to ==0.21.0 — versions 0.22+ have a regression where query_qualities returns incorrect values.

Breaking changes

--include-ambiguous-bases renamed to --exclude-ambiguous-bases — ambiguous bases (N and all IUPAC codes) are now included in depth of coverage by default. To restore the old behaviour of excluding them, pass --exclude-ambiguous-bases. This flag is inverted from the previous version.
Zero-coverage variants — _bq, _mq, _pos annotations for variants with zero coverage previously reported nan; they now report 0.0. This affects a small number of variants (9/292,009 in the tested VCF).
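As an illustration of the inverted flag, an opt-out switch of this kind could be declared with argparse roughly as below. This is a sketch under assumptions: build_parser, the help text, and the destination name are hypothetical and not taken from command_line.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing only the renamed flag."""
    parser = argparse.ArgumentParser(prog="vafator")
    parser.add_argument(
        "--exclude-ambiguous-bases",
        dest="exclude_ambiguous_bases",
        action="store_true",  # absent -> False, so ambiguous bases are counted by default
        help="do not count ambiguous bases (N and IUPAC codes) in the depth of coverage",
    )
    return parser


default_args = build_parser().parse_args([])
opt_out_args = build_parser().parse_args(["--exclude-ambiguous-bases"])
```

With store_true, omitting the flag yields False, which matches the new default of including ambiguous bases.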

Code quality

VafatorVariant replaced by VariantRecord — the test utility class VafatorVariant (from vafator.tests.utils) is now a thin factory function returning a VariantRecord.
Async VCF writing removed — the @background / asyncio / time.sleep(2) writing pattern was causing missing variants in the output. Replaced with synchronous batch writing.
Type hints and docstrings added throughout annotator.py, pileups.py, and pileup_utils.py.
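A record type of the kind described above might look like the following sketch. The field types and the to_tuple/from_tuple helpers are assumptions for illustration; only the .CHROM/.POS/.REF/.ALT attribute names and the (POS, REF, ALT[0]) tuple layout come from this pull request.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VariantRecord:
    """Plain data holder mirroring the attribute names of cyvcf2.Variant."""

    CHROM: str
    POS: int
    REF: str
    ALT: List[str]


def to_tuple(variant: VariantRecord) -> Tuple[int, str, str]:
    """Hypothetical helper: reduce a record to the (POS, REF, ALT[0]) tuple."""
    return (variant.POS, variant.REF, variant.ALT[0])


def from_tuple(chrom: str, fields: Tuple[int, str, str]) -> VariantRecord:
    """Hypothetical helper: rebuild a record from a tuple plus its chromosome."""
    pos, ref, alt = fields
    return VariantRecord(CHROM=chrom, POS=pos, REF=ref, ALT=[alt])
```

A module-level dataclass holding only built-in types can be pickled without extra code, which is what makes it suitable for handing to worker processes.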

* Reproducibility of the output was confirmed using the WES_EA_1 sample from SEQC2; a more comprehensive test for INDELs may be necessary.
* To ensure reproducibility, the methods get_variant_pileup and get_snv_metrics were updated.
* pysam could only be upgraded up to 0.21.0, because later versions (up to and including 0.23.3) show inconsistencies in base qualities. Python was therefore set to 3.11.

replace per-variant pileup() calls with a single streaming pileup iterator per BAM per chromosome. Variants are buffered by chromosome and metrics are computed immediately as each pileup column is visited (avoids segfault from storing invalidated PileupColumn C objects). Reduces pysam pileup __init__/__dealloc__ overhead from ~80% of total runtime to negligible.
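The streaming approach in this commit message amounts to a single forward pass that matches sorted pileup columns against sorted variant positions. The sketch below shows that pattern with plain Python iterables. stream_metrics and its arguments are hypothetical; in the real code the columns would come from a pysam pileup iterator rather than a list of tuples.

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

Column = TypeVar("Column")
Metrics = TypeVar("Metrics")


def stream_metrics(
    columns: Iterable[Tuple[int, Column]],
    positions: Iterable[int],
    compute: Callable[[Column], Metrics],
) -> Dict[int, Metrics]:
    """Walk position-sorted columns once and compute metrics at variant positions.

    Hypothetical sketch: `columns` yields (position, column) pairs in
    ascending order; `positions` are the variant positions on one chromosome.
    """
    results: Dict[int, Metrics] = {}
    wanted = iter(sorted(set(positions)))
    target = next(wanted, None)
    for pos, column in columns:
        # Skip variants the iterator has already passed (no column, zero coverage).
        while target is not None and target < pos:
            target = next(wanted, None)
        if target is None:
            break  # no variants left on this chromosome
        if pos == target:
            # Compute immediately; the column object is not kept after this step.
            results[pos] = compute(column)
            target = next(wanted, None)
    return results
```

Computing the metrics inside the loop, instead of collecting columns for later, mirrors the point made above about not storing column objects that become invalid once the iterator advances.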

cache _calculate_k() results by dp value (same depth repeats across thousands of variants). Cache calculate_expected_vaf() by (sample, chrom, pos). Replace frozen binom(n, f).pmf(k) object instantiation with direct binom.pmf(k, n, f) call to avoid scipy distribution object overhead on every variant.
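Caching by depth can be sketched with functools.lru_cache. The function below is not the actual _calculate_k: its name, the error_rate and alpha parameters, and the threshold definition are all assumptions chosen for illustration, and math.comb replaces scipy's binom.pmf so the sketch has no dependencies.

```python
import math
from functools import lru_cache


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass, computed directly without a distribution object."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


@lru_cache(maxsize=None)
def min_supporting_reads(dp: int, error_rate: float = 0.001, alpha: float = 0.05) -> int:
    """Smallest k with P(X >= k) <= alpha for X ~ Binomial(dp, error_rate).

    Hypothetical stand-in for a per-depth threshold; results are cached by
    argument values, so a repeated depth costs one dictionary lookup.
    """
    tail = 1.0  # P(X >= 0)
    for k in range(dp + 1):
        if tail <= alpha:
            return k
        tail -= binom_pmf(k, dp, error_rate)  # now P(X >= k + 1)
    return dp + 1
```

Because the same depth value recurs across many variants, most calls are served from the cache instead of recomputing the binomial tail.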

- annotator.py: add --num-processes parameter (default: 1, serial behaviour unchanged). When >1, _run_parallel() submits one future per chromosome to a ProcessPoolExecutor. Workers receive only picklable data: BAM file paths and variant tuples (POS, REF, ALT[0]). cyvcf2.Variant objects and pysam AlignmentFile handles are not picklable and stay in the main process.
- annotator.py: add module-level _collect_metrics_worker(); it must be module-level to be picklable by ProcessPoolExecutor. Workers open their own pysam.AlignmentFile instances and reconstruct VariantRecord objects from the serialized tuples before calling collect_metrics_for_chrom().
- annotator.py: refactor run() into _run_serial(), _run_parallel(), _collect_chrom_metrics(), and _annotate_and_batch() for clarity. VCF output order is preserved by collecting futures in submission order before annotating.
- pileups.py: replace test utility import (vafator.tests.utils.VafatorVariant) with a proper VariantRecord dataclass defined in pileups.py. Picklable by default, mirrors the .CHROM/.POS/.REF/.ALT interface of cyvcf2.Variant.
- pileups.py: add safe_median() helper to guard np.median() calls against empty lists, eliminating RuntimeWarning: Mean of empty slice on positions with no reads supporting a particular allele.
- command_line.py: wire --num-processes argument through to Annotator.
- remove unnecessary try/except blocks, as they suppressed the traceback
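The per-chromosome fan-out described above can be sketched as follows. This is a simplified illustration rather than the code in annotator.py: collect_metrics_worker and run_parallel are hypothetical stand-ins, the worker returns dummy values instead of reading BAM files, and the executor class is injectable only so the sketch can also be exercised with threads.

```python
from concurrent.futures import Executor, ProcessPoolExecutor, ThreadPoolExecutor
from typing import Dict, List, Tuple, Type

VariantTuple = Tuple[int, str, str]  # (POS, REF, ALT[0])


def collect_metrics_worker(
    chrom: str, bam_paths: List[str], variants: List[VariantTuple]
) -> Tuple[str, List[Tuple[int, str, str, int]]]:
    """Module-level function so a process pool can pickle a reference to it.

    Hypothetical stand-in: a real worker would open its own BAM readers from
    `bam_paths` here. This one only records how many paths it received.
    """
    return chrom, [(pos, ref, alt, len(bam_paths)) for pos, ref, alt in variants]


def run_parallel(
    variants_by_chrom: Dict[str, List[VariantTuple]],
    bam_paths: List[str],
    num_processes: int,
    executor_cls: Type[Executor] = ProcessPoolExecutor,
) -> List[Tuple[str, List[Tuple[int, str, str, int]]]]:
    """Submit one task per chromosome and return results in submission order."""
    with executor_cls(max_workers=num_processes) as pool:
        futures = [
            pool.submit(collect_metrics_worker, chrom, bam_paths, variants)
            for chrom, variants in variants_by_chrom.items()
        ]
        # Iterating the futures list (not as_completed) keeps the input order.
        return [future.result() for future in futures]
```

Reading results from the futures list in the order they were submitted, rather than in completion order, is what keeps the output order stable regardless of which chromosome finishes first. When the default ProcessPoolExecutor is used from a script, the call should sit under an `if __name__ == "__main__":` guard.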

LKress

Many thanks! Only minor questions and suggestions. Feel free to request review after checking.

Edit: Maybe one more thing: please also check the "high" issues reported by Codacy to ensure code readability and quality.

LKress · 2026-03-27T18:59:16Z

@@ -1,10 +1,13 @@
 from collections import Counter
 from unittest import TestCase
+from unittest.mock import MagicMock


is this used here?

LKress · 2026-03-27T21:19:00Z

-pybedtools~=0.9.0
-numpy>=1.20,<2.0
-scipy>=1.0.0,<2.0.0
+pandas>=3.0.1,<4


Is pandas used at all? Pandas v3 introduced some major changes that break backward compatibility. If pandas is used, we should check that it works as intended.

pandas is used in hachet2bed, ploidies, and vafator2decifer.

My test runs would not include these, I'm not sure how well they are covered in the unit/integration tests either

LKress · 2026-03-27T21:19:56Z

Nice, thanks for cleaning this up!

LKress · 2026-03-30T11:50:05Z

-
-
-AMBIGUOUS_BASES = ['N', 'M', 'R', 'W', 'S', 'Y', 'K', 'V', 'H', 'D', 'B']
+VERSION = '3.1.0'


Why does this generate v3.1.0? I see that these are breaking changes but wouldn't v3.0.0 be more appropriate?

I bumped the version to 3.0.0 before I made many of the changes, so I thought bumping again would be more appropriate

If there was no release with v3.0.0 it would be best practice to use v3.0.0 here

LKress · 2026-03-30T12:16:27Z

-        pass
+
+    pileup_reads = pileup_col.pileups
+    dp = len(pileup_reads)


This would result in 0 for a if there are no reads covering the position/region, right?

I think so, we need to discuss this

LKress · 2026-03-30T12:17:35Z

+            index = pileup_read.alignment.reference_start
+            relative_position = 0
+            for cigar_type, cigar_length in pileup_read.alignment.cigartuples:
+                if cigar_type in [0, 2, 3, 7, 8]:


I would keep the comment with the information what is consumed

LKress · 2026-03-30T12:17:43Z

+                    index += cigar_length
+                    if index > variant_position:
+                        break
+                if cigar_type in [0, 1, 4, 7, 8]:


same as above
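For reference, the two operation lists in these hunks match the SAM specification's split between operations that consume the reference (M, D, N, =, X) and operations that consume the query (M, I, S, =, X). The sketch below names them explicitly. The constant and function names are illustrative and not taken from pileups.py.

```python
from typing import List, Tuple

# CIGAR operation codes in SAM/BAM order: M I D N S H P = X
CIGAR_M, CIGAR_I, CIGAR_D, CIGAR_N, CIGAR_S, CIGAR_H, CIGAR_P, CIGAR_EQ, CIGAR_X = range(9)

# Operations that advance the position on the reference: 0, 2, 3, 7, 8
CONSUMES_REFERENCE = frozenset({CIGAR_M, CIGAR_D, CIGAR_N, CIGAR_EQ, CIGAR_X})
# Operations that advance the position in the read: 0, 1, 4, 7, 8
CONSUMES_QUERY = frozenset({CIGAR_M, CIGAR_I, CIGAR_S, CIGAR_EQ, CIGAR_X})


def reference_end(reference_start: int, cigartuples: List[Tuple[int, int]]) -> int:
    """0-based exclusive end of the alignment on the reference."""
    return reference_start + sum(
        length for op, length in cigartuples if op in CONSUMES_REFERENCE
    )


def query_length(cigartuples: List[Tuple[int, int]]) -> int:
    """Number of read bases described by the CIGAR, including soft clips."""
    return sum(length for op, length in cigartuples if op in CONSUMES_QUERY)
```

Named sets like these carry the same information as the comments requested in the review, so either form documents which coordinate each operation advances.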

LKress · 2026-03-30T12:20:15Z

+                        mq[alt_upper].append(pileup_read.alignment.mapping_quality)
+                        pos[alt_upper].append(pileup_read.query_position_or_next)
+        elif pileup_read.indel == 0:
+            mq[variant.REF].append(pileup_read.alignment.mapping_quality)


Also here: I would keep the NOTE comment

LKress · 2026-03-30T12:20:56Z

+        if pileup_read.indel < 0:
+            start = pileup_read.alignment.reference_start
+            for cigar_type, cigar_length in pileup_read.alignment.cigartuples:
+                if cigar_type in [0, 3, 7, 8]:


Keep comment

LKress · 2026-03-30T12:22:36Z

+                if start > variant_position:
+                    break
+        elif pileup_read.indel == 0:
+            mq[variant.REF].append(pileup_read.alignment.mapping_quality)


also keep the comment

ibn-salem · 2026-04-02T13:35:11Z

Thanks for all the changes. I did not review the code in detail, but it all makes sense, except this point:

Zero-coverage variants — _bq, _mq, _pos annotations for variants with zero coverage previously reported nan; they now report 0.0. This affects a small number of variants (9/292,009 in the tested VCF).

What is the reason for this change? I would prefer to keep NaN. Thanks!

LKress

Just Codacy fixed

LKress · 2026-04-06T11:13:26Z

+        v.INFO["{}_dp".format(s)] = gdp
+        v.INFO["{}_eaf".format(s)] = str(self.power.calculate_expected_vaf(sample=s, variant=v))
+        v.INFO["{}_pu".format(s)] = ",".join([str(self.power.calculate_power(ac=gac[alt], dp=gdp, sample=s, variant=v)) for alt in v.ALT])
+


Suggested change

LKress · 2026-04-06T11:17:13Z

+            for suffix, description, typ, number in _HEADER_TEMPLATES:
+                headers.append(Annotator._make_header(suffix, description, typ, number, sample=s))
+            if len(bams) > 1:
+                for i, bam in enumerate(bams, start=1):


Suggested change

-                for i, bam in enumerate(bams, start=1):
+                for i, _ in enumerate(bams, start=1):

Or just iterate over range(1, len(bams) + 1)

ozlemmuslu · 2026-04-07T12:32:49Z

Thanks for all the changes. I did not review the code in detail, but it all makes sense, except this point:

Zero-coverage variants — _bq, _mq, _pos annotations for variants with zero coverage previously reported nan; they now report 0.0. This affects a small number of variants (9/292,009 in the tested VCF).

What is the reason for this change? I would prefer to keep NaN. Thanks!

Hi @ibn-salem,

This was a side effect of upgrading and the code edits that were necessitated after the upgrade. I would like to discuss this with you and come up with a universal solution for cases when coverage is 0. Currently, other computations (such as the median base quality/mapping quality/position of the reference or alternate allele) also return 0.0 and not NaN, so this is more consistent.

I agree that they should all return NaN, so I opened an issue for it: #51. I suggest releasing this version and making the 0.0/NaN changes in the next one.
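One way to make the zero-coverage behaviour explicit is a median helper whose empty-input value is a parameter. The sketch below is an assumption-laden illustration, not the safe_median in pileups.py: it uses statistics.median in place of np.median, and the `empty` parameter is hypothetical.

```python
import math
from statistics import median
from typing import Sequence


def safe_median(values: Sequence[float], empty: float = math.nan) -> float:
    """Median of `values`, or `empty` when there is nothing to summarise.

    Hypothetical sketch: the guard avoids calling the median on an empty
    list, and the returned placeholder (nan or 0.0) is an explicit choice.
    """
    if len(values) == 0:
        return empty
    return float(median(values))
```

Keeping the placeholder in one helper means a later switch between 0.0 and NaN would be a single change rather than one per annotation.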

ozlemmuslu added 22 commits March 13, 2026 10:07

Update gitignore

34f41a8

Introduce null check before rank sum test

2a69e4d

return nan in safe_median to repeat previous results

9d3259e

breaking: change cli argument include-ambiguous-bases to exclude-ambi…

a4f2288

…guous-bases and make inclusion of ambiguous the default behaviour for accurate depth calculation. This replicates the results of version 2.2.0, but makes it more explicit

Setup

8f62e75

Update unit tests

8f8f1c3

Update python version in tests

c2370d2

more flexible Python version

9e3d099

more flexible Python version

534ebc0

update test with new method

8ab4514

Fix tests

82606ed

fix insertion tests

f5ea6b2

attempt to fix test

abefa7b

remove insertion/deletion tests. will add an issue to the repository

541542f

Refactor the code for better readability

cd8bb35

1. introduce constants.py and pileup_utils.py to contain constants and helper functions
2. improve readability for VCF header writing
3. add type hints and pydocs

Fix VCF INFO field error

e198540

Fix wrong tag

5e3344d

bump version

6db73d6

ozlemmuslu requested a review from LKress March 20, 2026 13:41

ozlemmuslu assigned ibn-salem Mar 20, 2026

ozlemmuslu requested a review from ibn-salem March 20, 2026 13:42

ozlemmuslu assigned ozlemmuslu and unassigned ibn-salem Mar 20, 2026

Readd test that was accidentally deleted

12f017a

LKress requested changes Mar 30, 2026

improve code quality based on Codacy output

4b6a272

ozlemmuslu added 2 commits April 2, 2026 12:56

more codacy improvements

8b80837

Re-add comments

cf31f1e

LKress approved these changes Apr 7, 2026

ozlemmuslu added 2 commits April 8, 2026 15:20

black code formatting

8803b48

fix unused variable bam

a3e5e20

ibn-salem approved these changes Apr 8, 2026

ozlemmuslu added 4 commits April 9, 2026 11:37

fix Codacy errors

2bcc602

add seed for random class. fixes #54

c0254d0

change version to 3.0.0

eb17614

add support for CRAM files. closes #45

f239072

ozlemmuslu merged commit d67f9b5 into master Apr 10, 2026
2 of 3 checks passed


