Skip to content

Feature parallelization#53

Merged
ozlemmuslu merged 32 commits into
masterfrom
feature-parallelization
Apr 10, 2026
Merged

Feature parallelization#53
ozlemmuslu merged 32 commits into
masterfrom
feature-parallelization

Conversation

@ozlemmuslu
Copy link
Copy Markdown
Member

vafator 3.1.0

Performance

~13x runtime improvement over v2 on a 292K variant VCF (SEQC2 WES WES_EA_1, 5.5 hours → 23 minutes with 4 cores).

  • Streaming pileup iterator — replaced per-variant bam.pileup() calls with a single pileup iterator per chromosome per BAM. This eliminates the dominant source of overhead (repeated pysam pileup object construction and destruction) and reduces runtime ~4.4x on its own.
  • Power calculation caching_calculate_k and calculate_expected_vaf results are now cached per depth value and per (sample, chrom, pos) respectively, avoiding redundant scipy.stats.binom calls across variants with identical coverage.
  • Rank sum test early exitcalculate_rank_sum_test now returns nan immediately when either distribution is empty, skipping unnecessary scipy.stats.ranksums calls.
  • Chromosome-level parallelization — added --num-processes flag to distribute chromosomes across worker processes using ProcessPoolExecutor. Each worker opens its own BAM readers independently, avoiding shared state issues.

Python upgrade

  • Python upgraded to 3.11
  • pysam pinned to ==0.21.0 — versions 0.22+ have a regression where query_qualities returns incorrect values.

Breaking changes

  • --include-ambiguous-bases renamed to --exclude-ambiguous-bases — ambiguous bases (N and all IUPAC codes) are now included in depth of coverage by default. To restore the old behaviour of excluding them, pass --exclude-ambiguous-bases. This flag is inverted from the previous version.
  • Zero-coverage variants_bq, _mq, _pos annotations for variants with zero coverage previously reported nan; they now report 0.0. This affects a small number of variants (9/292,009 in the tested VCF).

Code quality

  • VafatorVariant replaced by VariantRecord — the test utility class VafatorVariant (from vafator.tests.utils) is now a thin factory function returning a VariantRecord.
  • Async VCF writing removed — the @background / asyncio / time.sleep(2) writing pattern was causing missing variants in the output. Replaced with synchronous batch writing.
  • Type hints and docstrings added throughout annotator.py, pileups.py, and pileup_utils.py.

* Reprodubility of the output confirmed using WES_EA_1 sample from SEQC2 - a more comprehensive test for INDELs may be necessary
* Due ensure reprodubility, the methods get_variant_pileup and get_snv_metrics are updated.
* pysam could only be upgraded till 0.21.0, as above this version (up to and including 0.23.3) there are inconsistencies in base qualities. Python was therefore set to 3.11
replace per-variant pileup() calls with a
single streaming pileup iterator per BAM per chromosome. Variants are
buffered by chromosome and metrics are computed immediately as each
pileup column is visited (avoids segfault from storing invalidated
PileupColumn C objects). Reduces pysam pileup __init__/__dealloc__
overhead from ~80% of total runtime to negligible.
cache _calculate_k() results by dp value (same depth
  repeats across thousands of variants). Cache calculate_expected_vaf()
  by (sample, chrom, pos). Replace frozen binom(n, f).pmf(k) object
  instantiation with direct binom.pmf(k, n, f) call to avoid scipy
  distribution object overhead on every variant.
- annotator.py: add --num-processes parameter (default: 1, serial behaviour
  unchanged). When >1, _run_parallel() submits one future per chromosome to
  a ProcessPoolExecutor. Workers receive only picklable data: BAM file paths
  and variant tuples (POS, REF, ALT[0]) — cyvcf2.Variant objects and pysam
  AlignmentFile handles are not picklable and stay in the main process.

- annotator.py: add module-level _collect_metrics_worker() — must be
  module-level to be picklable by ProcessPoolExecutor. Workers open their
  own pysam.AlignmentFile instances and reconstruct VariantRecord objects
  from the serialized tuples before calling collect_metrics_for_chrom().

- annotator.py: refactor run() into _run_serial(), _run_parallel(),
  _collect_chrom_metrics(), and _annotate_and_batch() for clarity. VCF
  output order is preserved by collecting futures in submission order before
  annotating.

- pileups.py: replace test utility import (vafator.tests.utils.VafatorVariant)
  with a proper VariantRecord dataclass defined in pileups.py. Picklable by
  default, mirrors the .CHROM/.POS/.REF/.ALT interface of cyvcf2.Variant.

- pileups.py: add safe_median() helper to guard np.median() calls against
  empty lists, eliminating RuntimeWarning: Mean of empty slice on positions
  with no reads supporting a particular allele.

- command_line.py: wire --num-processes argument through to Annotator. remove unnecessary try/catch blocks as they would disable the traceback
…guous-bases and make inclusion of ambiguous the default behaviour for accurate depth calculation.

This replicates the results of version 2.2.0, but makes it more explicit
1. introduce constants.py and pileup_utils.py to contain constants and helper functions
2. improve readability for VCF header writing
3. add type hints and pydocs
Copy link
Copy Markdown
Contributor

@LKress LKress left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks! Only minor questions and suggestions. Feel free to request review after checking.

Edit: Maybe one more thing, check also the "high" issues reported by Codacity to ensure code readability and quality.

Comment thread vafator/tests/test_pileups.py Outdated
@@ -1,10 +1,13 @@
from collections import Counter
from unittest import TestCase
from unittest.mock import MagicMock
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used here?

Comment thread requirements.txt
pybedtools~=0.9.0
numpy>=1.20,<2.0
scipy>=1.0.0,<2.0.0 No newline at end of file
pandas>=3.0.1,<4
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is pandas used at all? Pandas v3 introduced some major changes that break backward compatibility. If pandas is used, we should check that is works as intended.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pandas is used in hachet2bed, ploidies, and vafator2decifer.

My test runs would not include these, I'm not sure how well they are covered in the unit/integration tests either

Comment thread setup.py
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for cleaning this up!

Comment thread vafator/__init__.py Outdated


AMBIGUOUS_BASES = ['N', 'M', 'R', 'W', 'S', 'Y', 'K', 'V', 'H', 'D', 'B']
VERSION = '3.1.0'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this generate v3.1.0? I see that these are breaking changes but wouldn't v3.0.0 be more appropriate?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bumped the version to 3.0.0 before I made many of the changes, so I thought bumping again would be more appropriate

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there was no release with v3.0.0 it would be best practice to use v3.0.0 here

Comment thread vafator/pileups.py
pass

pileup_reads = pileup_col.pileups
dp = len(pileup_reads)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would result in 0 for a if there are no reads covering the position/region, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, we need to discuss this

Comment thread vafator/pileups.py Outdated
index = pileup_read.alignment.reference_start
relative_position = 0
for cigar_type, cigar_length in pileup_read.alignment.cigartuples:
if cigar_type in [0, 2, 3, 7, 8]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the comment with the information what is consumed

Comment thread vafator/pileups.py Outdated
index += cigar_length
if index > variant_position:
break
if cigar_type in [0, 1, 4, 7, 8]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Comment thread vafator/pileups.py
mq[alt_upper].append(pileup_read.alignment.mapping_quality)
pos[alt_upper].append(pileup_read.query_position_or_next)
elif pileup_read.indel == 0:
mq[variant.REF].append(pileup_read.alignment.mapping_quality)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here: I would keep the NOTE comment

Comment thread vafator/pileups.py Outdated
if pileup_read.indel < 0:
start = pileup_read.alignment.reference_start
for cigar_type, cigar_length in pileup_read.alignment.cigartuples:
if cigar_type in [0, 3, 7, 8]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep comment

Comment thread vafator/pileups.py
if start > variant_position:
break
elif pileup_read.indel == 0:
mq[variant.REF].append(pileup_read.alignment.mapping_quality)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also keep the comment

@ibn-salem
Copy link
Copy Markdown
Member

Thanks for all the changes. I did not review the code in detail, but it makes all sense, except this point:

Zero-coverage variants — _bq, _mq, _pos annotations for variants with zero coverage previously reported nan; they now report 0.0. This affects a small number of variants (9/292,009 in the tested VCF).

What is the reason for this change? I would prefere to keep NaN. Thanks!

Copy link
Copy Markdown
Contributor

@LKress LKress left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just codacity fixed

Comment thread vafator/annotator.py Outdated
v.INFO["{}_dp".format(s)] = gdp
v.INFO["{}_eaf".format(s)] = str(self.power.calculate_expected_vaf(sample=s, variant=v))
v.INFO["{}_pu".format(s)] = ",".join([str(self.power.calculate_power(ac=gac[alt], dp=gdp, sample=s, variant=v)) for alt in v.ALT])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Comment thread vafator/annotator.py Outdated
for suffix, description, typ, number in _HEADER_TEMPLATES:
headers.append(Annotator._make_header(suffix, description, typ, number, sample=s))
if len(bams) > 1:
for i, bam in enumerate(bams, start=1):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for i, bam in enumerate(bams, start=1):
for i, _ in enumerate(bams, start=1):

Or just iterate over len(bams)

@ozlemmuslu
Copy link
Copy Markdown
Member Author

Thanks for all the changes. I did not review the code in detail, but it makes all sense, except this point:

Zero-coverage variants — _bq, _mq, _pos annotations for variants with zero coverage previously reported nan; they now report 0.0. This affects a small number of variants (9/292,009 in the tested VCF).

What is the reason for this change? I would prefere to keep NaN. Thanks!

Hi @ibn-salem,

This was a side effect of upgrading and the code edits that were necessitated after the upgrade. I would like to discuss this with you and come up with a universal solution for cases when coverage is 0. Currently, other computations (such as the median base quality/mapping quality/position of the reference or alternate allele) also return 0.0 and not NaN, so this is more consistent.

I agree that they should all return NaN, so I opened an issue for it: #51 . I would suggest to release this version, and make these changes with 0.0/NaN in the next version

@ozlemmuslu ozlemmuslu merged commit d67f9b5 into master Apr 10, 2026
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants