Feature parallelization#53
Conversation
* Reprodubility of the output confirmed using WES_EA_1 sample from SEQC2 - a more comprehensive test for INDELs may be necessary * Due ensure reprodubility, the methods get_variant_pileup and get_snv_metrics are updated. * pysam could only be upgraded till 0.21.0, as above this version (up to and including 0.23.3) there are inconsistencies in base qualities. Python was therefore set to 3.11
replace per-variant pileup() calls with a single streaming pileup iterator per BAM per chromosome. Variants are buffered by chromosome and metrics are computed immediately as each pileup column is visited (avoids segfault from storing invalidated PileupColumn C objects). Reduces pysam pileup __init__/__dealloc__ overhead from ~80% of total runtime to negligible.
cache _calculate_k() results by dp value (same depth repeats across thousands of variants). Cache calculate_expected_vaf() by (sample, chrom, pos). Replace frozen binom(n, f).pmf(k) object instantiation with direct binom.pmf(k, n, f) call to avoid scipy distribution object overhead on every variant.
- annotator.py: add --num-processes parameter (default: 1, serial behaviour unchanged). When >1, _run_parallel() submits one future per chromosome to a ProcessPoolExecutor. Workers receive only picklable data: BAM file paths and variant tuples (POS, REF, ALT[0]) — cyvcf2.Variant objects and pysam AlignmentFile handles are not picklable and stay in the main process. - annotator.py: add module-level _collect_metrics_worker() — must be module-level to be picklable by ProcessPoolExecutor. Workers open their own pysam.AlignmentFile instances and reconstruct VariantRecord objects from the serialized tuples before calling collect_metrics_for_chrom(). - annotator.py: refactor run() into _run_serial(), _run_parallel(), _collect_chrom_metrics(), and _annotate_and_batch() for clarity. VCF output order is preserved by collecting futures in submission order before annotating. - pileups.py: replace test utility import (vafator.tests.utils.VafatorVariant) with a proper VariantRecord dataclass defined in pileups.py. Picklable by default, mirrors the .CHROM/.POS/.REF/.ALT interface of cyvcf2.Variant. - pileups.py: add safe_median() helper to guard np.median() calls against empty lists, eliminating RuntimeWarning: Mean of empty slice on positions with no reads supporting a particular allele. - command_line.py: wire --num-processes argument through to Annotator. remove unnecessary try/catch blocks as they would disable the traceback
…guous-bases and make inclusion of ambiguous the default behaviour for accurate depth calculation. This replicates the results of version 2.2.0, but makes it more explicit
1. introduce constants.py and pileup_utils.py to contain constants and helper functions 2. improve readability for VCF header writing 3. add type hints and pydocs
| @@ -1,10 +1,13 @@ | |||
| from collections import Counter | |||
| from unittest import TestCase | |||
| from unittest.mock import MagicMock | |||
| pybedtools~=0.9.0 | ||
| numpy>=1.20,<2.0 | ||
| scipy>=1.0.0,<2.0.0 No newline at end of file | ||
| pandas>=3.0.1,<4 |
There was a problem hiding this comment.
Is pandas used at all? Pandas v3 introduced some major changes that break backward compatibility. If pandas is used, we should check that is works as intended.
There was a problem hiding this comment.
pandas is used in hachet2bed, ploidies, and vafator2decifer.
My test runs would not include these, I'm not sure how well they are covered in the unit/integration tests either
There was a problem hiding this comment.
Nice, thanks for cleaning this up!
|
|
||
|
|
||
| AMBIGUOUS_BASES = ['N', 'M', 'R', 'W', 'S', 'Y', 'K', 'V', 'H', 'D', 'B'] | ||
| VERSION = '3.1.0' |
There was a problem hiding this comment.
Why does this generate v3.1.0? I see that these are breaking changes but wouldn't v3.0.0 be more appropriate?
There was a problem hiding this comment.
I bumped the version to 3.0.0 before I made many of the changes, so I thought bumping again would be more appropriate
There was a problem hiding this comment.
If there was no release with v3.0.0 it would be best practice to use v3.0.0 here
| pass | ||
|
|
||
| pileup_reads = pileup_col.pileups | ||
| dp = len(pileup_reads) |
There was a problem hiding this comment.
This would result in 0 for a if there are no reads covering the position/region, right?
There was a problem hiding this comment.
I think so, we need to discuss this
| index = pileup_read.alignment.reference_start | ||
| relative_position = 0 | ||
| for cigar_type, cigar_length in pileup_read.alignment.cigartuples: | ||
| if cigar_type in [0, 2, 3, 7, 8]: |
There was a problem hiding this comment.
I would keep the comment with the information what is consumed
| index += cigar_length | ||
| if index > variant_position: | ||
| break | ||
| if cigar_type in [0, 1, 4, 7, 8]: |
| mq[alt_upper].append(pileup_read.alignment.mapping_quality) | ||
| pos[alt_upper].append(pileup_read.query_position_or_next) | ||
| elif pileup_read.indel == 0: | ||
| mq[variant.REF].append(pileup_read.alignment.mapping_quality) |
There was a problem hiding this comment.
Also here: I would keep the NOTE comment
| if pileup_read.indel < 0: | ||
| start = pileup_read.alignment.reference_start | ||
| for cigar_type, cigar_length in pileup_read.alignment.cigartuples: | ||
| if cigar_type in [0, 3, 7, 8]: |
| if start > variant_position: | ||
| break | ||
| elif pileup_read.indel == 0: | ||
| mq[variant.REF].append(pileup_read.alignment.mapping_quality) |
|
Thanks for all the changes. I did not review the code in detail, but it makes all sense, except this point:
What is the reason for this change? I would prefere to keep NaN. Thanks! |
| v.INFO["{}_dp".format(s)] = gdp | ||
| v.INFO["{}_eaf".format(s)] = str(self.power.calculate_expected_vaf(sample=s, variant=v)) | ||
| v.INFO["{}_pu".format(s)] = ",".join([str(self.power.calculate_power(ac=gac[alt], dp=gdp, sample=s, variant=v)) for alt in v.ALT]) | ||
|
|
| for suffix, description, typ, number in _HEADER_TEMPLATES: | ||
| headers.append(Annotator._make_header(suffix, description, typ, number, sample=s)) | ||
| if len(bams) > 1: | ||
| for i, bam in enumerate(bams, start=1): |
There was a problem hiding this comment.
| for i, bam in enumerate(bams, start=1): | |
| for i, _ in enumerate(bams, start=1): |
Or just iterate over len(bams)
Hi @ibn-salem, This was a side effect of upgrading and the code edits that were necessitated after the upgrade. I would like to discuss this with you and come up with a universal solution for cases when coverage is 0. Currently, other computations (such as the median base quality/mapping quality/position of the reference or alternate allele) also return 0.0 and not NaN, so this is more consistent. I agree that they should all return NaN, so I opened an issue for it: #51 . I would suggest to release this version, and make these changes with 0.0/NaN in the next version |
vafator 3.1.0
Performance
~13x runtime improvement over v2 on a 292K variant VCF (SEQC2 WES WES_EA_1, 5.5 hours → 23 minutes with 4 cores).
bam.pileup()calls with a single pileup iterator per chromosome per BAM. This eliminates the dominant source of overhead (repeated pysam pileup object construction and destruction) and reduces runtime ~4.4x on its own._calculate_kandcalculate_expected_vafresults are now cached per depth value and per (sample, chrom, pos) respectively, avoiding redundantscipy.stats.binomcalls across variants with identical coverage.calculate_rank_sum_testnow returnsnanimmediately when either distribution is empty, skipping unnecessaryscipy.stats.ranksumscalls.--num-processesflag to distribute chromosomes across worker processes usingProcessPoolExecutor. Each worker opens its own BAM readers independently, avoiding shared state issues.Python upgrade
==0.21.0— versions 0.22+ have a regression wherequery_qualitiesreturns incorrect values.Breaking changes
--include-ambiguous-basesrenamed to--exclude-ambiguous-bases— ambiguous bases (N and all IUPAC codes) are now included in depth of coverage by default. To restore the old behaviour of excluding them, pass--exclude-ambiguous-bases. This flag is inverted from the previous version._bq,_mq,_posannotations for variants with zero coverage previously reportednan; they now report0.0. This affects a small number of variants (9/292,009 in the tested VCF).Code quality
VafatorVariantreplaced byVariantRecord— the test utility classVafatorVariant(fromvafator.tests.utils) is now a thin factory function returning aVariantRecord.@background/asyncio/time.sleep(2)writing pattern was causing missing variants in the output. Replaced with synchronous batch writing.annotator.py,pileups.py, andpileup_utils.py.