Multiprocessing and timeouts #8

Merged: 10 commits, May 7, 2024

Contributor

@theferrit32 theferrit32 commented Apr 30, 2024

  • Adds catvar_combiner.py (which can be adapted and generalized later) to combine several NDJSON files into a single file containing one JSON document, keyed by the id value from each line of the NDJSON files.
  • Adds logic to generate a local relative path for caching gs:// files. The default is buckets/<bucket>/<blob-prefix>/<blob-basename>, e.g. "gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz" gets cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
  • Adds logic to re-download a gs:// file only if it doesn't already exist in the default local cache directory.
  • Writes output files to the output directory, under the same relative path as the input file, e.g. gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz gets cached to ./buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz and the output gets written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
  • Adds optional parallelism: partitions the input file into N files with equal numbers of lines and runs a process over each partition. The number of partitions is set with the CLI arg --parallelism
  • When --parallelism is not 0, each task (e.g. process_line(line) for each line of input) also runs in a separate process that can be interrupted after a timeout. This lets us stop normalization of variants that take too long because they are nonsensical (e.g. deleting an N inside a huge N region of the genomic reference sequence; see "Provide way to stop normalization if the expression is obviously problematic (such as deletions in large gap/unknown regions)", ga4gh/vrs-python#397).
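
For illustration, the gs:// caching and output path rules described above could be sketched as follows. This is a minimal sketch, not the code from this PR: the function names `local_cache_path` and `output_path` and their default arguments are hypothetical, and only the path layout is taken from the description.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_cache_path(gs_url: str, cache_root: str = "buckets") -> str:
    """Map gs://<bucket>/<blob> to <cache_root>/<bucket>/<blob> (hypothetical helper)."""
    parsed = urlparse(gs_url)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URL: {gs_url!r}")
    # netloc is the bucket name; path is the blob name with a leading slash.
    blob = parsed.path.lstrip("/")
    return str(PurePosixPath(cache_root) / parsed.netloc / blob)


def output_path(cached_path: str, output_root: str = "output") -> str:
    """Place the output under output_root at the same relative path as the cached input."""
    return str(PurePosixPath(output_root) / cached_path)


if __name__ == "__main__":
    cached = local_cache_path("gs://clinvar-gk-pilot/2024-04-07/dev/vi.json.gz")
    print(cached)
    print(output_path(cached))
```

A download step would then check whether the cached path already exists before fetching the blob again.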

@theferrit32 theferrit32 added the enhancement New feature or request label Apr 30, 2024
@theferrit32 theferrit32 self-assigned this Apr 30, 2024
@theferrit32 theferrit32 force-pushed the multiprocess-and-timeouts branch from 5dba6e0 to cead782 Compare April 30, 2024 18:18
@theferrit32
Contributor Author

Ran this with the current args at the bottom of main.py and it finished the ~2.8 million variants in ~21 minutes, skipping 47 variants that took longer than 10 seconds.

@theferrit32
Contributor Author

Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_1.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280054
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_2.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_3.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_4.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_5.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_6.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_7.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_8.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_9.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Writing output from buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz.part_10.out to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Lines written: 280053
Output written to output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz
Output uploaded to gs://clinvar-gk-pilot/2024-04-07/dev/vi-output.json.gz
python clinvar_gk_pilot/main.py 2>&1  5730.73s user 2977.07s system 629% cpu 23:02.91 total
tee log  0.00s user 0.01s system 0% cpu 23:02.96 total
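
The line counts in the log above (one part with 280054 lines, nine with 280053) are consistent with splitting the input as evenly as possible and giving any remainder to the first parts. A minimal sketch of that arithmetic, assuming one output line per input line; `partition_sizes` and `split_lines` are hypothetical names, not functions from this PR:

```python
from typing import Iterator, List, Sequence


def partition_sizes(total_lines: int, parts: int) -> List[int]:
    """Sizes of `parts` partitions that differ by at most one line."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(total_lines, parts)
    # The first `extra` partitions each take one of the leftover lines.
    return [base + 1 if i < extra else base for i in range(parts)]


def split_lines(lines: Sequence[str], parts: int) -> Iterator[Sequence[str]]:
    """Yield consecutive chunks of `lines` with the sizes computed above."""
    start = 0
    for size in partition_sizes(len(lines), parts):
        yield lines[start:start + size]
        start += size


if __name__ == "__main__":
    sizes = partition_sizes(2_800_531, 10)
    print(sizes[0], sizes[1], sum(sizes))
```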

Errors due to task timeout:

zgrep -rn "errors" output/buckets/clinvar-gk-pilot/2024-04-07/dev/vi.json.gz | grep "did not complete" | wc -l
5
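
The same count can be taken by parsing the output instead of grepping it. The sketch below assumes each output line is a JSON object and that timed-out tasks carry an `errors` string containing "did not complete", as in the error record shown in this thread; the function names are hypothetical. Because it checks the parsed `errors` field rather than the raw line, it is slightly stricter than the zgrep pipeline.

```python
import gzip
import json
from typing import Iterable


def count_timeout_errors(lines: Iterable[str]) -> int:
    """Count NDJSON records whose "errors" field reports a task timeout."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        errors = record.get("errors") if isinstance(record, dict) else None
        if isinstance(errors, str) and "did not complete" in errors:
            count += 1
    return count


def count_timeout_errors_in_gzip(path: str) -> int:
    """Same count, reading a gzip-compressed NDJSON file line by line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_timeout_errors(handle)
```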

@toneillbroad
Contributor

I identified four variants in long unknown regions whose processing times were causing "runaway" processing. 47 seems too large a number.

{"variation_id":"1687628","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2831_2832dup","precedence":"4","variation_type":"Duplication","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1687107","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.5185del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691679","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.2897_2953del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}
{"variation_id":"1691680","name":"Single allele","assembly_version":"37","accession":"NC_000015.9","vrs_class":"Allele","range_copies":[],"fmt":"hgvs","source":"NC_000015.9:g.7211_7214del","precedence":"4","variation_type":"Deletion","subclass_type":"SimpleAllele","cytogenetic":"15p13","chr":"15","mappings":[]}

@theferrit32
Contributor Author

Thanks for the info on those, @toneillbroad. With a 1 minute timeout I got those same 4, which is good validation, plus 1 other one, variation_id 11668.

2565051:{"errors": "Task did not complete in 60 seconds.", "line": "{\"variation_id\":\"11668\",\"name\":\"NM_004586.3(RPS6KA3):c.1444_1959dup (p.Val482_Lys653dup)\",\"accession\":\"NG_007488.1\",\"vrs_class\":\"Allele\",\"range_copies\":[],\"fmt\":\"hgvs\",\"source\":\"NG_007488.1:g.103742_114797dup\",\"precedence\":\"5\",\"variation_type\":\"Duplication\",\"subclass_type\":\"SimpleAllele\",\"cytogenetic\":\"Xp22.2-p22.1\",\"mappings\":[]}\n"}

I'm not sure why this one took longer than a minute; the reference sequence is only 515 bases.

@theferrit32 theferrit32 merged commit 2b1fca7 into main May 7, 2024
2 checks passed
@theferrit32 theferrit32 deleted the multiprocess-and-timeouts branch May 7, 2024 16:32