
Conversation

tedsharpe
Contributor

No description provided.

@SHuang-Broad SHuang-Broad self-requested a review September 27, 2023 16:13
Collaborator

@SHuang-Broad SHuang-Broad left a comment


Hey @tedsharpe, thanks for the cleanup!
I have some questions and think some small changes are needed.

Let me know!
BTW, the automated tests are catching these errors:
https://github.com/broadinstitute/long-read-pipelines/actions/runs/6225233549/job/16895206594?pr=422#step:7:312


# install conda packages
COPY ./environment.yml /
RUN conda install -n base conda-libmamba-solver && conda config --set solver libmamba
Collaborator


I'm not familiar with how mamba helps with dependency resolution, in particular where it should be installed.
But if you look at the environment.yml, it's trying to create an env named lr-metrics, while here you're installing mamba into the base env.
Is this usually what people do?

Contributor Author


This has been reverted, so it's no longer relevant, but here's some info for posterity.
I tried to build this Docker image on a desktop with 16 GB of memory. No dice. If I was lucky it just got OOM'd after a few hours. So I ran out to Micro Center and bought 32 GB of memory. After leaving it overnight it was still trying to solve the environment. Adding libmamba let the image build promptly (a few minutes, I don't remember exactly) while using little memory. I can't speak to the correctness of the resulting environments, but I'd think you'd want the solver in the base environment.
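
(For posterity, a minimal Dockerfile sketch of the pattern being discussed, assembled from the snippets quoted in this thread. The base image is an assumption; the conda commands and the lr-metrics env come from the Dockerfile under review.)

# Sketch only: the base image is an assumption, not the actual lr-metrics Dockerfile.
FROM continuumio/miniconda3

# install conda packages
COPY ./environment.yml /

# install the libmamba solver into the base env and make it the default solver;
# dependency resolution for environment.yml then runs quickly and with modest memory
RUN conda install -n base conda-libmamba-solver && conda config --set solver libmamba

# create the lr-metrics env described by environment.yml, then clean caches
RUN conda env create -f /environment.yml && conda clean -a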

RUN conda env create -f /environment.yml && conda clean -a

# install gatk
# install super-special version of gatk
Collaborator


I've created #427 to help us track this.

Contributor Author


Current version eliminates it.

preemptible_tries: 2,
max_retries: 1,
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.1"
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.2"
Collaborator


I know we have 0.3.2 on GCR, but given that the main branch is using 0.3.1, we can do one of two things here:

  • not update the version here
  • find out what changed between 0.3.1 and 0.3.2. I see you included several more packages in lr-mosdepth, so if you built that, then let's also bump the version in the makefile in that docker folder.

Contributor Author


Makefile version bumped.

preemptible_tries: 2,
max_retries: 1,
docker: "us.gcr.io/broad-dsp-lrma/lr-metrics:0.1.11"
docker: "us.gcr.io/broad-dsp-lrma/lr-mosdepth:0.3.2"
Collaborator


Similarly, mosdepth's Docker image is being updated here. Can we look into what changed in 0.3.2?

Contributor Author


I just added a couple of packages so that we didn't have to needlessly break this task in two.

set -euxo pipefail
java -jar /usr/local/bin/gatk.jar ComputeLongReadMetrics -I ~{bam} -O ~{basename}.read_metrics -DF WellformedReadFilter
samtools flagstat ~{bam} > ~{basename}.flag_stats.txt
Collaborator


This is great.
Should it have its own task, though? I know it's always a balance (localization vs. separation of concerns).

Contributor Author


There's an astounding amount of overhead in invoking a task: provisioning the VM, downloading the docker image, checking the call cache database, localizing input files, etc., etc., etc. It's especially true of small tasks. For example, in this workflow the task MakeChrIntervalList (an extreme example, I admit) runs for 2 minutes; of this, the useful work (UserAction) takes 1.36 seconds. I'm voting for a rule that says, "You combine all the operations that process the same data into a single task unless it's a very long-running task that might escape preemption or significantly reduce the workflow's wall-clock time by being parallelized."
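
(For illustration, a minimal WDL sketch of the combined task being discussed. The task name, output declarations, and disk-free runtime block are assumptions; the two commands are the ones quoted in this thread, and the image and retry values mirror numbers quoted elsewhere in this review.)

task ComputeReadMetricsAndFlagStats {
    input {
        File bam
        String basename
    }

    command <<<
        set -euxo pipefail
        java -jar /usr/local/bin/gatk.jar ComputeLongReadMetrics -I ~{bam} -O ~{basename}.read_metrics -DF WellformedReadFilter
        samtools flagstat ~{bam} > ~{basename}.flag_stats.txt
    >>>

    output {
        # the exact output file naming of ComputeLongReadMetrics is assumed here
        Array[File] read_metrics = glob("~{basename}.read_metrics*")
        File flag_stats = "~{basename}.flag_stats.txt"
    }

    runtime {
        docker: "us.gcr.io/broad-dsp-lrma/lr-metrics:0.1.11"
        preemptible: 2
        maxRetries: 1
    }
}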

RUN wget -O /usr/local/bin/picard.jar https://github.com/broadinstitute/picard/releases/download/2.22.1/picard.jar

# install gsutil
RUN curl -sSL https://sdk.cloud.google.com | bash
Collaborator


Unrelated to this command, just a convention in this repo.

We usually bump the version in the companion Makefile when the Dockerfile or environment file changes.

Contributor Author


Reverted.


import "../../structs/Structs.wdl"

workflow MergeVCFs {
Collaborator


Is this taken from the GATK-SV pipeline?

And is this integrated into other pipelines in this repo?
I checked out your branch, deleted that file, and validated all the pipelines. All WDLs validated, so I suspect it is not integrated.

If it is indeed not integrated into other pipelines here, do you mind making a separate PR?

Contributor Author


This was something I wrote for the GMKF data import. It doesn't belong in this PR. I'll remove it.

@tedsharpe
Contributor Author

Thanks for the review, Steve. I think we're probably ready for another look.
