Update DKFZ configuration and add GPU support #1125
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,33 +1,89 @@ | ||
| // Institutional profile for the DKFZ / ODCF LSF cluster. | ||
|
|
||
| params { | ||
| config_profile_description = 'Deutsches Krebsforschungszentrum (DKFZ) HPC cluster profile provided by nf-core/configs' | ||
| config_profile_contact = 'Kübra Narcı kuebra.narci@dkfz-heidelberg.de' | ||
| config_profile_name = 'DKFZ cluster' | ||
| config_profile_description = 'Deutsches Krebsforschungszentrum (DKFZ) ODCF HPC cluster profile' | ||
| config_profile_contact = 'Abid Abrar (abid.abrar@dkfz-heidelberg.de), Kübra Narcı (kuebra.narci@dkfz-heidelberg.de)' | ||
| config_profile_name = 'DKFZ Cluster' | ||
| config_profile_url = 'https://www.dkfz.de' | ||
|
|
||
| max_cpus = 64 | ||
| max_memory = '1000.GB' | ||
| max_time = '720.h' | ||
|
|
||
| max_cpus = 30 | ||
| max_memory = '250.GB' | ||
| max_time = '48.h' | ||
| // GPU queue for GPU jobs (options: gpu (default), gpu-lowprio, gpu-pro) | ||
| dkfz_gpu_queue = 'gpu' | ||
|
|
||
| } | ||
|
|
||
|
|
||
| singularity { | ||
|
|
||
| apptainer { | ||
| enabled = true | ||
| autoMounts = true | ||
| } | ||
|
|
||
| // Ignore the custom dkfz_gpu_queue param in nf-schema validation | ||
| validation.ignoreParams = ['dkfz_gpu_queue'] | ||
|
|
||
| process { | ||
| executor = 'lsf' | ||
| scratch = '$CLUSTER_SCRATCHDIR' | ||
|
|
||
| // Retry transient failures: no exit status, signals 130–145 (137 = OOM/preempt), 104/255 (I/O drops) | ||
| errorStrategy = { (task.exitStatus == null || task.exitStatus == Integer.MAX_VALUE || task.exitStatus in ((130..145) + [104, 255])) ? 'retry' : 'finish' } | ||
| maxRetries = 3 | ||
| cache = 'lenient' | ||
|
|
||
| // Cap every task to the cluster ceiling: 64 cores, 1000 GB RAM, 720 h (30 day) wall time | ||
| resourceLimits = [ | ||
| memory: 250.GB, | ||
| cpus: 30, | ||
| time: 48.h | ||
| cpus : 64, | ||
| memory: 1000.GB, | ||
| time : 720.h, | ||
| ] | ||
| executor = 'lsf' | ||
| scratch = '$SCRATCHDIR/$LSB_JOBID' | ||
|
|
||
| // Low defaults for unlabelled processes | ||
| cpus = 1 | ||
| memory = 6.GB | ||
| time = 10.min | ||
|
|
||
| // GPU tasks go to a GPU queue; everything else to a CPU queue by time/memory. | ||
| queue = { | ||
| if (task.accelerator) { | ||
| return params.dkfz_gpu_queue | ||
| } else if (task.memory && task.memory > 200.GB) { | ||
| return 'highmem' | ||
| } else if (!task.time || task.time <= 10.min) { | ||
| return 'short' | ||
| } else if (task.time <= 1.h) { | ||
| return 'medium' | ||
| } else if (task.time <= 10.h) { | ||
| return 'long' | ||
| } else { | ||
| return 'verylong' | ||
| } | ||
| } | ||
|
|
||
| // GPU request, driven by `accelerator`: an nf-core `process_gpu` task run without | ||
| // `-profile gpu` has no accelerator, so it stays on CPU. | ||
| // `j_exclusive=yes` is mandatory on this cluster. | ||
| // An optional `ext.gpu_memory` pins the job to GPUs with at least that much VRAM. | ||
| clusterOptions = { | ||
| if (!task.accelerator) { | ||
| return null | ||
| } | ||
| def gpu = "-gpu num=${task.accelerator.request}:j_exclusive=yes" | ||
| if (task.ext.gpu_memory) { | ||
| gpu += ":gmem=${task.ext.gpu_memory.toGiga()}G" | ||
| } | ||
| return gpu | ||
| } | ||
|
|
||
| // Bind /omics into every container; add --nv for GPU tasks. | ||
| containerOptions = { task.accelerator ? '--bind /omics --nv' : '--bind /omics' } | ||
| } | ||
|
|
||
| executor { | ||
| name = 'lsf' | ||
| perTaskReserve = false | ||
| perJobMemLimit = true | ||
| perTaskReserve = false | ||
| queueSize = 10 | ||
| submitRateLimit = '3 sec' | ||
| submitRateLimit = '1 sec' | ||
| exitReadTimeout = '30 min' | ||
| } | ||
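The `queue` closure in the config above checks its conditions in a fixed order: accelerator first, then memory, then wall time. As a rough illustration of that precedence only, here is a minimal bash sketch. The function name `select_queue` and its argument convention (whole minutes, whole GB, GPU count, GPU queue name) are hypothetical and not part of the profile; the real closure compares Nextflow duration and memory objects rather than integers.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of dkfz.config): mirrors the order of checks
# in the profile's `queue` closure, using plain integers.
#   $1  requested time in whole minutes ("" = no time given)
#   $2  requested memory in whole GB (default 0)
#   $3  number of requested GPUs (default 0 = no accelerator)
#   $4  GPU queue name (default "gpu", like params.dkfz_gpu_queue)
select_queue() {
    local time_min="${1:-}" mem_gb="${2:-0}" gpus="${3:-0}" gpu_queue="${4:-gpu}"
    if (( gpus > 0 )); then
        echo "$gpu_queue"          # GPU tasks always win, whatever time/memory they ask for
    elif (( mem_gb > 200 )); then
        echo "highmem"             # memory is checked before time
    elif [[ -z "$time_min" ]] || (( time_min <= 10 )); then
        echo "short"
    elif (( time_min <= 60 )); then
        echo "medium"
    elif (( time_min <= 600 )); then
        echo "long"
    else
        echo "verylong"
    fi
}

select_queue 45 8          # prints "medium"
select_queue 45 250        # prints "highmem"
select_queue 45 250 1      # prints "gpu"
```

One consequence of this ordering, visible in the last two calls, is that a task requesting more than 200 GB goes to `highmem` regardless of its wall time, and a task with an accelerator goes to the GPU queue regardless of its memory request.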
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,15 +1,148 @@ | ||
| # nf-core/configs: DKFZ configuration | ||
|
|
||
| This configuration specifies hardware and infrastructure access at the Deutsches Krebsforschungszentrum (DKFZ) HPC cluster in Heidelberg / Germany. | ||
| To use, run the pipeline with `-profile dkfz`. This will download and launch the [`dkfz.config`](../conf/dkfz.config), pre-configured for the Deutsches Krebsforschungszentrum (DKFZ) / ODCF LSF cluster in Heidelberg, Germany. | ||
|
|
||
| To use, run the pipeline with `-profile dkfz`. This will download and launch the dkfz.config which has been pre-configured with a setup suitable for the cluster. Using this profile, either Singularity containers are pulled from public repositories or a Docker image containing all of the required software will be downloaded, and converted to a Singularity image before execution of the pipeline. | ||
| This configuration is tested with Nextflow 25.10.0 (available on the cluster as a module). | ||
|
|
||
| > :warning: Before using the configuration, add Singularity environment options (SINGULARITY_CACHEDIR and SINGULARITY_LIBRARYDIR) to `env` | ||
| The profile only configures the cluster itself (LSF executor, dynamic queue selection, scratch, resource limits and the `/omics` bind-mount). Pick a container engine on the command line, e.g. `-profile dkfz,apptainer` or `-profile dkfz,conda`. | ||
|
|
||
| > :warning: Before running the pipeline you will need to load Nextflow using the environment module system. Please check the main README of the pipeline to make sure that the version of Nextflow is compatible with that required to run the pipeline. You can do this by issuing the commands below: | ||
| > :warning: **Use Apptainer/Singularity (or Conda), not Docker.** On the ODCF cluster Docker is only available through LSF's `docker-generic` application profile. Nextflow's `docker` executor runs `docker run` directly on the node, which this setup does not allow, so `-profile dkfz,docker` will not work. Use `-profile dkfz,apptainer` instead. | ||
|
|
||
| ## Before you use this profile | ||
|
|
||
| 1. Load Nextflow via the environment module system on a submission host. Check the pipeline's README for the required Nextflow version: | ||
|
|
||
| ```bash | ||
| module load Nextflow/25.10.0 | ||
| ``` | ||
|
|
||
| 2. Submit from a submission host (`bsub01.lsf.dkfz.de` / `bsub02.lsf.dkfz.de`). Do **not** run heavy work on the login/worker nodes. Wrap the Nextflow driver itself in a `bsub` job (see below). | ||
|
|
||
| 3. The shared `/omics` filesystem is bind-mounted into every container automatically. If your inputs or references live elsewhere, point `NXF_APPTAINER_CACHEDIR` / `NXF_SINGULARITY_CACHEDIR` at a path under `/omics` so images are cached on shared storage: | ||
|
|
||
| ```bash | ||
| export NXF_APPTAINER_CACHEDIR=/omics/groups/<your-group>/.../apptainer_cache | ||
| ``` | ||
|
|
||
| ## Queues | ||
|
|
||
| Queue selection is automatic, based on each task's requested `time` and `memory`: | ||
|
|
||
| | Queue | Selected when | Limit | | ||
| | ---------- | ----------------------------------- | ----------------- | | ||
| | `short` | no time given, or `time <= 10.min` | 10 min | | ||
| | `medium` | `time <= 1.h` | 1 hour | | ||
| | `long` | `time <= 10.h` | 10 hours | | ||
| | `verylong` | `time > 10.h` | no hard limit | | ||
| | `highmem` | `memory > 200.GB` | up to ~4 TB | | ||
|
|
||
| Note: `highmem` is the only queue that accepts requests above 200 GB (and it rejects requests below 200 GB). | ||
|
|
||
| ## Resource limits, retries and containers | ||
|
|
||
| - Every task is capped to what the cluster can provide via `process.resourceLimits` (64 CPUs, 1000 GB memory, 720 h). Requests above these are capped automatically. | ||
| - Unlabelled processes default to a safe 1 CPU / 6 GB / 10 min. | ||
| - The shared `/omics` filesystem is bound into every container via `containerOptions`, with `--nv` added for accelerator tasks. **If one of your modules sets its own `containerOptions`, re-add `--bind /omics` there.** | ||
|
|
||
| ## Enable GPU support | ||
|
|
||
| This profile turns any task that requests a GPU through Nextflow's standard [`accelerator` directive](https://www.nextflow.io/docs/latest/reference/process.html#accelerator) into a correct DKFZ GPU submission. It selects the GPU queue, builds the LSF `-gpu num=<n>:j_exclusive=yes[:gmem=<n>G]` request, and adds `--nv` so the GPU is visible inside the container. | ||
|
|
||
| How a task acquires an `accelerator` request depends on the pipeline: | ||
|
|
||
| - **nf-core pipelines** mark GPU-capable processes with the `process_gpu` label and only switch the accelerator on when the run includes the `gpu` profile. So add `gpu` to your profile list: | ||
|
|
||
| ```bash | ||
| nextflow run <pipeline> -profile dkfz,gpu,apptainer --input ... --outdir ... | ||
| ``` | ||
|
|
||
| - **Custom / non-nf-core pipelines** just declare `accelerator` on the GPU process: | ||
|
|
||
| ```nextflow | ||
| process MY_GPU_TASK { | ||
| accelerator 1 | ||
| container 'docker://nvcr.io/...' | ||
|
|
||
| script: | ||
| "my_gpu_tool ..." | ||
| } | ||
| ``` | ||
|
|
||
| ```bash | ||
| nextflow run main.nf -profile dkfz,apptainer --outdir ... | ||
| ``` | ||
|
|
||
| Tasks without an `accelerator` request are unaffected and run on the normal CPU queues. | ||
|
|
||
| ### Choosing the GPU queue | ||
|
|
||
| The `--dkfz_gpu_queue` parameter selects which GPU queue all GPU jobs are submitted to (default `gpu`): | ||
|
|
||
| - `gpu` — default (RTX 2080 Ti … V100/A100-DGX), 72 h wall time | ||
| - `gpu-lowprio` — same nodes as `gpu` but low priority; use for large job batches | ||
| - `gpu-pro` — high-end A100/H200/L40S/GH200, 142 h wall time — **requires a separate access application to the DKFZ Data Science Board** | ||
|
|
||
| ### Number of GPUs and GPU memory per process | ||
|
|
||
| The profile builds the LSF request as `-gpu num=<n>:j_exclusive=yes[:gmem=<n>G]` (DKFZ requires `j_exclusive=yes` and rejects `mode=exclusive_process`). Two things are tunable per process: | ||
|
|
||
| - **Number of GPUs** — the [`accelerator` directive](https://www.nextflow.io/docs/latest/reference/process.html#accelerator) (default 1). | ||
| - **GPU memory (optional)** — set `ext.gpu_memory` to a Nextflow memory value to pin the job to GPUs with at least that much VRAM. When `ext.gpu_memory` is unset, `gmem` is omitted and LSF assigns any free GPU. | ||
|
|
||
| Approximate values to target each GPU tier (request at or just below the card's usable VRAM): | ||
|
|
||
| | `ext.gpu_memory` | Targets | Queue | | ||
| | ---------------- | ------------------------------- | ------------------ | | ||
| | `10.GB` | RTX 2080 Ti (11 GB) | `gpu` | | ||
| | `15.GB` | V100 16 GB | `gpu` | | ||
| | `23.GB` | TITAN RTX / Quadro RTX (24 GB) | `gpu` | | ||
| | `31.GB` | V100 32 GB | `gpu` | | ||
| | `40.GB` | A100 40 GB | `gpu-pro` only | | ||
| | `46.GB` | L40S | `gpu-pro` only | | ||
| | `98.GB` | GH200 | `gpu-pro` only | | ||
| | `141.GB` | H200 | `gpu-pro` only | | ||
|
|
||
| Set these directly on the process, or per process name from config (e.g. nf-core's `conf/modules.config`): | ||
|
|
||
| ```nextflow | ||
| process { | ||
| // 2 GPUs, any free GPU (no gmem constraint) | ||
| withName: 'FOO:BAR:ALIGN_GPU' { | ||
| accelerator = 2 | ||
| } | ||
| // 1 big-memory GPU | ||
| withName: 'FOO:BAR:FOLD' { | ||
| accelerator = 1 | ||
| ext.gpu_memory = 40.GB // -> A100/L40S/H200; also set --dkfz_gpu_queue gpu-pro | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| > :warning: Requesting `40.GB` or more only works on `gpu-pro`. On the plain `gpu` queue such a request hangs in `PEND` forever. Use at most 12 CPUs and ~45 GB host RAM per GPU (DKFZ GPU usage policy). | ||
|
|
||
| ## Running Nextflow on the cluster | ||
|
|
||
| Run the Nextflow driver inside an LSF job rather than on a submission host directly. Make a script and submit it with `bsub < my_script.sh`: | ||
|
|
||
| ```bash | ||
| module load Nextflow/21.04.0 | ||
| #!/bin/bash | ||
| #BSUB -J nf_pipeline | ||
| #BSUB -o nf_pipeline.%J.log | ||
| #BSUB -q long | ||
| #BSUB -n 2 | ||
| #BSUB -R "rusage[mem=8G]" | ||
| #BSUB -W 10:00 | ||
|
|
||
| module load Nextflow/25.10.0 | ||
|
|
||
| # Cache images on shared storage so worker nodes can reach them: | ||
| export NXF_APPTAINER_CACHEDIR=/omics/groups/<your-group>/.../apptainer_cache | ||
|
|
||
| nextflow run <pipeline> \ | ||
| -profile dkfz,apptainer \ | ||
| --input samplesheet.csv \ | ||
| --outdir results | ||
| ``` | ||
|
|
||
| > Note: All of the intermediate files required to run the pipeline will be stored in the work/ directory. It is recommended to delete this directory after the pipeline has finished successfully because it can get quite large, and all of the main output files will be saved in the results/ directory anyway. | ||
| Add `gpu` to `-profile` (e.g. `-profile dkfz,gpu,apptainer`) to send `process_gpu` tasks to a GPU queue. | ||
|
|
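The README describes the LSF GPU request as `-gpu num=<n>:j_exclusive=yes[:gmem=<n>G]`, with `gmem` present only when `ext.gpu_memory` is set. To preview what a given combination would look like, the following bash sketch assembles the same string. The function name `build_gpu_request` and its integer arguments are hypothetical; in the profile itself the string is built by the `clusterOptions` closure from `task.accelerator.request` and `task.ext.gpu_memory.toGiga()`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the profile): assembles the LSF `-gpu`
# request string the way the README describes it.
#   $1  number of GPUs (0 or empty = no accelerator, nothing is printed)
#   $2  optional GPU memory in whole GB (empty = no gmem constraint)
build_gpu_request() {
    local gpus="${1:-0}" gmem_gb="${2:-}"
    if (( gpus <= 0 )); then
        return 0                   # CPU task: no extra cluster options
    fi
    local req="-gpu num=${gpus}:j_exclusive=yes"
    if [[ -n "$gmem_gb" ]]; then
        req+=":gmem=${gmem_gb}G"   # pin to GPUs with at least this much VRAM
    fi
    printf '%s\n' "$req"           # printf avoids echo parsing the leading dash
}

build_gpu_request 2        # prints "-gpu num=2:j_exclusive=yes"
build_gpu_request 1 40     # prints "-gpu num=1:j_exclusive=yes:gmem=40G"
```

The second call corresponds to the `FOO:BAR:FOLD` example in the README (`accelerator = 1`, `ext.gpu_memory = 40.GB`), which per the README's warning also needs `--dkfz_gpu_queue gpu-pro` to be schedulable.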