Commit 2b1fe60

committed: more docs
1 parent 3f6fd08 commit 2b1fe60

2 files changed: +228 −22 lines changed

docs/configuration.md

+226 −21
@@ -4,8 +4,29 @@ container engine:
- Apptainer >= 1.2
- SingularityCE >= 3.11

To detect which of these is available on your HPC, execute `apptainer --version` or
`singularity --version` in a shell on a login or compute node. Note that on some systems,
the container runtime is packaged as a module, in which case you first have to load it
before it becomes available in your shell. Check your HPC's documentation for more
information.
If none of these are available, contact your system administrators or set up psiflow
[manually](#manual-setup). Otherwise, proceed with the following steps.

We provide two versions of essentially the same container: one for Nvidia GPUs (based on a
PyTorch wheel for CUDA 11.8) and one for AMD GPUs (based on a PyTorch wheel for ROCm 5.6).
These images are hosted on the GitHub Container Registry (abbreviated as `ghcr`) and can
be downloaded and cached directly by the container runtime.
For example, if we wish to execute a simple command `ls` using the container image for
Nvidia GPUs, we would write:
```bash
apptainer exec oras://ghcr.io/molmod/psiflow:main_cu118 ls
```
We use `psiflow:main_cu118` to get the image which was built from the latest `main` branch
of the psiflow repository, for CUDA 11.8.
Similarly, for AMD GPUs and, for example, psiflow v4.0.0-rc0, we would use
```bash
apptainer exec oras://ghcr.io/molmod/psiflow:4.0.0-rc0_rocm5.6 ls
```
See the [Apptainer](https://apptainer.org/docs/user/latest/)/[SingularityCE](https://docs.sylabs.io/guides/4.1/user-guide/) documentation for more information.
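The detection step described above can also be sketched programmatically; the helper below is purely illustrative (not part of psiflow) and simply checks which engine is on `PATH`:

```python
import shutil


def detect_container_engine():
    """Return the first container engine found on PATH, or None.

    Note: on clusters where the runtime ships as a module, it only
    appears on PATH after something like `module load apptainer`.
    """
    for engine in ("apptainer", "singularity"):
        if shutil.which(engine) is not None:
            return engine
    return None


print(detect_container_engine())
```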

## Python environment
The main Python script which defines the workflow requires a Python 3.10 / 3.11 environment
@@ -35,14 +56,22 @@ with a recent version of `pip` and [`ndcctools`](https://github.com/cooperative-
pip install git+https://github.com/molmod/[email protected]
```
_Everything else_ -- i-PI, CP2K, GPAW, Weights & Biases, PLUMED, ... -- is handled by
the container images and hence need not be installed manually.

- **(with `virtualenv`/`venv`)**: create a new environment and install psiflow from github
  using the same command as above. In addition, you will have to compile and install the `cctools`
  package manually. See the
  [documentation](https://cctools.readthedocs.io/en/stable/install/) for the appropriate
  instructions.

Verify the correctness of your environment using the following commands:

```bash
python -c 'import psiflow'  # python import should work
which work_queue_worker     # tests whether ndcctools is available and on PATH
```

## Execution
Psiflow scripts are executed as a simple Python process.
@@ -51,18 +80,54 @@ execute the calculations asynchronously and as fast as possible.
To achieve this, it automatically requests the compute resources it needs during
execution.

To make this work, it is necessary to define precisely (i) how ML potential training, molecular
dynamics, and QM calculations should proceed, and (ii)
how the required resources for those calculations should be obtained.
These additional parameters are to be specified in a separate 'configuration' `.yaml` file, which is
passed into the main Python workflow script as an argument.
The configuration file has a specific structure which is explained in the following
sections. In many cases, you will be able to start from one of the [example
configurations](https://github.com/molmod/psiflow/tree/main/configs)
in the repository and adapt it for your cluster.
We also suggest going through Parsl's [documentation on
execution](https://parsl.readthedocs.io/en/stable/userguide/execution.html) first as this
will improve your understanding of what follows.

There are three types of calculations:

- **ML potential training** (`ModelTraining`)
- **ML potential inference, i.e. molecular dynamics** (`ModelEvaluation`)
- **QM calculations** (`CP2K`, `GPAW`, `ORCA`)

and the structure of a typical `config.yaml` consequently looks like this:
```yaml
# top-level options define the overall behavior
# see below for a full list
container_engine: <singularity or apptainer>
container_uri: <link to container, i.e. oras://ghcr.io/...>

ModelTraining:
  # specifies how ML potential training should be performed
  # and which resources it needs to use

ModelEvaluation:
  # specifies how MD / geometry optimization / hamiltonian computations are performed
  # and which resources it needs to use

CP2K:
  # specifies how CP2K single points need to be performed

GPAW:
  # specifies how GPAW single points need to be performed

ORCA:
  # specifies how ORCA single points need to be performed
```

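To make this structure concrete, here is a hypothetical sanity check (not part of psiflow; the section and option names are taken from the documentation above) that flags unrecognized top-level keys in an already-parsed config:

```python
# Hypothetical helper: flag top-level keys psiflow would not recognize.
# Section and option names come from the documentation above.
KNOWN_SECTIONS = {"ModelTraining", "ModelEvaluation", "CP2K", "GPAW", "ORCA"}
KNOWN_OPTIONS = {"container_engine", "container_uri"}


def unknown_keys(config):
    """Return the set of unrecognized top-level keys in a parsed config."""
    return set(config) - KNOWN_SECTIONS - KNOWN_OPTIONS


config = {
    "container_engine": "apptainer",
    "container_uri": "oras://ghcr.io/molmod/psiflow:main_cu118",
    "ModelTraining": {"cores_per_worker": 12, "gpu": True},
}
print(sorted(unknown_keys(config)))  # → []
```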
### 1. ML potential training
This defines how `model.train()` operations are performed. Since
training is necessarily performed on a GPU, it is necessary to specify resources in
which a GPU is available. Consider the following simple training example
```py
@@ -94,11 +159,16 @@ Then we can execute the script using the following command:
python train.py config.yaml
```
The `config.yaml` file should define how and where the model should be trained and
evaluated.

Next, we define how model training should be performed.
Internally, Parsl will use that information to construct the appropriate
SLURM jobscripts, send them to the scheduler, and once the resources are allocated,
start the calculation. For example, assume that the GPU partition on this cluster is
named `infinite_a100`, and it has 12 cores per GPU. Consider the following config
```yaml
container_engine: apptainer  # or singularity; check your HPC's docs to see which one is available
container_uri: oras://ghcr.io/molmod/psiflow:main_cu118  # built from the github main branch
ModelTraining:
  cores_per_worker: 12
  gpu: true
@@ -112,7 +182,7 @@ ModelTraining:
    scheduler_options: "#SBATCH --gpus=2"
```
The top-level keyword `ModelTraining` indicates that we're defining the execution of
`model.train()`. It has a number of special keywords:

- **cores_per_worker** (int): number of CPUs per GPU.
- **gpu** (bool): whether to use GPU(s) -- should almost always be true for training.
@@ -148,7 +218,7 @@ There exist a few additional keywords for `ModelTraining` which might be useful:
  OMP_PROC_BIND: spread
```

### 2. molecular dynamics
Consider the following example:
```py
import psiflow
@@ -161,14 +231,9 @@ def main():
    mace = MACEHamiltonian.mace_mp0()
    start = Geometry.load('start.xyz')

    walkers = Walker(mace, temperature=300).multiply(8)

    outputs = sample(walkers, steps=int(1e9), step=10)  # extremely long
    for i, output in enumerate(outputs):
        output.trajectory.save(f'{i}.xyz')
@@ -179,5 +244,145 @@ if __name__ == '__main__':
```
In this example, we use MACE-MP0 to run 8 molecular dynamics simulations in the NVT
ensemble. Since they are all independent from each other, psiflow will attempt to execute
them in parallel as much as possible.
The configuration section which deals with ML potential inference, including molecular
dynamics but also geometry optimization and `hamiltonian.compute()` calls, is named
`ModelEvaluation`:

```yaml
ModelEvaluation:
  cores_per_worker: 12
  gpu: true
  slurm:
    partition: "infinite_a100"
    account: "112358"
    nodes_per_block: 2
    cores_per_node: 48  # full node; sometimes granted faster than partials
    max_blocks: 1
    walltime: "01:00:00"  # small to try and skip the queue
    scheduler_options: "#SBATCH --gpus=4"
```
It is generally quite similar to `ModelTraining`. Because psiflow workflows typically
contain a large number of molecular dynamics simulations, it makes sense to ask for larger
allocations for each block (= SLURM job). In this example, we immediately ask for two full
GPU nodes, with four GPUs each. This is exactly the amount we need to execute all eight
molecular dynamics simulations in parallel, without wasting any resources.
As such, when we execute the above example using `python script.py config.yaml`, Parsl
will recognize that we need resources for eight simulations, ask for precisely one allocation
according to the above parameters, and start all eight simulations simultaneously.

Of course, we greatly overestimate the number of steps we wish to simulate.
The SLURM allocation has a walltime of one hour, which means that if a simulation does not
finish within that hour, it will be gracefully terminated and the saved trajectories will only
cover a fraction of the requested one billion steps.
Psiflow will not automatically continue the simulations on a new SLURM allocation.

The available keywords in the `ModelEvaluation` section are the same as for
`ModelTraining`, except for one:

- **max_simulation_time** (float, in minutes): maximum duration of a single simulation.
  When this limit is reached, the simulation is gracefully terminated and the trajectory
  up to that point is retained.

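For instance, to cap each simulation safely below the block walltime from the earlier example (the value here is illustrative, not a recommendation):

```yaml
ModelEvaluation:
  cores_per_worker: 12
  gpu: true
  max_simulation_time: 50  # minutes; slightly below the 1-hour block walltime
```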
### 3. QM calculations
Finally, we need to specify how QM calculations are performed.
By default, these calculations are not executed within the container image provided by
`container_uri` at the top level.
Users can choose to rely on their system-installed QM software or employ one of the
smaller and specialized container images for CP2K or GPAW. We will discuss both cases
below.

First, assume we wish to use a system-installed CP2K module, and execute each singlepoint
on 32 cores. Assume that the nodes in our CPU partition possess 128 cores:
```yaml
CP2K:
  cores_per_worker: 32
  max_evaluation_time: 30  # kill calculation after 30 mins; SCF unconverged
  launch_command: "OMP_NUM_THREADS=1 mpirun -np 32 cp2k.psmp"  # force 1 thread/rank
  slurm:
    partition: "infinite_CPU"
    account: "112358"
    nodes_per_block: 16
    cores_per_node: 128
    max_blocks: 1
    walltime: "12:00:00"
    worker_init: "ml CP2K/2024.1"  # activate CP2K module in jobscript!
```
We asked for a big allocation of 16 nodes, each with 128 cores. On each node, psiflow can
concurrently execute four singlepoints, since we specified `cores_per_worker: 32`.
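This worker count follows directly from the allocation parameters; as a quick back-of-the-envelope check (plain arithmetic, not psiflow code):

```python
# Concurrency implied by the CP2K section above (illustrative arithmetic).
nodes_per_block = 16
cores_per_node = 128
cores_per_worker = 32

workers_per_node = cores_per_node // cores_per_worker  # 128 // 32
total_workers = nodes_per_block * workers_per_node
print(workers_per_node, total_workers)  # → 4 64
```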
Consider now the following script:
```py
import psiflow
from psiflow.data import Dataset
from psiflow.reference import CP2K


def main():
    unlabeled = Dataset.load('long_trajectory.xyz')

    with open('cp2k_input.txt', 'r') as f:
        cp2k_input = f.read()
    cp2k = CP2K(cp2k_input)

    labeled = unlabeled.evaluate(cp2k)
    labeled.save('labeled.xyz')


if __name__ == '__main__':
    with psiflow.load():
        main()
```
Assume `long_trajectory.xyz` is a large XYZ file with, say, 1,000 snapshots.
In the above script, we simply load the data, evaluate the energy and forces of each
snapshot with CP2K, and save the result as (ext)XYZ.
Again, we execute this script by running `python script.py config.yaml` within a Python
environment with psiflow and cctools available.
Even though all of these calculations can proceed in parallel, we specified `max_blocks: 1`
to not overload our resource usage.
As such, Parsl will request precisely one block/allocation of 16 nodes, and start
executing the singlepoint QM evaluations.
At any given moment, there will be (16 nodes x 4 calculations/node = ) 64 calculations
running.

Now assume our system administrators did not provide us with the latest and greatest
version of CP2K.
The installation process is quite long and tedious (even via tools like EasyBuild or Spack),
which is why psiflow provides **small containers which only contain the QM software**.
They are separate from the psiflow containers mentioned before in order to improve
modularity and reduce individual container sizes.
At the moment, such containers are available for CP2K 2024.1 and GPAW 24.1.
To use them, it suffices to wrap the launch command inside an `apptainer` or `singularity`
invocation, whichever is available on your system:
```yaml
CP2K:
  cores_per_worker: 32
  max_evaluation_time: 30  # kill calculation after 30 mins; SCF unconverged
  launch_command: "apptainer exec -e --no-init oras://ghcr.io/molmod/cp2k:2024.1 /opt/entry.sh mpirun -np 32 cp2k.psmp"
  slurm:
    partition: "infinite_CPU"
    account: "112358"
    nodes_per_block: 16
    cores_per_node: 128
    max_blocks: 1
    walltime: "12:00:00"
    # no more need for module load commands!
```
The command is quite long but largely self-explanatory if you're somewhat familiar with
containers.

## SLURM quickstart

Psiflow contains a small script which detects the available SLURM partitions and their
hardware and creates a minimal, initial `config.yaml` which you can use as a starting point
to further tune to your liking. To use it, simply activate your psiflow Python environment
and execute the following command:

```sh
python -c 'import psiflow; psiflow.setup_slurm_config()'
```

## manual setup
TODO

docs/index.md

+2 −1
@@ -12,7 +12,8 @@ It supports:
- **quantum mechanical calculations** at various levels of theory (GGA and hybrid DFT, post-HF methods such as MP2 or RPA, and even coupled cluster; using CP2K | GPAW | ORCA)

- **trainable interaction potentials** as well as easy-to-use universal potentials, e.g. [MACE-MP0](https://arxiv.org/abs/2401.00096)
- a wide range of **sampling algorithms**: NVE | NVT | NPT, path-integral molecular dynamics, alchemical replica exchange, metadynamics, phonon-based sampling, thermodynamic integration; using [i-PI](https://ipi-code.org/), [PLUMED](https://www.plumed.org/), ...

Users may define arbitrarily complex workflows and execute them **automatically** on local, HPC, and/or cloud infrastructure.
To achieve this, psiflow is built using [Parsl](https://parsl-project.org/): a parallel execution library which manages job submission and workload distribution.
