Commit 92a73b7

sjpb, wtripp180901, and bertiethorpe authored
Add docs with sequence diagrams for operations (#456)
* add sequence docs
* Update docs/sequence.md
  Co-authored-by: wtripp180901 <[email protected]>
* Update docs/sequence.md
  Co-authored-by: wtripp180901 <[email protected]>
* Update docs/sequence.md
  Co-authored-by: wtripp180901 <[email protected]>
* Added release train to sequence diagram
* Add sequence for slurm controlled rebuild
* Update sequence.md
* Update sequence.md
* Update sequence.md
* make Pulp generic for build
* update for pulp, k3s, add boxes for clarity

---------

Co-authored-by: wtripp180901 <[email protected]>
Co-authored-by: bertiethorpe <[email protected]>
1 parent b07ecb0 commit 92a73b7

1 file changed

+133
-0
lines changed

docs/sequence.md

+133
@@ -0,0 +1,133 @@
# Slurm Appliance Sequences

## Image build

This sequence applies to both:
- "fatimage" builds, starting from GenericCloud images and using
  control,login,compute inventory groups to install all packages, e.g. StackHPC
  CI builds
- "extra" builds, starting from StackHPC images and using selected inventory
  groups to add specific features for a site-specific image.

Note that a generic Pulp server is shown in the diagram below. This may be
StackHPC's Ark server or a local Pulp mirroring Ark. It is assumed that a local
Pulp has already had the relevant snapshots synced from Ark (although it is
possible to trigger this during an image build).

Note that ansible-init does not run during an image build. It is disabled via
a metadata flag.

```mermaid
sequenceDiagram
    participant ansible as Ansible Deploy Host
    participant cloud as Cloud
    note over ansible: $ packer build ...
    ansible->>cloud: Create VM
    create participant packer as Build VM
    participant pulp as Pulp
    cloud->>packer: Create VM
    note over packer: Boot
    rect rgb(204, 232, 252)
        note right of packer: ansible-init
        packer->>cloud: Query metadata
        cloud->>packer: Metadata sent
        packer->>packer: Skip ansible-init
    end
    ansible->>packer: Wait for ssh connection
    rect rgb(204, 232, 252)
        note right of ansible: fatimage.yml
        ansible->>packer: Overwrite repo files with Pulp repos and update
        packer->>pulp: dnf update
        pulp-->>packer: Package updates
        ansible->>packer: Perform installation tasks
        ansible->>packer: Shutdown
    end
    ansible->>cloud: Create image from Build VM root disk
    destroy packer
    note over cloud: Image created
```

## Cluster Creation

The sequence below assumes that no additional packages are installed beyond
those present in the image, i.e. Ark/local Pulp access is not required.

```mermaid
sequenceDiagram
    participant ansible as Ansible Deploy Host
    participant cloud as Cloud
    rect rgb(204, 232, 252)
        note over ansible: $ ansible-playbook ansible/adhoc/generate-passwords.yml
        ansible->>ansible: Template secrets to inventory group_vars
    end
    rect rgb(204, 232, 252)
        note over ansible: $ tofu apply ...
        ansible->>cloud: Create infra
        create participant nodes as Cluster Instances
        cloud->>nodes: Create instances
    end
    note over nodes: Boot
    rect rgb(204, 232, 252)
        note right of nodes: ansible-init
        nodes->>cloud: Query metadata
        cloud->>nodes: Metadata sent
    end
    rect rgb(204, 232, 252)
        note over ansible: $ ansible-playbook ansible/site.yml
        ansible->>nodes: Wait for ansible-init completion
        ansible->>nodes: Ansible tasks
        note over nodes: All services running
    end
```

## Slurm Controlled Rebuild

This sequence applies to active clusters, after running the `site.yml` playbook
for the first time. Slurm controlled rebuild requires that:
- Compute groups in the OpenTofu `compute` variable have:
    - `ignore_image_changes: true`
    - `compute_init_enable: ['compute', ... ]`
- The Ansible `rebuild` inventory group contains the `control` group.
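
The prerequisites listed above could be checked mechanically before a rebuild is attempted. The following is a minimal Python sketch and is not part of the appliance: the function name `check_rebuild_prereqs`, the example group names, and the dict/list shapes used to represent the OpenTofu `compute` variable and the `rebuild` inventory group are assumptions made for illustration. Only the keys `ignore_image_changes` and `compute_init_enable` and the `control` group name come from the list above.

```python
def check_rebuild_prereqs(compute_groups, rebuild_group_members):
    """Return a list of problems; an empty list means the prerequisites look met.

    compute_groups: dict mapping a compute group name to its settings dict
        (a hypothetical in-memory view of the OpenTofu `compute` variable).
    rebuild_group_members: group names contained in the Ansible `rebuild`
        inventory group (also a hypothetical representation).
    """
    problems = []
    for name, settings in compute_groups.items():
        # Requirement: ignore_image_changes must be explicitly true.
        if settings.get("ignore_image_changes") is not True:
            problems.append(f"compute group '{name}': ignore_image_changes must be true")
        # Requirement: compute_init_enable must list at least 'compute'.
        if "compute" not in settings.get("compute_init_enable", []):
            problems.append(f"compute group '{name}': compute_init_enable must include 'compute'")
    # Requirement: the rebuild inventory group contains the control group.
    if "control" not in rebuild_group_members:
        problems.append("rebuild inventory group must contain the control group")
    return problems


# Example with made-up group names: 'gpu' is missing ignore_image_changes.
example_groups = {
    "general": {"ignore_image_changes": True, "compute_init_enable": ["compute"]},
    "gpu": {"compute_init_enable": ["compute"]},
}
for problem in check_rebuild_prereqs(example_groups, ["control"]):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets all misconfigured groups be reported in a single pass.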

TODO: should also document how compute-init does NOT run if the `site.yml`
playbook has not been run.

```mermaid
sequenceDiagram
    participant ansible as Ansible Deploy Host
    participant cloud as Cloud
    participant nodes as Cluster Instances
    note over ansible: Update OpenTofu cluster_image variable [1]
    rect rgb(204, 232, 250)
        note over ansible: $ tofu apply ...
        ansible<<->>cloud: Check login/compute current vs desired images
        cloud->>nodes: Reimage login and control nodes
        ansible->>ansible: Update inventory/hosts.yml for<br>compute node image_id
    end
    rect rgb(204, 232, 250)
        note over ansible: $ ansible-playbook ansible/site.yml
        ansible->>nodes: Hostvars templated to NFS share
        ansible->>nodes: Ansible tasks
        note over nodes: All services running
    end
    note over nodes: $ srun --reboot ...
    rect rgb(204, 232, 250)
        note over nodes: RebootProgram [2]
        nodes->>cloud: Compare current instance image to target from hostvars
        cloud->>nodes: Reimage if target != current
        rect rgb(252, 200, 100)
            note over nodes: compute-init [3]
            nodes->>nodes: Retrieve hostvars from NFS mount
            nodes->>nodes: Run ansible tasks
            note over nodes: Compute nodes rejoin cluster
        end
    end
    nodes->>nodes: srun task completes
```
Notes:
1. And/or login/compute group overrides
2. Running on control node
3. On hosts targeted by job
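
The image comparison made in the RebootProgram step of the diagram above can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than the appliance's implementation: the function name `nodes_to_reimage`, the hostnames, and the image IDs are all hypothetical, and the real step queries the cloud and reads hostvars rather than taking plain dicts.

```python
def nodes_to_reimage(current_images, target_images):
    """Return the sorted hostnames whose current image differs from the target.

    current_images: dict of hostname -> image ID the instance is running
        (hypothetical stand-in for what the cloud reports).
    target_images: dict of hostname -> image ID wanted for that host
        (hypothetical stand-in for the value taken from hostvars).
    """
    # A node is reimaged only if target != current, as in the diagram.
    return sorted(
        host
        for host, target in target_images.items()
        if current_images[host] != target
    )


# Example with made-up values: only compute-0 is on an old image.
current = {"compute-0": "image-old", "compute-1": "image-new"}
target = {"compute-0": "image-new", "compute-1": "image-new"}
to_reimage = nodes_to_reimage(current, target)
print(to_reimage)
```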