| 1 | +# Slurm Appliance Sequences
| 2 | + |
| 3 | + |
| 4 | + |
| 5 | +## Image build |
| 6 | + |
| 7 | +This sequence applies to both: |
| 8 | +- "fatimage" builds, starting from GenericCloud images and using the
| 9 | + `control`, `login` and `compute` inventory groups to install all packages,
| 10 | + e.g. StackHPC CI builds
| 11 | +- "extra" builds, starting from StackHPC images and using selected inventory |
| 12 | + groups to add specific features for a site-specific image.
| 13 | + |
| 14 | +The diagram below shows a generic Pulp server. This may be StackHPC's Ark
| 15 | +server or a local Pulp mirroring Ark. A local Pulp is assumed to have already
| 16 | +synced the relevant snapshots from Ark (although a sync can also be triggered
| 17 | +during an image build).
| 18 | + |
| 19 | +Note that ansible-init does not run during an image build. It is disabled via |
| 20 | +a metadata flag. |
| 21 | + |
| 22 | +```mermaid |
| 23 | +sequenceDiagram |
| 24 | + participant ansible as Ansible Deploy Host |
| 25 | + participant cloud as Cloud |
| 26 | + note over ansible: $ packer build ... |
| 27 | + ansible->>cloud: Create VM |
| 28 | + create participant packer as Build VM |
| 29 | + participant pulp as Pulp |
| 30 | + cloud->>packer: Create VM |
| 31 | + note over packer: Boot |
| 32 | + rect rgb(204, 232, 252) |
| 33 | + note right of packer: ansible-init |
| 34 | + packer->>cloud: Query metadata |
| 35 | + cloud->>packer: Metadata sent |
| 36 | + packer->>packer: Skip ansible-init |
| 37 | + end |
| 38 | + ansible->>packer: Wait for ssh connection |
| 39 | + rect rgb(204, 232, 252) |
| 40 | + note right of ansible: fatimage.yml |
| 41 | + ansible->>packer: Overwrite repo files with Pulp repos and update |
| 42 | + packer->>pulp: dnf update |
| 43 | + pulp-->>packer: Package updates |
| 44 | + ansible->>packer: Perform installation tasks |
| 45 | + ansible->>packer: Shutdown |
| 46 | + end |
| 47 | + ansible->>cloud: Create image from Build VM root disk |
| 48 | + destroy packer |
| 49 | + note over cloud: Image created |
| 50 | +``` |
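
The "Query metadata / Skip ansible-init" step in the diagram above can be sketched roughly as follows. This is a minimal illustration only, not the actual ansible-init implementation: the metadata key `ansible_init_disable` and the function name are hypothetical, and the real flag name may differ.

```python
def should_run_ansible_init(metadata: dict) -> bool:
    """Return False when instance metadata asks for ansible-init to be skipped."""
    # "ansible_init_disable" is a hypothetical key name, used for illustration only.
    flag = str(metadata.get("ansible_init_disable", "false")).strip().lower()
    return flag not in {"1", "true", "yes"}


# A Packer build VM would carry the flag; a normal cluster instance would not.
build_vm_metadata = {"ansible_init_disable": "true"}
cluster_node_metadata = {}

print(should_run_ansible_init(build_vm_metadata))      # False
print(should_run_ansible_init(cluster_node_metadata))  # True
```

Defaulting to "run" when the key is absent matches the cluster creation sequence, where ansible-init runs without any special metadata being set.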
| 51 | + |
| 52 | +## Cluster Creation |
| 53 | + |
| 54 | +The sequence below assumes that no additional packages are installed beyond
| 55 | +what is present in the image, i.e. Ark/local Pulp access is not required.
| 56 | + |
| 57 | +```mermaid |
| 58 | +sequenceDiagram |
| 59 | + participant ansible as Ansible Deploy Host |
| 60 | + participant cloud as Cloud |
| 61 | + rect rgb(204, 232, 252) |
| 62 | + note over ansible: $ ansible-playbook ansible/adhoc/generate-passwords.yml |
| 63 | + ansible->>ansible: Template secrets to inventory group_vars |
| 64 | + end |
| 65 | + rect rgb(204, 232, 252) |
| 66 | + note over ansible: $ tofu apply ... |
| 67 | + ansible->>cloud: Create infra |
| 68 | + create participant nodes as Cluster Instances |
| 69 | + cloud->>nodes: Create instances |
| 70 | + end |
| 71 | + note over nodes: Boot |
| 72 | + rect rgb(204, 232, 252) |
| 73 | + note right of nodes: ansible-init |
| 74 | + nodes->>cloud: Query metadata |
| 75 | + cloud->>nodes: Metadata sent |
| 76 | + end |
| 77 | + rect rgb(204, 232, 252) |
| 78 | + note over ansible: $ ansible-playbook ansible/site.yml |
| 79 | + ansible->>nodes: Wait for ansible-init completion |
| 80 | + ansible->>nodes: Ansible tasks |
| 81 | + note over nodes: All services running |
| 82 | + end |
| 83 | +``` |
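
The "Wait for ansible-init completion" step above is a wait-with-timeout pattern. The sketch below shows a generic polling loop for that pattern; it is an assumption about the general shape only, not how `site.yml` actually implements the wait. The function name, the `is_done` callable and the default timings are all hypothetical.

```python
import time


def wait_for(is_done, timeout_s=600.0, interval_s=5.0,
             sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll is_done() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False  # gave up: the condition never became true
        sleep(interval_s)


# Example: a stand-in check that succeeds on the third poll.
polls = {"count": 0}


def fake_ansible_init_finished() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


# sleep is replaced with a no-op so the example finishes immediately.
print(wait_for(fake_ansible_init_finished, sleep=lambda s: None))  # True
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.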
| 84 | + |
| 85 | +## Slurm Controlled Rebuild |
| 86 | + |
| 87 | +This sequence applies to active clusters, after running the `site.yml` playbook |
| 88 | +for the first time. Slurm controlled rebuild requires that: |
| 89 | +- Compute groups in the OpenTofu `compute` variable have: |
| 90 | + - `ignore_image_changes: true` |
| 91 | + - `compute_init_enable: ['compute', ... ]` |
| 92 | +- The Ansible `rebuild` inventory group contains the `control` group. |
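
The requirements above can be expressed as a small pre-flight check. This is a hedged sketch: it assumes the OpenTofu `compute` variable and the `rebuild` inventory group have already been loaded into plain Python structures, and the function name and the example group names (`general`, `gpu`) are hypothetical.

```python
def rebuild_prerequisite_problems(compute_groups: dict, rebuild_children: set) -> list:
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    for name, group in compute_groups.items():
        # Each compute group must opt out of OpenTofu-driven image changes.
        if group.get("ignore_image_changes") is not True:
            problems.append(f"{name}: ignore_image_changes must be true")
        # compute-init must be enabled with at least the 'compute' feature.
        if "compute" not in group.get("compute_init_enable", []):
            problems.append(f"{name}: compute_init_enable must include 'compute'")
    # The rebuild inventory group must contain the control group.
    if "control" not in rebuild_children:
        problems.append("rebuild inventory group must contain the control group")
    return problems


# Hypothetical example data: one correctly configured group, one not.
compute = {
    "general": {"ignore_image_changes": True, "compute_init_enable": ["compute"]},
    "gpu": {"ignore_image_changes": False, "compute_init_enable": []},
}
for problem in rebuild_prerequisite_problems(compute, rebuild_children={"control"}):
    print(problem)
```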
| 93 | + |
| 94 | +TODO: should also document how compute-init does NOT run if the `site.yml` |
| 95 | +playbook has not been run. |
| 96 | + |
| 97 | +```mermaid |
| 98 | +sequenceDiagram |
| 99 | + participant ansible as Ansible Deploy Host |
| 100 | + participant cloud as Cloud |
| 101 | + participant nodes as Cluster Instances |
| 102 | + note over ansible: Update OpenTofu cluster_image variable [1] |
| 103 | + rect rgb(204, 232, 250) |
| 104 | + note over ansible: $ tofu apply .... |
| 105 | + ansible<<->>cloud: Check login/compute current vs desired images |
| 106 | + cloud->>nodes: Reimage login and control nodes |
| 107 | + ansible->>ansible: Update inventory/hosts.yml for<br>compute node image_id |
| 108 | + end |
| 109 | + rect rgb(204, 232, 250) |
| 110 | + note over ansible: $ ansible-playbook ansible/site.yml |
| 111 | + ansible->>nodes: Hostvars templated to nfs share |
| 112 | + ansible->>nodes: Ansible tasks |
| 113 | + note over nodes: All services running
| 114 | + end |
| 115 | + note over nodes: $ srun --reboot ... |
| 116 | + rect rgb(204, 232, 250) |
| 117 | + note over nodes: RebootProgram [2] |
| 118 | + nodes->>cloud: Compare current instance image to target from hostvars |
| 119 | + cloud->>nodes: Reimage if target != current |
| 120 | + rect rgb(252, 200, 100) |
| 121 | + note over nodes: compute-init [3] |
| 122 | + nodes->>nodes: Retrieve hostvars from nfs mount |
| 123 | + nodes->>nodes: Run ansible tasks |
| 124 | + note over nodes: Compute nodes rejoin cluster |
| 125 | + end |
| 126 | + end |
| 127 | + nodes->>nodes: srun task completes |
| 128 | +``` |
| 129 | +Notes: |
| 130 | +1. And/or login/compute group overrides |
| 131 | +2. Running on control node |
| 132 | +3. On hosts targeted by job |
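
The RebootProgram decision in the diagram ("Reimage if target != current") can be sketched as below. This is illustrative only and not the appliance's actual RebootProgram: the function name and data shapes are hypothetical, and falling back to a plain reboot when the images match, or when no target is known, is an assumption.

```python
def plan_reboot(current_images: dict, hostvars: dict) -> dict:
    """Map each node name to 'reimage' or 'reboot'.

    current_images: node name -> image ID the instance is currently running.
    hostvars: node name -> dict holding the target 'image_id' for that node.
    """
    plan = {}
    for node, current in current_images.items():
        target = hostvars.get(node, {}).get("image_id")
        # Assumption: with no known target, or a matching image, just reboot.
        if target is not None and target != current:
            plan[node] = "reimage"
        else:
            plan[node] = "reboot"
    return plan


# Hypothetical node names and image IDs.
current = {"compute-0": "img-old", "compute-1": "img-new"}
targets = {"compute-0": {"image_id": "img-new"}, "compute-1": {"image_id": "img-new"}}
print(plan_reboot(current, targets))
# {'compute-0': 'reimage', 'compute-1': 'reboot'}
```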
| 133 | + |