Add docs with sequence diagrams for operations #456
Merged
14 commits
23d55c7 add sequence docs (sjpb)
01dcdba Update docs/sequence.md (sjpb)
5e6f477 Update docs/sequence.md (sjpb)
b46e9f0 Update docs/sequence.md (sjpb)
4a664a4 Added release train to sequence diagram (wtripp180901)
eb7be5c Merge branch 'main' into docs/sequences (bertiethorpe)
c8002b9 Add sequence for slurm controlled rebuild (bertiethorpe)
cb352c4 Merge branch 'main' into docs/sequences (bertiethorpe)
100fdc8 Update sequence.md (bertiethorpe)
83948b1 Update sequence.md (bertiethorpe)
ee201d4 Update sequence.md (bertiethorpe)
8c1e960 Merge branch 'main' into docs/sequences (sjpb)
58d79a7 make Pulp generic for build (sjpb)
435d084 update for pulp, k3s, add boxes for clarity (sjpb)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,133 @@ | ||
# Slurm Appliance Sequences | ||
|
||
|
||
|
||
## Image build | ||
|
||
This sequence applies to both: | ||
- "fatimage" builds, starting from GenericCloud images and using | ||
control,login,compute inventory groups to install all packages, e.g. StackHPC | ||
CI builds | ||
- "extra" builds, starting from StackHPC images and using selected inventory | |
groups to add specific features for a site-specific image. | |
|
||
A generic Pulp server is shown in the diagram below. This may be | |
StackHPC's Ark server or a local Pulp mirroring Ark. A local Pulp is assumed to | |
have already synced the relevant snapshots from Ark (although it is possible | |
to trigger this during an image build). | |
|
||
Note that ansible-init does not run during an image build. It is disabled via | ||
a metadata flag. | ||
|
||
```mermaid | ||
sequenceDiagram | ||
participant ansible as Ansible Deploy Host | ||
participant cloud as Cloud | ||
note over ansible: $ packer build ... | ||
ansible->>cloud: Create VM | ||
create participant packer as Build VM | ||
participant pulp as Pulp | ||
cloud->>packer: Create VM | ||
note over packer: Boot | ||
rect rgb(204, 232, 252) | ||
note right of packer: ansible-init | ||
packer->>cloud: Query metadata | ||
cloud->>packer: Metadata sent | ||
packer->>packer: Skip ansible-init | ||
end | ||
ansible->>packer: Wait for ssh connection | ||
rect rgb(204, 232, 252) | ||
note right of ansible: fatimage.yml | ||
ansible->>packer: Overwrite repo files with Pulp repos and update | ||
packer->>pulp: dnf update | ||
pulp-->>packer: Package updates | ||
ansible->>packer: Perform installation tasks | ||
ansible->>packer: Shutdown | ||
end | ||
ansible->>cloud: Create image from Build VM root disk | ||
destroy packer | ||
note over cloud: Image created | ||
``` | ||
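The ansible-init step in the diagram above (query metadata, then skip) can be sketched as a small decision function. This is only an illustration of the idea: the metadata key `skip_ansible_init` and the function name are hypothetical, since the actual flag name is not given here.

```python
def should_run_ansible_init(metadata: dict) -> bool:
    """Return False when instance metadata asks for ansible-init to be skipped.

    NOTE: "skip_ansible_init" is a hypothetical key used for illustration;
    the real flag name is not stated in this document.
    """
    flag = str(metadata.get("skip_ansible_init", "false")).strip().lower()
    return flag not in ("true", "1", "yes")


# Build VM created by Packer: the flag is set, so ansible-init is skipped.
build_vm_runs = should_run_ansible_init({"skip_ansible_init": "true"})

# Cluster instance: no flag in metadata, so ansible-init runs as normal.
cluster_node_runs = should_run_ansible_init({})
```

Defaulting to "run" when the flag is absent mirrors the Cluster Creation sequence, where instances query metadata and ansible-init proceeds.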
|
||
## Cluster Creation | ||
|
||
The sequence below assumes that no additional packages are installed beyond | |
what is present in the image, i.e. Ark/local Pulp access is not required. | |
|
||
```mermaid | ||
sequenceDiagram | ||
participant ansible as Ansible Deploy Host | ||
participant cloud as Cloud | ||
rect rgb(204, 232, 252) | ||
note over ansible: $ ansible-playbook ansible/adhoc/generate-passwords.yml | ||
ansible->>ansible: Template secrets to inventory group_vars | ||
|
||
end | ||
rect rgb(204, 232, 252) | ||
note over ansible: $ tofu apply ... | ||
ansible->>cloud: Create infra | ||
create participant nodes as Cluster Instances | ||
cloud->>nodes: Create instances | ||
end | ||
note over nodes: Boot | ||
rect rgb(204, 232, 252) | ||
note right of nodes: ansible-init | ||
nodes->>cloud: Query metadata | ||
cloud->>nodes: Metadata sent | ||
end | ||
rect rgb(204, 232, 252) | ||
note over ansible: $ ansible-playbook ansible/site.yml | ||
ansible->>nodes: Wait for ansible-init completion | ||
ansible->>nodes: Ansible tasks | ||
note over nodes: All services running | ||
end | ||
``` | ||
|
||
## Slurm Controlled Rebuild | ||
|
||
This sequence applies to active clusters, after running the `site.yml` playbook | ||
for the first time. Slurm controlled rebuild requires that: | ||
- Compute groups in the OpenTofu `compute` variable have: | ||
- `ignore_image_changes: true` | ||
- `compute_init_enable: ['compute', ... ]` | ||
- The Ansible `rebuild` inventory group contains the `control` group. | ||
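The requirements listed above could be checked before attempting a rebuild with something like the following sketch. It assumes the OpenTofu `compute` variable has been loaded as a mapping of group name to settings and that the members of the `rebuild` inventory group are available as a list; both shapes, the function name, and the example group names `general` and `gpu` are hypothetical.

```python
def rebuild_preconditions_met(compute_groups: dict, rebuild_group_children: list) -> list:
    """Return a list of problems; an empty list means the requirements hold.

    The data shapes are assumptions made for this sketch, not the
    appliance's actual data model.
    """
    problems = []
    for name, group in compute_groups.items():
        if group.get("ignore_image_changes") is not True:
            problems.append(f"{name}: ignore_image_changes must be true")
        if "compute" not in group.get("compute_init_enable", []):
            problems.append(f"{name}: compute_init_enable must include 'compute'")
    if "control" not in rebuild_group_children:
        problems.append("rebuild inventory group must contain the control group")
    return problems


# Hypothetical example: "general" satisfies both settings, "gpu" satisfies neither.
example_compute = {
    "general": {"ignore_image_changes": True, "compute_init_enable": ["compute"]},
    "gpu": {"ignore_image_changes": False, "compute_init_enable": []},
}
problems = rebuild_preconditions_met(example_compute, ["control"])
```

Returning a list of problems rather than a single boolean makes it easier to report every misconfigured group in one pass.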
|
||
TODO: should also document how compute-init does NOT run if the `site.yml` | ||
playbook has not been run. | ||
|
||
```mermaid | ||
sequenceDiagram | ||
participant ansible as Ansible Deploy Host | ||
participant cloud as Cloud | ||
participant nodes as Cluster Instances | ||
note over ansible: Update OpenTofu cluster_image variable [1] | ||
rect rgb(204, 232, 250) | ||
note over ansible: $ tofu apply .... | ||
ansible<<->>cloud: Check login/compute current vs desired images | ||
cloud->>nodes: Reimage login and control nodes | ||
ansible->>ansible: Update inventory/hosts.yml for<br>compute node image_id | ||
end | ||
rect rgb(204, 232, 250) | ||
note over ansible: $ ansible-playbook ansible/site.yml | ||
ansible->>nodes: Hostvars templated to nfs share | ||
ansible->>nodes: Ansible tasks | ||
note over nodes: All services running | |
end | ||
note over nodes: $ srun --reboot ... | ||
rect rgb(204, 232, 250) | ||
note over nodes: RebootProgram [2] | ||
nodes->>cloud: Compare current instance image to target from hostvars | ||
cloud->>nodes: Reimage if target != current | ||
rect rgb(252, 200, 100) | ||
note over nodes: compute-init [3] | ||
nodes->>nodes: Retrieve hostvars from nfs mount | ||
nodes->>nodes: Run ansible tasks | ||
note over nodes: Compute nodes rejoin cluster | ||
end | ||
end | ||
nodes->>nodes: srun task completes | ||
``` | ||
Notes: | ||
1. And/or login/compute group overrides | ||
2. Running on control node | ||
3. On hosts targeted by job | ||
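The comparison made by the RebootProgram step in the diagram above can be sketched as follows. The function name and the returned action strings are hypothetical, and falling back to a plain reboot when the images already match is an assumption; the diagram only states that a reimage happens when the target differs from the current image.

```python
def plan_reboot(current_image_id: str, target_image_id: str) -> str:
    """Decide what to do with a node that Slurm has asked to reboot.

    "rebuild" and "reboot" are illustrative labels, not real API values.
    """
    if target_image_id != current_image_id:
        # Target image from hostvars differs: ask the cloud to reimage the node.
        return "rebuild"
    # Assumed behaviour: images match, so an ordinary reboot is sufficient.
    return "reboot"


# Node still on the old image while hostvars point at a new one.
action_outdated = plan_reboot("image-old", "image-new")
# Node already running the target image.
action_current = plan_reboot("image-new", "image-new")
```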
|