Commit 781c2d4

sjpb and sd109 authored
Add more information re. configuring production sites (#508)
* add lots of info to production docs
* Production docs tweaks from review

  Co-authored-by: Scott Davidson <[email protected]>
* add prod docs comment re login FIPs

---------

Co-authored-by: Scott Davidson <[email protected]>
1 parent 50fc320 commit 781c2d4

1 file changed

+145
-5
lines changed

docs/production.md

+145-5
@@ -1,9 +1,149 @@
11
# Production Deployments
22

3-
This page contains some brief notes about differences between the default/demo configuration, as described in the main [README.md](../README.md) and production-ready deployments.
3+
This page contains some brief notes about differences between the default/demo
4+
configuration (as described in the main [README.md](../README.md)) and
5+
production-ready deployments.
6+
7+
- Agree the cluster names up front. Changing them later
8+
requires instance deletion/recreation.
9+
10+
- At least three environments should be created:
11+
- `site`: site-specific base environment
12+
- `production`: production environment
13+
- `staging`: staging environment
14+
15+
A `dev` environment should also be created if considered required, or this
16+
  can be left until later.
17+
18+
  These can all be produced using the cookiecutter instructions, but the
19+
`production` and `staging` environments will need their
20+
`environments/$ENV/ansible.cfg` file modifying so that they point to the
21+
`site` environment:
22+
23+
```ini
24+
inventory = ../common/inventory,../site/inventory,inventory
25+
```
26+
27+
- To avoid divergence of configuration, all possible overrides for group/role
28+
vars should be placed in `environments/site/inventory/group_vars/all/*.yml`
29+
unless the value really is environment-specific (e.g. DNS names for
30+
`openondemand_servername`).
31+
32+
- Where possible hooks should also be placed in `environments/site/hooks/`
33+
  and referenced from the `staging` and `production` environments, e.g.:
34+
35+
```yaml
36+
# environments/production/hooks/pre.yml:
37+
- name: Import parent hook
38+
import_playbook: "{{ lookup('env', 'APPLIANCES_ENVIRONMENT_ROOT') }}/../site/hooks/pre.yml"
39+
```
40+
41+
- OpenTofu configurations should be defined in the `site` environment and used
42+
as a module from the other environments. This can be done with the
43+
cookie-cutter generated configurations:
44+
- Delete the *contents* of the cookie-cutter generated `terraform/` directories
45+
from the `production` and `staging` environments.
46+
- Create a `main.tf` in those directories which uses `site/terraform/` as a
47+
[module](https://opentofu.org/docs/language/modules/), e.g. :
48+
49+
    ```hcl
50+
...
51+
module "cluster" {
52+
source = "../../site/terraform/"
53+
54+
cluster_name = "foo"
55+
...
56+
}
57+
```
58+
59+
Note that:
60+
- Environment-specific variables (`cluster_name`) should be hardcoded
61+
into the module block.
62+
- Environment-independent variables (e.g. maybe `cluster_net` if the
63+
same is used for staging and production) should be set as *defaults*
64+
in `environments/site/terraform/variables.tf`, and then don't need to
65+
be passed in to the module.
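
  As a minimal sketch of the second point, a default might be declared in the
  `site` environment roughly as follows. This assumes `cluster_net` is a plain
  string variable; the value `example-net` and the description text are
  placeholders, not taken from the appliance.

  ```hcl
  # environments/site/terraform/variables.tf
  # Assumption: cluster_net is a string variable; "example-net" is a placeholder.
  variable "cluster_net" {
    type        = string
    description = "Name of the existing network shared by staging and production"
    default     = "example-net"
  }
  ```

  With a default in place, the `module "cluster"` blocks in the `production`
  and `staging` environments only need to set values that actually differ,
  such as `cluster_name`.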
66+
67+
- Vault-encrypt secrets. Running the `generate-passwords.yml` playbook creates
68+
a secrets file at `environments/$ENV/inventory/group_vars/all/secrets.yml`.
69+
To ensure staging environments are a good model for production this should
70+
  generally be moved into the `site` environment. It should be encrypted
71+
using [Ansible vault](https://docs.ansible.com/ansible/latest/user_guide/vault.html)
72+
and then committed to the repository.
73+
74+
- Ensure created instances have accurate/synchronised time. For VM instances
75+
this is usually provided by the hypervisor, but if not (or for bare metal
76+
instances) it may be necessary to configure or proxy `chronyd` via an
77+
environment hook.
78+
79+
- The cookiecutter provided OpenTofu configurations define resources for home and
80+
state volumes. The former may not be required if the cluster's `/home` is
81+
provided from an external filesystem (or Manila). In any case, in at least
82+
the production environment, and probably also in the staging environment,
83+
the volumes should be manually created and the resources changed to [data
84+
resources](https://opentofu.org/docs/language/data-sources/). This ensures that even if the cluster is deleted via tofu, the
85+
volumes will persist.
86+
87+
For a development environment, having volumes under tofu control via volume
88+
resources is usually appropriate as there may be many instantiations
89+
of this environment.
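
  A rough sketch of what the change to a data source could look like is shown
  below. It assumes the OpenStack provider's `openstack_blockstorage_volume_v3`
  data source and a volume created manually beforehand; the volume name
  `production-state` and the label `state` are hypothetical, and the names used
  in the cookiecutter-generated configuration may differ.

  ```hcl
  # Assumption: a volume named "production-state" already exists in the project.
  # A data source only looks the volume up; tofu never creates or destroys it.
  data "openstack_blockstorage_volume_v3" "state" {
    name = "production-state"
  }

  # Anything that previously read the ID from the managed resource would then
  # read it from the data source instead, e.g.:
  #   volume_id = data.openstack_blockstorage_volume_v3.state.id
  ```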
90+
91+
- Enable `etc_hosts` templating:
92+
93+
    ```ini
94+
# environments/site/inventory/groups:
95+
[etc_hosts:children]
96+
cluster
97+
```
498

5-
- Create a site environment. Usually at least production, staging and possibly development environments are required. To avoid divergence of configuration these should all have an `inventory` path referencing a shared, site-specific base environment. Where possible hooks should also be placed in this site-specific environment.
6-
- Vault-encrypt secrets. Running the `generate-passwords.yml` playbook creates a secrets file at `environments/$ENV/inventory/group_vars/all/secrets.yml`. To ensure staging environments are a good model for production this should generally be moved into the site-specific environment. It can be be encrypted using [Ansible vault](https://docs.ansible.com/ansible/latest/user_guide/vault.html) and then committed to the repository.
7-
- Ensure created instances have accurate/synchronised time. For VM instances this is usually provided by the hypervisor, but if not (or for bare metal instances) it may be necessary to configure or proxy `chronyd` via an environment hook.
8-
- Remove production volumes from OpenTofu control. In the default OpenTofu configuration, deleting the resources also deletes the volumes used for persistent state and home directories. This is usually undesirable for production, so these resources should be removed from the OpenTofu configurations and manually deployed once. However note that for development environments leaving them under OpenTofu control is usually best.
999
- Configure Open OnDemand - see [specific documentation](openondemand.README.md).
100+
101+
- Modify `environments/site/terraform/nodes.tf` to provide fixed IPs for at least
102+
the control node, and (if not using FIPs) the login node(s):
103+
104+
    ```hcl
105+
resource "openstack_networking_port_v2" "control" {
106+
...
107+
fixed_ip {
108+
subnet_id = data.openstack_networking_subnet_v2.cluster_subnet.id
109+
ip_address = var.control_ip_address
110+
}
111+
}
112+
```
113+
114+
    Note the variable `control_ip_address` is new, so it must also be declared (e.g. in `variables.tf`).
115+
116+
Using fixed IPs will require either using admin credentials or policy changes.
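
    As a sketch only, the new variable might be declared and set along these
    lines. The description text and the address `10.0.0.10` are placeholders;
    leaving out a default is a design choice here, so that each environment
    has to supply its own address.

    ```hcl
    # environments/site/terraform/variables.tf
    # Hypothetical declaration; no default, so each environment must set it.
    variable "control_ip_address" {
      type        = string
      description = "Fixed IP address for the control node port"
    }
    ```

    ```hcl
    # environments/production/terraform/main.tf
    module "cluster" {
      source = "../../site/terraform/"

      cluster_name       = "foo"
      control_ip_address = "10.0.0.10" # placeholder; must lie within the cluster subnet
      # ... other variables omitted
    }
    ```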
117+
118+
- If floating IPs are required for login nodes, modify the OpenTofu configurations
119+
appropriately.
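
  One possible shape for this, as a sketch under stated assumptions: it uses
  the OpenStack provider's `openstack_networking_floatingip_associate_v2`
  resource, a hypothetical variable `login_fip_address` holding an
  already-allocated floating IP, and a single login port resource named
  `login` without `count` or `for_each`. The real configuration may be
  structured differently.

  ```hcl
  # Hypothetical variable: a floating IP that was allocated manually beforehand.
  variable "login_fip_address" {
    type        = string
    description = "Pre-allocated floating IP address for the login node"
  }

  # Assumption: nodes.tf defines a single port resource labelled "login".
  resource "openstack_networking_floatingip_associate_v2" "login" {
    floating_ip = var.login_fip_address
    port_id     = openstack_networking_port_v2.login.id
  }
  ```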
120+
121+
- Enable persisting login node hostkeys so that users do not get SSH warning
122+
messages on reimage:
123+
124+
    ```ini
125+
# environments/site/inventory/groups:
126+
[persist_hostkeys:children]
127+
login
128+
```
129+
And configure NFS to include exporting the state directory to these hosts:
130+
131+
```yaml
132+
# environments/common/inventory/group_vars/all/nfs.yml:
133+
nfs_configurations:
134+
        # ... potentially, /home definition from common environment
135+
- comment: Export state directory to login nodes
136+
nfs_enable:
137+
server: "{{ inventory_hostname in groups['control'] }}"
138+
clients: "{{ inventory_hostname in groups['login'] }}"
139+
nfs_server: "{{ nfs_server_default }}"
140+
nfs_export: "/var/lib/state"
141+
nfs_client_mnt_point: "/var/lib/state"
142+
```
143+
See [issue 506](https://github.com/stackhpc/ansible-slurm-appliance/issues/506).
144+
145+
- Consider whether mapping of baremetal nodes to ironic nodes is required. See
146+
[PR 485](https://github.com/stackhpc/ansible-slurm-appliance/pull/485).
147+
148+
- Note [PR 473](https://github.com/stackhpc/ansible-slurm-appliance/pull/473)
149+
may help identify any site-specific configuration.
