Commit d565601
Add Exostellar Workload Optimizer support (#313)
Add new Xwo key to config. This will replace Xio in the future. For now you can use either, but not both.
Check that certs are installed.
Add XWO docs.
Add Mac support: detect macOS and use brew to install required packages. Updated to latest Node.js and CDK on macOS.
Add slurmd parameter for L3 cache as socket to improve core affinity.
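The last item presumably refers to Slurm's `SlurmdParameters=l3cache_as_socket` option; a minimal `slurm.conf` sketch, assuming Slurm 22.05 or later where this option exists:

```
# slurm.conf sketch: treat each L3 cache domain as a socket so jobs are
# packed within one L3 domain, improving core affinity on AMD CPUs.
SlurmdParameters=l3cache_as_socket
```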
1 parent 1a6a4fb commit d565601

30 files changed: +2545 -128 lines changed

cdk.context.json (-5)

This file was deleted.

docs/exostellar-infrastructure-optimizer.md (+3 -3)

@@ -1,6 +1,6 @@
 # Exostellar Infrastructure Optimizer (XIO)
 
-[Exostellar Infrastructure Optimizer](https://exostellar.io/infrastructureoptimizer-technical-information/) (XIO) runs applications in virtual machines (VMs) on EC2 instances and can dynamically migrate the VMs between instances based on availability and cost.
+[Exostellar Infrastructure Optimizer](https://exostellar.io/product/#infrastructureoptimizer) (XIO) runs applications in virtual machines (VMs) on EC2 instances and can dynamically migrate the VMs between instances based on availability and cost.
 Long-running, stateful jobs are not normally run on spot instances because of the risk of lost work after a spot termination.
 XIO reduces this risk by predicting spot terminations and migrating the VM to another instance with higher availability.
 This could be a different spot instance type or an on-demand instance.

@@ -678,13 +678,13 @@ Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`
 
 ### XIO Controller not starting
 
-On EMA, check that a job is running to create the controller.
+On EMS, check that a job is running to create the controller.
 
 `squeue`
 
 On EMS, check the autoscaling log to see if there are errors starting the instance.
 
-`less /var/log/slurm/autoscaling.log``
+`less /var/log/slurm/autoscaling.log`
 
 EMS Slurm partitions are at:
 
docs/exostellar-workload-optimizer.md (+240, new file)

# Exostellar Workload Optimizer (XWO)

[Exostellar Workload Optimizer](https://exostellar.io/product/#workloadoptimizer) (XWO) runs applications in virtual machines (VMs) on EC2 instances and can dynamically migrate the VMs between instances based on actual memory utilization.
This can provide significant savings when users over-provision memory or provision memory based on peak usage, because jobs can run on instances with less memory when the extra memory isn't required.
It also, optionally, provides the functionality of [Infrastructure Optimizer](https://exostellar.io/product/#infrastructureoptimizer), which migrates VMs between Spot and On-Demand instances based on availability and cost.

XWO runs on an Exostellar Management Server (EMS).
The EMS runs a web application and launches and manages the instances that run jobs.
In response to job requests it launches controller nodes that manage pools of worker nodes.
The controller launches workers and then starts one or more VMs on the workers.
The controller also determines when VMs need to be migrated, allocates new workers, and manages the VM migrations.

XWO Profiles configure XWO Controllers and XWO Workers.
The Workers run the XWO VMs.
The Controller manages the Workers and the VMs that run on them.
The Worker configuration includes the instance types to use for on-demand and spot instances.
It also includes the security groups and tags for the worker instances.

You create an XWO Application Environment for each Slurm cluster.
The Application Environment contains the URL for the Slurm head node, configures pools of VMs, and configures the path to the Slurm binaries and configuration.
The VM pools define the attributes of the instances, including the number of CPUs, VM Image, min and max memory, and an associated XWO Profile.

You must also create XWO Images that are used to create the VMs.
The Images are created from AWS AMIs and are specified in the Pools.

**NOTE:** One current restriction of XWO VMs is that they cannot be created from ParallelCluster AMIs.
This is because the kernel modules that ParallelCluster installs aren't supported by the XWO hypervisor.

## XWO Configuration

This section describes the process of configuring XWO to work with ParallelCluster.

Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/HPC-User/getting-started-installation) to make sure you have the latest instructions.

### Deploy ParallelCluster without configuring XWO

First deploy your cluster without configuring XWO.
The cluster deploys ansible playbooks that will be used to create the XWO ParallelCluster AMI.

### Deploy the Exostellar Management Server (EMS)

The next step is to [install the Exostellar management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server).
You must first subscribe to the three Exostellar Infrastructure AMIs in the AWS Marketplace.

* [Exostellar Management Server](https://aws.amazon.com/marketplace/server/procurement?productId=prod-crdnafbqnbnm2)
* [Exostellar Controller](https://aws.amazon.com/marketplace/server/procurement?productId=prod-d4lifqwlw4kja)
* [Exostellar Worker](https://aws.amazon.com/marketplace/server/procurement?productId=prod-2smeyk5fuxt7q)

Then follow the [directions to deploy the CloudFormation template](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server#v2.4.0.0InstallingwithCloudFormationTemplate(AWS)-Step3:CreateaNewStack).
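If you prefer the CLI to the console, a rough sketch of the stack creation follows. The template file, its parameters, and any required capabilities come from Exostellar's documentation and are placeholders here; the stack name should match the `ManagementServerStackName` used in the configuration below.

```
# Sketch only: the template file and capabilities are placeholders; get the
# real values from Exostellar's CloudFormation instructions.
aws cloudformation create-stack \
    --stack-name exostellar-management-server \
    --template-body file://ems-template.yaml \
    --capabilities CAPABILITY_NAMED_IAM
```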
56+
57+
### Verify that the "az1" profile exists
58+
59+
In the EMS GUI go to Profiles and make sure that the "az1" profile exists.
60+
I use that as a template to create your new profiles.
61+
62+
If it doesn't exist, there was a problem with the EMS deployment and you should contact Exostellar support.
63+
64+
### Create an XWO ParallelCluster AMI
65+
66+
Launch an instance using the base AMI for your OS.
67+
For example, launch an instance with a base RHEL 8 or Rocky 8 AMI.
68+
69+
Mount the ParallelCluster NFS file system at /opt/slurm.
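A minimal sketch of the mount, assuming the cluster's head node exports /opt/slurm over NFS; the host name is a placeholder for your head node's DNS name or IP:

```
# Placeholder host: substitute your cluster's head node or NFS server.
sudo mkdir -p /opt/slurm
sudo mount -t nfs <head-node-dns-name>:/opt/slurm /opt/slurm
```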

Run the ansible playbook to configure the instance for XWO.

```
/opt/slurm/config/bin/exostellar-compute-node-ami-configure.sh
```

Do any additional configuration that you require, such as configuring file system mounts and installing packages.

Create an AMI from the instance and wait for it to become available.
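For example, with the AWS CLI; the instance ID, AMI name, and returned AMI ID below are placeholders:

```
# Placeholders: substitute your instance ID and a descriptive AMI name.
aws ec2 create-image --instance-id i-0123456789abcdef0 \
    --name xwo-parallelcluster-ami
# Wait for the AMI ID returned by the previous command to become available.
aws ec2 wait image-available --image-ids ami-0123456789abcdef0
```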

After the AMI has been successfully created, you can either stop or terminate the instance to save costs.
If you may need to do additional customization, stop it; otherwise, terminate it.

Add the image ID to your configuration as described below.

### Create XWO Configuration

The next step is to plan and configure your XWO deployment.
The key decisions that you must make are the instance types that you will use and the AMI that you will use for the XWO VM Images.

XWO currently only supports x86_64 instance types, and pools cannot mix AMD and Intel instance types.
The configuration has been simplified so that all you have to do is specify the instance types and families that you want to use.
The instance types will be grouped by the number of cores and amount of memory to create pools and Slurm partitions.
You can still create your own profiles and pools if the automatically generated ones do not meet your needs.
The example only shows the simplified configuration.

**NOTE**: XWO currently doesn't support VMs larger than 1 TB.

Refer to [Best practices for Amazon EC2 Spot](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html) when planning your cluster deployment and creating your configuration.

It is highly recommended to use [EC2 Spot placement scores](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/work-with-spot-placement-score.html) when selecting the region and availability zone for your cluster.
This will give you an indication of the likelihood of getting the desired spot capacity.
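For example, with the AWS CLI; the instance types, capacity, and region below are illustrative only:

```
# Example: score the chance of getting 10 instances of these types
# in a single availability zone in us-east-1.
aws ec2 get-spot-placement-scores \
    --instance-types c5a.xlarge m5a.xlarge r5a.xlarge \
    --target-capacity 10 \
    --region-names us-east-1 \
    --single-availability-zone
```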

In the following example, I've included instance families from the last 3 generations of instances to maximize the number of available capacity pools and increase the likelihood of running on spot.

**Note**: The Intel instance families contain more configurations and higher-memory instances. They also have high-frequency instance types such as m5zn, r7iz, and z1d, and they tend to have more capacity. The AMD instance families include HPC instance types; however, those do not support spot pricing and can only be used on-demand.

**Note**: This is only an example configuration. You should customize it for your requirements.

```
slurm:
  Xwo:
    ManagementServerStackName: exostellar-management-server
    PartitionName: xwo
    AvailabilityZone: us-east-1b

    Images:
      - ImageId: ami-xxxxxxxxxxxxxxxxx
        ImageName: <your-xio-vm-image-name>

    DefaultImageName: <your-xio-vm-image-name>

    InstanceTypes:
      - c5a:1
      - m5a:2
      - r5a:3
    SpotFleetTypes:
      - c5a:1
      - m5a:2
      - r5a:3

    PoolSize: 10
    EnableHyperthreading: false
    VmRootPasswordSecret: ExostellarVmRootPassword
```

### Update the cluster with the XWO configuration

Update the cluster with the XWO configuration.

This will update the profiles and environment on the EMS server and configure the cluster for XWO.
The only remaining step before you can submit jobs is to create the XWO VM image.

The cluster is updated before creating the image because this step deploys the XWO scripts.

### Create an XWO Image from the XWO ParallelCluster AMI

Connect to the head node and create the XWO Image from the AMI you created.
The IMAGE-NAME should be the same as the one you configured in the Pools.

```
/opt/slurm/etc/exostellar/parse_helper.sh -a <AMI-ID1> -i <IMAGE-NAME>
```

### Configure the Exostellar Certificates on the Head Node

The resume script needs an Exostellar certificate authority and client security certificate to be able to call the REST API on the EMS.
Download the certificates and copy them to the head node.

* Open the EMS in a browser.
* Click on Settings.
* Select the Certificates tab.
* Click on **Generate Client Certificate**, name it **ExostellarClient.pem**, and save it in the Downloads folder.
* Click on **Download Exostellar CA**, name it **ExostellarRootCA.crt**, and save it in the Downloads folder.
* Copy the two certificates to /etc/ssl/certs/ on the head node, for example as sketched below.
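A minimal sketch of the copy, assuming SSH access to the head node; the user and host are placeholders:

```
# Placeholders: substitute your SSH user and head node address.
scp ~/Downloads/ExostellarClient.pem ~/Downloads/ExostellarRootCA.crt \
    <user>@<head-node>:/tmp/
ssh <user>@<head-node> \
    'sudo mv /tmp/ExostellarClient.pem /tmp/ExostellarRootCA.crt /etc/ssl/certs/'
```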

### Test launching an XWO VM

Connect to the head node and test launching a VM.
The pool, profile, and image name should come from your configuration.
The host name doesn't matter.

```
/opt/slurm/etc/exostellar/test_createVm.sh --pool <pool> --profile <profile> -i <image name> -h <host>
```

When this is done, the VM, worker, and controller should all terminate on their own.
If they do not, connect to the EMS and cancel the job that started the controller.

Use `squeue` to list the controller jobs and `scancel` to terminate them, as sketched below.
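For example, on the EMS; the job ID is a placeholder:

```
squeue          # note the JOBID of the controller job
scancel 12345   # placeholder JOBID: cancel the controller job
```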

### Run a test job using Slurm

```
srun --pty -p xwo-amd-64g-4c hostname
```

## Debug

### UpdateHeadNode resource failed

If the UpdateHeadNode resource fails, it is usually because a task in the ansible script failed.
Connect to the head node and look for errors in:

```/var/log/ansible.log```

Usually it will be a problem with the `/opt/slurm/etc/exostellar/configure_xio.py` script.
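One quick way to locate the failing task; a sketch, since ansible log formats vary with configuration:

```
# Show the first failed/fatal task entries in the ansible log.
grep -n -i -E 'fatal|failed' /var/log/ansible.log | head
```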

When this happens, the CloudFormation stack will usually be in UPDATE_ROLLBACK_FAILED status.
Before you can update it again you will need to complete the rollback.
Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`, check the UpdateHeadNode resource, and click `Continue update rollback`.

### XWO Controller not starting

If a controller doesn't start, the first thing to check is that the `/opt/slurm/exostellar/resume_xspot.sh` script ran successfully on the head node.

`grep resume_xspot.sh /var/log/messages | less`

The script should get "http_code=200". If not, investigate the error.

If the resume script passed, then a controller should have started.

On EMS, check that a job is running to create the controller.

`squeue`

On EMS, check the autoscaling log to see if there are errors starting the instance.

`less /var/log/slurm/autoscaling.log`

EMS Slurm partitions are at:

`/xcompute/slurm/bin/partitions.json`

They are derived from the partition and pool names.

### Worker instance not starting

### VM not starting on worker

### VM not starting Slurm job

mkdocs.yml (+1)

@@ -9,6 +9,7 @@ nav:
 - 'config.md'
 - 'res_integration.md'
 - 'soca_integration.md'
+- 'exostellar-workload-optimizer.md'
 - 'exostellar-infrastructure-optimizer.md'
 - 'custom-amis.md'
 - 'run_jobs.md'
