|
| 1 | +# Exostellar Workload Optimizer (XWO) |
| 2 | + |
| 3 | +[Exostellar Workload Optimizer](https://exostellar.io/product/#workloadoptimizer) (XWO) runs applications in virtual machines (VMs) on EC2 instances and can dynamically migrate the VMs between instances based on actual memory utilization. |
| 4 | +This can provide significant savings when users over-provision memory or provision memory |
| 5 | +for peak usage, because XWO can run the VMs on instances with less memory when the extra memory isn't |
| 6 | +required. |
| 7 | +It can also optionally provide the functionality of [Infrastructure Optimizer](https://exostellar.io/product/#infrastructureoptimizer), which migrates |
| 8 | +VMs between Spot and On-Demand instances based on availability and cost. |
| 9 | + |
| 10 | +XWO runs on an Exostellar Management Server (EMS). |
| 11 | +The EMS runs a web application and launches and manages the instances that run jobs. |
| 12 | +In response to job requests it launches controller nodes that manage pools of worker nodes. |
| 13 | +The controller launches workers and then starts one or more VMs on the workers. |
| 14 | +The controller also determines when VMs need to be migrated, allocates new workers, and manages the VM migrations. |
| 15 | + |
| 16 | +XWO Profiles configure XWO Controllers and XWO Workers. |
| 17 | +The Workers run the XWO VMs. |
| 18 | +The Controller manages the workers and the VMs that run on them. |
| 19 | +The Worker configuration includes the instance types to use for |
| 20 | +on-demand and spot instances. |
| 21 | +It also includes the security groups and tags for the worker instances. |
| 22 | + |
| 23 | +You create an XWO Application Environment for each Slurm cluster. |
| 24 | +The Application Environment contains the URL for the Slurm head node, |
| 25 | +configures pools of VMs, |
| 26 | +and configures the path to the Slurm binaries and configuration. |
| 27 | +The VM pools define the attributes of the instances including the number of CPUs, VM Image, min and max memory, and an associated XWO Profile. |
| 28 | + |
| 29 | +You must also create XWO Images that are used to create the VMs. |
| 30 | +The Images are created from AWS AMIs and are specified in the Pools. |
| 31 | + |
| 32 | +**NOTE:** One current restriction of XWO VMs is that they cannot be created from ParallelCluster AMIs. |
| 33 | +This is because the kernel modules that ParallelCluster installs aren't supported by the XWO hypervisor. |
| 34 | + |
| 35 | +## XWO Configuration |
| 36 | + |
| 37 | +This section describes how to configure XWO to work with ParallelCluster. |
| 38 | + |
| 39 | +Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/HPC-User/getting-started-installation) to make sure you have the latest instructions. |
| 40 | + |
| 41 | +### Deploy ParallelCluster without configuring XWO |
| 42 | + |
| 43 | +First deploy your cluster without configuring XWO. |
| 44 | +The cluster deploys Ansible playbooks that will be used to create the XWO ParallelCluster AMI. |
| 45 | + |
| 46 | +### Deploy the Exostellar Management Server (EMS) |
| 47 | + |
| 48 | +The next step is to [install the Exostellar management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server). |
| 49 | +You must first subscribe to the three Exostellar Infrastructure AMIs in the AWS Marketplace. |
| 50 | + |
| 51 | +* [Exostellar Management Server](https://aws.amazon.com/marketplace/server/procurement?productId=prod-crdnafbqnbnm2) |
| 52 | +* [Exostellar Controller](https://aws.amazon.com/marketplace/server/procurement?productId=prod-d4lifqwlw4kja) |
| 53 | +* [Exostellar Worker](https://aws.amazon.com/marketplace/server/procurement?productId=prod-2smeyk5fuxt7q) |
| 54 | + |
| 55 | +Then follow the [directions to deploy the CloudFormation template](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server#v2.4.0.0InstallingwithCloudFormationTemplate(AWS)-Step3:CreateaNewStack). |
| 56 | + |
| 57 | +### Verify that the "az1" profile exists |
| 58 | + |
| 59 | +In the EMS GUI go to Profiles and make sure that the "az1" profile exists. |
| 60 | +It is used as a template to create your new profiles. |
| 61 | + |
| 62 | +If it doesn't exist, there was a problem with the EMS deployment and you should contact Exostellar support. |
| 63 | + |
| 64 | +### Create an XWO ParallelCluster AMI |
| 65 | + |
| 66 | +Launch an instance using the base AMI for your OS. |
| 67 | +For example, launch an instance with a base RHEL 8 or Rocky 8 AMI. |
| 68 | + |
| 69 | +Mount the ParallelCluster NFS file system at /opt/slurm. |
| 70 | + |
| 71 | +Run the Ansible playbook to configure the instance for XWO. |
| 72 | + |
| 73 | +``` |
| 74 | +/opt/slurm/config/bin/exostellar-compute-node-ami-configure.sh |
| 75 | +``` |
| 76 | + |
| 77 | +Do any additional configuration that you require such as configuring file system mounts and installing |
| 78 | +packages. |
| 79 | + |
| 80 | +Create an AMI from the instance and wait for it to become available. |
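As a rough sketch, the AMI can be created and waited on with the AWS CLI. `INSTANCE_ID`, `AMI_NAME`, and `AMI_ID` are placeholders and the `run` helper is hypothetical; by default the script only prints the commands so that you can review them first.

```shell
#!/bin/sh
# Sketch only: INSTANCE_ID, AMI_NAME, and AMI_ID are placeholders, not values from this guide.
INSTANCE_ID="${INSTANCE_ID:-i-xxxxxxxxxxxxxxxxx}"
AMI_NAME="${AMI_NAME:-xwo-parallelcluster-ami}"
AMI_ID="${AMI_ID:-ami-xxxxxxxxxxxxxxxxx}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Create the AMI from the configured instance; the command prints the new ImageId.
run aws ec2 create-image --instance-id "$INSTANCE_ID" --name "$AMI_NAME"

# Block until the AMI is available (set AMI_ID to the ImageId printed above).
run aws ec2 wait image-available --image-ids "$AMI_ID"
```

When running for real (`DRY_RUN=0`), run the two commands one at a time, because `AMI_ID` is only known after `create-image` returns.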
| 81 | + |
| 82 | +After the AMI has been successfully created you can either stop or terminate the instance to save costs. |
| 83 | +Stop it if you may need to do additional customization later; otherwise, terminate it. |
| 84 | + |
| 85 | +Add the image ID to your configuration as described below. |
| 86 | + |
| 87 | +### Create XWO Configuration |
| 88 | + |
| 89 | +The next step is to plan and configure your XWO deployment. |
| 90 | +The key decisions that you must make are the instance types that you will use |
| 91 | +and the AMI that you will use for the XWO VM Images. |
| 92 | + |
| 93 | +XWO currently only supports x86_64 instance types and pools cannot mix AMD and Intel instance types. |
| 94 | +The configuration has been simplified so that all you have to do is specify the instance types |
| 95 | +and families that you want to use. |
| 96 | +The instance types will be grouped by the number of cores and amount of memory to create |
| 97 | +pools and Slurm partitions. |
| 98 | +You can still create your own profiles and pools if the automatically generated ones do not |
| 99 | +meet your needs. |
| 100 | +The example only shows the simplified configuration. |
| 101 | + |
| 102 | +**NOTE**: XWO currently doesn't support VMs larger than 1 TB. |
| 103 | + |
| 104 | +Refer to [Best practices for Amazon EC2 Spot](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html) when planning your cluster deployment and creating your configuration. |
| 105 | + |
| 106 | +It is highly recommended to use [EC2 Spot placement scores](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/work-with-spot-placement-score.html) when selecting the region and availability zone for your cluster. |
| 107 | +This will give you an indication of the likelihood of getting desired spot capacity. |
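A minimal sketch of querying Spot placement scores with the AWS CLI is shown here. The region, the instance sizes, and the target capacity are illustrative assumptions, and the `run` helper is hypothetical; by default the command is only printed.

```shell
#!/bin/sh
# Sketch only: REGION, TARGET_CAPACITY, and the instance sizes are example values.
REGION="${REGION:-us-east-1}"
TARGET_CAPACITY="${TARGET_CAPACITY:-10}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Request a score per availability zone for the candidate instance types.
run aws ec2 get-spot-placement-scores \
    --region-names "$REGION" \
    --single-availability-zone \
    --instance-types c5a.xlarge m5a.xlarge r5a.xlarge \
    --target-capacity "$TARGET_CAPACITY"
```

Higher scores suggest that the requested Spot capacity is more likely to be available in that availability zone.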
| 108 | + |
| 109 | +In the following example, |
| 110 | +I've included instance families from the last 3 generations of instances to maximize the number of |
| 111 | +available capacity pools and increase the likelihood of running on spot. |
| 112 | + |
| 113 | +**Note**: The Intel instance families contain more configurations and higher memory instances. They also have high frequency instance types such as m5zn, r7iz, and z1d, and they tend to have more capacity. The AMD instance families include HPC instance types; however, the HPC instance types do not support spot pricing and can only be used for on-demand. |
| 114 | + |
| 115 | +**Note**: This is only an example configuration. You should customize it for your requirements. |
| 116 | + |
| 117 | +``` |
| 118 | +slurm: |
| 119 | +  Xwo: |
| 120 | +    ManagementServerStackName: exostellar-management-server |
| 121 | +    PartitionName: xwo |
| 122 | +    AvailabilityZone: us-east-1b |
| 123 | + |
| 124 | +    Images: |
| 125 | +      - ImageId: ami-xxxxxxxxxxxxxxxxx |
| 126 | +        ImageName: <your-xio-vm-image-name> |
| 127 | + |
| 128 | +    DefaultImageName: <your-xio-vm-image-name> |
| 129 | + |
| 130 | +    InstanceTypes: |
| 131 | +      - c5a:1 |
| 132 | +      - m5a:2 |
| 133 | +      - r5a:3 |
| 134 | +    SpotFleetTypes: |
| 135 | +      - c5a:1 |
| 136 | +      - m5a:2 |
| 137 | +      - r5a:3 |
| 138 | + |
| 139 | +    PoolSize: 10 |
| 140 | +    EnableHyperthreading: false |
| 141 | +    VmRootPasswordSecret: ExostellarVmRootPassword |
| 142 | +``` |
| 143 | + |
| 144 | +### Update the cluster with the XWO configuration |
| 145 | + |
| 146 | +Update the cluster with the XWO configuration. |
| 147 | + |
| 148 | +This will update the profiles and environment on the EMS and configure the cluster for XWO. |
| 149 | +The only remaining step before you can submit jobs is to create the XWO VM image. |
| 150 | + |
| 151 | +The cluster update must be done before creating the image because it deploys the XWO scripts. |
| 152 | + |
| 153 | +### Create an XWO Image from the XWO ParallelCluster AMI |
| 154 | + |
| 155 | +Connect to the head node and create the XWO Image from the AMI you created. |
| 156 | +The IMAGE-NAME should match the image name that you configured in the Pools. |
| 157 | + |
| 158 | +``` |
| 159 | +/opt/slurm/etc/exostellar/parse_helper.sh -a <AMI-ID1> -i <IMAGE-NAME> |
| 160 | +``` |
| 161 | + |
| 162 | +### Configure the Exostellar Certificates on the Head Node |
| 163 | + |
| 164 | +The resume script needs an Exostellar certificate authority and client security certificate |
| 165 | +to be able to call the REST API on the EMS. |
| 166 | +Download the certificates and copy them to the head node. |
| 167 | + |
| 168 | +* Open the EMS in a browser. |
| 169 | +* Click on Settings. |
| 170 | +* Select the Certificates tab. |
| 171 | +* Click on **Generate Client Certificate**, name it **ExostellarClient.pem**, and save it in the Downloads folder. |
| 172 | +* Click on **Download Exostellar CA**, name it **ExostellarRootCA.crt**, and save it in the Downloads folder. |
| 173 | +* Copy the two certificates to /etc/ssl/certs/ on the head node. |
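The final copy step might look like the following sketch. `HEAD_NODE` (including the login user) and `CERT_DIR` are assumptions about your environment, and the `run` helper is hypothetical; by default the commands are only printed. File ownership and permissions on the head node are left for you to decide.

```shell
#!/bin/sh
# Sketch only: HEAD_NODE and CERT_DIR are placeholders for your own environment.
HEAD_NODE="${HEAD_NODE:-ec2-user@head-node.example.com}"
CERT_DIR="${CERT_DIR:-$HOME/Downloads}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

for cert in ExostellarClient.pem ExostellarRootCA.crt; do
    # Stage the file in /tmp, then copy it into place with root privileges.
    run scp "$CERT_DIR/$cert" "$HEAD_NODE:/tmp/$cert"
    run ssh "$HEAD_NODE" sudo cp "/tmp/$cert" "/etc/ssl/certs/$cert"
done
```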
| 174 | + |
| 175 | +### Test launching an XWO VM |
| 176 | + |
| 177 | +Connect to the head node and test launching a VM. |
| 178 | +The pool, profile, and image_name should be from your configuration. |
| 179 | +The host name doesn't matter. |
| 180 | + |
| 181 | +``` |
| 182 | +/opt/slurm/etc/exostellar/test_createVm.sh --pool <pool> --profile <profile> -i <image name> -h <host> |
| 183 | +``` |
| 184 | + |
| 185 | +When this is done, the VM, worker, and controller should all terminate on their own. |
| 186 | +If they do not, then connect to the EMS and cancel the job that started the controller. |
| 187 | + |
| 188 | +Use `squeue` to list the controller jobs. Use `scancel` to terminate them. |
| 189 | + |
| 190 | +### Run a test job using Slurm |
| 191 | + |
| 192 | +``` |
| 193 | +srun --pty -p xwo-amd-64g-4c hostname |
| 194 | +``` |
| 195 | + |
| 196 | +## Debug |
| 197 | + |
| 198 | +### UpdateHeadNode resource failed |
| 199 | + |
| 200 | +If the UpdateHeadNode resource fails, it is usually because a task in the Ansible script failed. |
| 201 | +Connect to the head node and look for errors in: |
| 202 | + |
| 203 | +`/var/log/ansible.log` |
| 204 | + |
| 205 | +Usually it will be a problem with the `/opt/slurm/etc/exostellar/configure_xio.py` script. |
| 206 | + |
| 207 | +When this happens the CloudFormation stack will usually be in UPDATE_ROLLBACK_FAILED status. |
| 208 | +Before you can update it again you will need to complete the rollback. |
| 209 | +Go to Stack Actions, select `Continue update rollback`, expand `Advanced troubleshooting`, check the UpdateHeadNode resource, and click `Continue update rollback`. |
| 210 | + |
| 211 | +### XWO Controller not starting |
| 212 | + |
| 213 | +If a controller doesn't start, first check that the |
| 214 | +`/opt/slurm/exostellar/resume_xspot.sh` script ran successfully on the head node. |
| 215 | + |
| 216 | +`grep resume_xspot.sh /var/log/messages | less` |
| 217 | + |
| 218 | +The script should get "http_code=200". If not, investigate the error. |
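To narrow the log down to failed calls, a small filter such as the one below can help. It is a sketch that assumes each relevant log line contains both `resume_xspot.sh` and a status of the form `http_code=<status>`; the function name `failed_resumes` is hypothetical.

```shell
#!/bin/sh
# Assumption: each resume_xspot.sh log line records the REST status as "http_code=<status>".
# Reads log lines on stdin and prints only the resume_xspot.sh lines whose status is not 200.
failed_resumes() {
    grep 'resume_xspot.sh' | grep 'http_code=' | grep -v 'http_code=200' || true
}

# Scan the system log when it is readable; no output means no failures were found.
if [ -r /var/log/messages ]; then
    failed_resumes < /var/log/messages
fi
```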
| 219 | + |
| 220 | +If the resume script passed, then a controller should have started. |
| 221 | + |
| 222 | +On the EMS, check that a job is running to create the controller. |
| 223 | + |
| 224 | +`squeue` |
| 225 | + |
| 226 | +On the EMS, check the autoscaling log to see if there are errors starting the instance. |
| 227 | + |
| 228 | +`less /var/log/slurm/autoscaling.log` |
| 229 | + |
| 230 | +EMS Slurm partitions are at: |
| 231 | + |
| 232 | +`/xcompute/slurm/bin/partitions.json` |
| 233 | + |
| 234 | +They are derived from the partition and pool names. |
| 235 | + |
| 236 | +### Worker instance not starting |
| 237 | + |
| 238 | +### VM not starting on worker |
| 239 | + |
| 240 | +### VM not starting Slurm job |