WEKA-GCP Cluster Toolkit integration

An external module for deploying a WEKA file system with Google Cloud's Cluster Toolkit.

License and Support

This repository is licensed under a 3-Clause BSD open source license, so you can use it to experiment with deploying your own complex high performance computing infrastructure on Google Cloud. Fluid Numerics offers expert support to help you design, deploy, and manage performant and cost-effective infrastructure on Google Cloud for high performance computing and AI/ML workloads. Learn more at https://www.fluidnumerics.com/services or reach out to [email protected].

Overview

WEKA provides a Terraform module for deploying a parallel WEKA filesystem on Google Cloud Platform. This repository provides a clean integration of that module with Google Cloud's Cluster Toolkit. Specifically, we aim to provide a minimal Terraform module for deploying a WEKA filesystem in a dedicated backend architecture. We also provide example Cluster Toolkit deployments that integrate WEKA with Slurm-GCP following WEKA's best practices.

Example

Summary

In this section, we walk through a simple example deployment that is included in this repository.

Important

Before proceeding, you need the command-line tools used in the walkthrough below installed on your workstation: git, the Cluster Toolkit, Terraform, and Packer.

Additionally, you will need:

A download token from get.weka.io
A Google Cloud project with active billing
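
The walkthrough below invokes git, terraform, and packer, along with the Cluster Toolkit binary. As a minimal sketch, a pre-flight check along these lines can confirm the tools are on your PATH before you start. The function name check_tools is hypothetical and not part of this repository.

```shell
# Report any required command-line tools that are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_tools git terraform packer` prints one line per missing tool and returns a non-zero status if any are absent.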

The Cluster Toolkit allows you to define complex architectures for high performance computing and AI/ML applications on Google Cloud in a single "blueprint" file in YAML syntax. This example uses the blueprint defined in aiml-slurmgcp6-weka4.yaml. This blueprint is used to create:

A virtual machine image built on top of the Slurm-GCP Rocky Linux 8 VM image that includes the WEKA agent software and adjustments described in WEKA's Slurm integration guide
Networking infrastructure for VM image baking and cluster deployment
A WEKA parallel filesystem consisting of six c2-standard-8 instances, each equipped with 2x 375 GB NVMe Local SSDs and four NICs
Slurm controller (c2-standard-4) and login node (c2-standard-4) with WEKA filesystem mounted to /home
Heterogeneous Slurm partition with VM instances equipped with A100 (a2-highgpu) and L4 (g2-standard) GPUs configured with Slurm features and additional memory set aside for the WEKA agent
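
As a quick back-of-the-envelope check on the WEKA backend described above, the raw Local SSD capacity follows directly from the instance count and disk layout. This is raw capacity only; the usable capacity WEKA exposes after data protection and overhead is lower and is not computed here.

```shell
# Raw Local SSD capacity of the example WEKA backend.
nodes=6            # c2-standard-8 instances
ssds_per_node=2    # Local SSDs per instance
ssd_size_gb=375    # size of each Local SSD in GB
raw_capacity_gb=$((nodes * ssds_per_node * ssd_size_gb))
echo "Raw Local SSD capacity: ${raw_capacity_gb} GB"   # 6 * 2 * 375 = 4500
```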

Note that in this deployment, all Slurm instances have a single NIC and mount WEKA using UDP mode. If you would like to work with DPDK mounts and would like assistance, please open an issue.

Walkthrough

Clone this repository and navigate to the example/ directory.

git clone https://github.com/FluidNumerics/weka-gcp-hpc-toolkit ~/weka-gcp-hpc-toolkit
cd ~/weka-gcp-hpc-toolkit/example

Edit the provided aiml-slurmgcp6-weka4.yaml blueprint file to specify the project_id and get_weka_io_token. The project_id is the ID of the Google Cloud project you wish to deploy your cluster to. The get_weka_io_token is your download token for the WEKA software, obtained from get.weka.io. You may also change the region and zone, but this is not required.
Use the Google Cloud Cluster Toolkit to create the Terraform infrastructure-as-code. This creates a subdirectory called aiml-slurm6-weka4 that houses the Packer files for creating the VM image and the Terraform infrastructure-as-code for all other resources. The subdirectory also contains a set of instructions, aiml-slurm6-weka4/instructions.txt, that provides an advanced set of steps for manually deploying the infrastructure.

Note

The binary for the Cluster Toolkit may be called gcluster (newest), ghpc, or hpc-toolkit, depending on the version of the Cluster Toolkit you are using.
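
If you script these steps, one way to cope with the different binary names is to probe for each in turn, newest first. The sketch below assumes the binary is on your PATH; the function name find_toolkit_binary is hypothetical.

```shell
# Print the name of the first Cluster Toolkit binary found on PATH,
# preferring the newest name. Returns 1 if none is found.
find_toolkit_binary() {
  for candidate in gcluster ghpc hpc-toolkit; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}
```

You could then store the result once, for example `TOOLKIT=$(find_toolkit_binary)`, and substitute "$TOOLKIT" wherever the commands in this walkthrough say gcluster.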

gcluster create aiml-slurmgcp6-weka4.yaml
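
As an alternative to editing the blueprint by hand, the create command accepts a --vars flag that overrides deployment variables. The sketch below assumes that project_id and get_weka_io_token are both defined as deployment variables in the blueprint's vars block, and that the values are held in environment variables named PROJECT_ID and WEKA_TOKEN. Those two environment variable names and the create_deployment function are hypothetical.

```shell
# Generate the deployment folder, overriding two blueprint variables.
# Assumes PROJECT_ID and WEKA_TOKEN (hypothetical names) are set in the
# environment; the shell aborts with a message if either is missing.
create_deployment() {
  blueprint=$1
  : "${PROJECT_ID:?set PROJECT_ID to your Google Cloud project ID}"
  : "${WEKA_TOKEN:?set WEKA_TOKEN to your get.weka.io download token}"
  gcluster create "$blueprint" \
    --vars "project_id=${PROJECT_ID}" \
    --vars "get_weka_io_token=${WEKA_TOKEN}"
}
```

Calling `create_deployment aiml-slurmgcp6-weka4.yaml` keeps the download token out of the blueprint file, which may be preferable if the file is kept under version control.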

Deploy the primary infrastructure that is needed to support the VM image baking process.

terraform -chdir=aiml-slurm6-weka4/primary init
terraform -chdir=aiml-slurm6-weka4/primary validate
terraform -chdir=aiml-slurm6-weka4/primary apply
gcluster export-outputs aiml-slurm6-weka4/primary

Create the VM image that will be used for your Slurm-GCP instances with the WEKA agent pre-installed.

gcluster import-inputs aiml-slurm6-weka4/packer
cd aiml-slurm6-weka4/packer/weka-enabled-image
packer init .
packer validate .
packer build .
cd -

Deploy the WEKA filesystem and Slurm-GCP cluster.

gcluster import-inputs aiml-slurm6-weka4/cluster
terraform -chdir=aiml-slurm6-weka4/cluster init
terraform -chdir=aiml-slurm6-weka4/cluster validate
terraform -chdir=aiml-slurm6-weka4/cluster apply

Once complete, you will have a WEKA filesystem and autoscaling Slurm-GCP cluster in your Google Cloud project.
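
The primary and cluster groups both use the same init, validate, apply sequence. If you repeat this often, a small wrapper can run the three Terraform steps and stop at the first failure. This is a sketch only; the function name deploy_group is hypothetical, and the gcluster export-outputs and import-inputs calls still need to be run between groups as shown above.

```shell
# Run the init, validate, apply sequence for one deployment group.
# The && chain stops at the first failing step, so apply never runs
# against a configuration that failed init or validate.
deploy_group() {
  group_dir=$1
  terraform -chdir="$group_dir" init &&
    terraform -chdir="$group_dir" validate &&
    terraform -chdir="$group_dir" apply
}
```

For example, `deploy_group aiml-slurm6-weka4/primary` is equivalent to the three terraform commands in the primary step.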

Destroying resources

When you no longer need your resources, you can use the gcluster CLI to delete all infrastructure:

cd ~/weka-gcp-hpc-toolkit/example
gcluster destroy aiml-slurm6-weka4

If, instead, you prefer to destroy resources manually, keep in mind that all infrastructure should be destroyed in reverse order of creation:

terraform -chdir=aiml-slurm6-weka4/cluster destroy
terraform -chdir=aiml-slurm6-weka4/primary destroy
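
The reverse-order rule can be sketched as a helper that takes the Terraform groups in the order they were created and destroys them last to first. The function name destroy_in_reverse is hypothetical, and the sketch assumes group names contain no whitespace.

```shell
# Destroy Terraform deployment groups in reverse order of creation.
# Usage: destroy_in_reverse <deployment_dir> <group>... (creation order)
destroy_in_reverse() {
  deployment_dir=$1
  shift
  reversed=""
  for group in "$@"; do
    reversed="$group $reversed"   # prepend, so the last group comes first
  done
  for group in $reversed; do
    terraform -chdir="$deployment_dir/$group" destroy || return 1
  done
}
```

Here `destroy_in_reverse aiml-slurm6-weka4 primary cluster` destroys the cluster group first and the primary group second, and stops if a destroy fails so that later groups are left untouched.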
