Skip to content

FluidNumerics/weka-gcp-cluster-toolkit

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

WEKA-GCP Cluster Toolkit integration

An external module for deploying a WEKA file system with Google Cloud's Cluster toolkit.

License and Support

This repository is licensed for use under a 3-Clause BSD Open Source license so that you can use this resource to experiment with deploying your own complex high performance computing infrastructure on Google Cloud. Fluid Numerics offers expert support to help you design, deploy, and manage performant and cost-effective infrastructure on Google Cloud to support high performance computing and AI/ML workloads. Learn more at https://www.fluidnumerics.com/services or reach out to [email protected] .

Overview

WEKA provides a terraform module for deploying a parallel WEKA filesystem on Google Cloud Platform. This repository is meant to provide a clean integration with Google Cloud's Cluster Toolkit. Specifically, we aim to provide a minimal terraform module for deploying a WEKA filesystem in a dedicated backend architecture. Additionally, we provide example cluster toolkit deployments that integrate WEKA with Slurm-GCP following WEKA's best practices.

Example

Summary

In this section, we walk through a simple example deployment that is included in this repository

Important

Before proceeding, you need to have the following components installed on your workstation:

Additionally, you will need :

The Cluster Toolkit allows you to define complex architecture for high performance computing and AI/ML applications on Google Cloud in a single "blueprint" file in YAML syntax. This example uses the bluprint defined in aiml-slurmgcp6-weka4.yaml. This blueprint is used to create

  • A virtual machine image built on top of the Slurm-GCP Rocky Linux 8 VM image that includes the WEKA agent software and adjustments described in WEKA's Slurm integration guide
  • Networking infrastructure for VM image baking and cluster deployment
  • A WEKA parallel filesystem consisting of six c2-standard-8 instances with each equipped with 2x 375GB NVME Local SSD's and four NIC's.
  • Slurm controller (c2-standard-4) and login node (c2-standard-4) with WEKA filesystem mounted to /home
  • Heterogeneous Slurm partition with VM instances equipped with A100 (a2-highgpu) and L4 (g2-standard) GPUs configured with Slurm features and additional memory set aside for the WEKA agent

Note that in this deployment, all Slurm instances have a single NIC and mount WEKA using UDP mode. If you would like to work with DPDK mounts and would like assistance, please open an issue.

Walkthrough

  1. Clone this repository and navigate to the example/ directory
git clone https://github.com/FluidNumerics/weka-gcp-hpc-toolkit ~/weka-gcp-hpc-toolkit
cd ~/weka-gcp-hpc-toolkit/example
  1. Edit the provided aiml-slurmgcp6-weka4.yaml blueprint file to specify the project_id and get_weka_io_token. The project_id is the Google Cloud project ID you wish to deploy your cluster to. The get_weka_io_token is your download token for the WEKA software obtained from get.weka.io. You may also wish to change the region and zone, but it is not required.

  2. Use the Google Cloud Cluster toolkit to create the terraform infrastructure-as-code. This will create a subdirectory called aiml-slurm6-weka4 that houses the Packer files for creating the VM image and Terraform infrastructure-as-code for all of the other resources. This subdirectory will also contain a set of instructions aiml-slurm6-weka4/instructions.txt that provide an advanced set of steps for manually deploying the infrastructure.

Note

The binary for the cluster toolkit may be called gcluster (newest), ghpc, or hpc-toolkit, depending on the version of the cluster toolkit you are using.

gcluster create aiml-slurmgcp6-weka4.yaml
  1. Deploy the primary infrastructure that is needed to support the VM image baking process.
terraform -chdir=aiml-slurm6-weka4/primary init
terraform -chdir=aiml-slurm6-weka4/primary validate
terraform -chdir=aiml-slurm6-weka4/primary apply
gcluster export-outputs aiml-slurm6-weka4/primary
  1. Create the VM image that will be used for your Slurm-GCP instances with the WEKA agent pre-installed.
gcluster import-inputs aiml-slurm6-weka4/packer
cd aiml-slurm6-weka4/packer/weka-enabled-image
packer init .
packer validate .
packer build .
cd -
  1. Deploy the WEKA filesystem and Slurm-GCP cluster
gcluster import-inputs aiml-slurm6-weka4/cluster
terraform -chdir=aiml-slurm6-weka4/cluster init
terraform -chdir=aiml-slurm6-weka4/cluster validate
terraform -chdir=aiml-slurm6-weka4/cluster apply

Once complete, you will have a WEKA filesystem and autoscaling Slurm-GCP cluster in your Google Cloud project.

Destroying resources

When you no longer need your resources, you can use the gcluster cli to delete all infrastructure

cd ~/weka-gcp-hpc-toolkit/example
gcluster destroy aiml-slurm6-weka4

If, instead, you prefer to destroy resources manually, keep in mind that all infrastructure should be destroyed in reverse order of creation:

terraform -chdir=aiml-slurm6-weka4/cluster destroy
terraform -chdir=aiml-slurm6-weka4/primary destroy

Further Reading

About

Deploy WEKA filesystems with Google Cloud HPC toolkit

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published