wandb/terraform-aws-wandb

Weights & Biases AWS Module

IMPORTANT: You are viewing a beta version of the official module to install Weights & Biases. This new version is incompatible with earlier versions, and it is not currently meant for production use. Please contact your Customer Success Manager for details before using.

This is a Terraform module for provisioning a Weights & Biases Cluster on AWS. Weights & Biases Local is our self-hosted distribution of wandb.ai. It offers enterprises a private instance of the Weights & Biases application, with no resource limits and with additional enterprise-grade architectural features like audit logging and SAML single sign-on.

About This Module

Pre-requisites

This module is intended to run in an AWS account with minimal preparation; however, it has the following pre-requisites:

Terraform version >= 1.9

Credentials / Permissions

AWS Services Used

  • AWS Identity & Access Management (IAM)
  • AWS Key Management Service (KMS)
  • Amazon Aurora MySQL (RDS)
  • Amazon VPC
  • Amazon S3
  • Amazon Route 53
  • AWS Certificate Manager (ACM)
  • Elastic Load Balancing (ALB)
  • AWS Secrets Manager

Public Hosted Zone

If you are managing DNS via AWS Route53 the hosted zone entry is created automatically as part of your domain management.

If you're managing DNS outside of Route53, you will need to:

  1. Create a Route53 hosted zone named {subdomain}.{domain} (e.g., test.wandb.ai)
  2. Create an NS record in your parent DNS system and point it to the name servers of the newly created Route53 zone
  3. Enable the external_dns option in this module

You will need to create a hosted zone for the subdomain you plan to use for your Weights & Biases installation; the AWS documentation on creating a hosted zone for a subdomain covers the details. To create this hosted zone with Terraform, use the aws_route53_zone resource.
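
As a minimal sketch of creating the hosted zone with Terraform, the following uses the aws_route53_zone resource with the example subdomain from the steps above. The resource label and output names are illustrative, and passing the zone ID to the module's zone_id input is an assumption based on the Inputs table.

```hcl
# Hosted zone for the subdomain that will serve Weights & Biases.
# "test.wandb.ai" is the example name from the steps above; use your own.
resource "aws_route53_zone" "wandb" {
  name = "test.wandb.ai"
}

# Name servers to copy into the NS record in the parent DNS system.
output "wandb_zone_name_servers" {
  value = aws_route53_zone.wandb.name_servers
}

# Zone ID, which can then be supplied to the module (assumed: zone_id input).
output "wandb_zone_id" {
  value = aws_route53_zone.wandb.zone_id
}
```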

ACM Certificate

While this is not required, it is recommended to have an existing ACM certificate. Certificate validation can take up to two hours, which can cause timeouts during module apply if the certificate is generated as one of the resources contained in the module.
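
One way to reuse an existing certificate, sketched under the assumption that a certificate for your domain has already been issued in the same region, is to look it up with the aws_acm_certificate data source and pass its ARN through the acm_certificate_arn input. The domain value is a placeholder.

```hcl
# Look up an already-issued ACM certificate by domain name.
data "aws_acm_certificate" "wandb" {
  domain      = "test.wandb.ai" # placeholder; use your own domain
  statuses    = ["ISSUED"]
  most_recent = true
}

module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming AWS resources>"

  # Reuse the existing certificate instead of waiting on validation.
  acm_certificate_arn = data.aws_acm_certificate.wandb.arn

  # ... remaining required inputs omitted for brevity ...
}
```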

How to Use This Module

  • Ensure your AWS account meets the module pre-requisites listed above.

  • Please note that while some resources are individually and uniquely tagged, all common tags are expected to be configured within the AWS provider as shown in the example code snippet below.

  • Create a Terraform configuration that pulls in this module and specifies values of the required variables:

provider "aws" {
  region = "<your AWS region>"
  default_tags {
    tags = var.common_tags
  }
}

module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming AWS resources>"
}
  • Run terraform init and terraform apply
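
The Inputs table further down marks several more variables as required. An expanded version of the module block might look roughly like the sketch below; every value is a placeholder, and the variable declaration for common_tags is an assumption added so that the provider block above resolves.

```hcl
# Declares the tags referenced by the provider's default_tags block.
variable "common_tags" {
  type    = map(string)
  default = {}
}

module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming AWS resources>"

  # Inputs listed as required in the Inputs table.
  license             = "<your license key>"
  domain_name         = "<your domain>"
  zone_id             = "<your Route53 hosted zone ID>"
  eks_cluster_version = "<EKS Kubernetes version>"

  # Wide-open placeholders; restrict these to the networks that need access.
  allowed_inbound_cidr      = ["0.0.0.0/0"]
  allowed_inbound_ipv6_cidr = ["::/0"]
}
```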

Cluster Sizing

By default, the Kubernetes instance types, number of instances, Redis cluster size, and database instance size are standardized via the configurations in ./deployment-size.tf and are selected via the size input variable.

Available sizes are small, medium, large, xlarge, and xxlarge. The default is small.

All the values set via deployment-size.tf can be overridden by setting the appropriate input variables.

  • kubernetes_instance_types - The instance type for the EKS nodes
  • kubernetes_min_nodes_per_az - The minimum number of nodes in each AZ for the EKS cluster
  • kubernetes_max_nodes_per_az - The maximum number of nodes in each AZ for the EKS cluster
  • elasticache_node_type - The instance type for the redis cluster
  • database_instance_class - The instance type for the database
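
As a hedged illustration of combining a size preset with a partial override, the sketch below picks the medium preset and overrides only the database class. The instance class shown is an arbitrary example, not a recommendation.

```hcl
module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming AWS resources>"

  # Choose one of the standardized presets.
  size = "medium"

  # Override a single value; everything else still comes from the preset.
  database_instance_class = "db.r5.xlarge" # example value only

  # ... remaining required inputs omitted for brevity ...
}
```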

Examples

We have included documentation and reference examples for additional common installation scenarios for Weights & Biases, as well as examples for supporting resources that lack official modules.

A note on updating EKS cluster version

Users can update the EKS cluster version to the latest version offered by AWS. This can be done using the input variable eks_cluster_version. Note that cluster and nodegroup version updates can only be done in increments of one version at a time. For example, if your current cluster version is 1.21 and the latest version available is 1.25, you'd need to:

  1. update the cluster version in the app_eks module from 1.21 to 1.22
  2. run terraform apply
  3. update the cluster version to 1.23
  4. run terraform apply
  5. update the cluster version to 1.24 ...and so on and so forth.

Upgrades must be executed in step-wise fashion from one version to the next. You cannot skip versions when upgrading EKS.
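
A single step of that sequence could look like the sketch below, assuming the version is set through the eks_cluster_version input on the module block. After each apply completes, bump the value by one more minor version and apply again.

```hcl
module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming AWS resources>"

  # Previously "1.21". Raise by exactly one minor version per apply.
  eks_cluster_version = "1.22"

  # ... remaining required inputs omitted for brevity ...
}
```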

Requirements

Name Version
terraform ~> 1.0
aws ~> 4.0
kubernetes ~> 2.23
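
For reference, these constraints could be expressed in a root configuration roughly as follows. This is a sketch mirroring the table above; the provider source addresses are the standard HashiCorp ones and are an assumption about how the module resolves them.

```hcl
terraform {
  required_version = "~> 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.23"
    }
  }
}
```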

Providers

Name Version
aws 4.67.0

Modules

Name Source Version
acm terraform-aws-modules/acm/aws ~> 3.0
app_eks ./modules/app_eks n/a
app_lb ./modules/app_lb n/a
database ./modules/database n/a
file_storage ./modules/file_storage n/a
iam_role ./modules/iam_role n/a
kms ./modules/kms n/a
networking ./modules/networking n/a
private_link ./modules/private_link n/a
redis ./modules/redis n/a
s3_endpoint ./modules/endpoint n/a
wandb wandb/wandb/helm 1.2.0

Resources

Name Type
aws_region.current data source
aws_s3_bucket.file_storage data source
aws_sqs_queue.file_storage data source

Inputs

Name Description Type Default Required
acm_certificate_arn The ARN of an existing ACM certificate. string null no
allowed_inbound_cidr CIDRs allowed to access wandb-server. list(string) n/a yes
allowed_inbound_ipv6_cidr IPv6 CIDRs allowed to access wandb-server. list(string) n/a yes
allowed_private_endpoint_cidr Private CIDRs allowed to access wandb-server. list(string) [] no
app_wandb_env Extra environment variables for W&B map(string) {} no
aws_loadbalancer_controller_tags (Optional) A map of AWS tags to apply to all resources managed by the load balancer controller map(string) {} no
bucket_kms_key_arn n/a string "" no
bucket_name n/a string "" no
bucket_path path of where to store data for the instance-level bucket string "" no
clickhouse_endpoint_service_id The service ID of the VPC endpoint service for Clickhouse string "" no
controller_image_tag Tag of the controller image to deploy string "1.14.0" no
create_bucket External bucket setting. Most users will not need these settings. They are meant for users who want a bucket and SQS queue that are in a different account. bool true no
create_elasticache Boolean indicating whether to provision an elasticache instance (true) or not (false). bool true no
create_vpc Boolean indicating whether to deploy a VPC (true) or not (false). bool true no
custom_domain_filter A custom domain filter to be used by external-dns instead of the default FQDN. If not set, the local FQDN is used. string null no
database_binlog_format Specifies the binlog_format value to set for the database string "ROW" no
database_engine_version Version for MySQL Aurora string "8.0.mysql_aurora.3.07.1" no
database_innodb_lru_scan_depth Specifies the innodb_lru_scan_depth value to set for the database number 128 no
database_instance_class Instance type to use by database master instance. string "db.r5.large" no
database_kms_key_arn n/a string "" no
database_master_username Specifies the master_username value to set for the database string "wandb" no
database_name Specifies the name of the database string "wandb_local" no
database_performance_insights_kms_key_arn Specifies an existing KMS key ARN to encrypt the performance insights data if performance_insights_enabled was enabled out of band string "" no
database_snapshot_identifier Specifies whether or not to create this cluster from a snapshot. You can use either the name or ARN when specifying a DB cluster snapshot, or the ARN when specifying a DB snapshot string null no
database_sort_buffer_size Specifies the sort_buffer_size value to set for the database number 67108864 no
deletion_protection If the instance should have deletion protection enabled. The database / S3 can't be deleted when this value is set to true. bool true no
domain_name Domain for accessing the Weights & Biases UI. string n/a yes
eks_cluster_version EKS cluster kubernetes version string n/a yes
eks_policy_arns Additional IAM policy to apply to the EKS cluster list(string) [] no
elasticache_node_type The type of the redis cache node to deploy string "cache.t2.medium" no
enable_clickhouse Provision clickhouse resources bool false no
enable_yace Deploy Yet Another CloudWatch Exporter (YACE) to fetch AWS resource metrics bool true no
external_dns Using external DNS. A subdomain must also be specified if this value is true. bool false no
extra_fqdn Additional FQDNs. These must be in the same hosted zone as domain_name. list(string) [] no
kms_clickhouse_key_alias KMS key alias for AWS KMS Customer managed key used by Clickhouse CMEK. string null no
kms_clickhouse_key_policy The policy that will define the permissions for the clickhouse kms key. string "" no
kms_key_alias KMS key alias for AWS KMS Customer managed key. string null no
kms_key_deletion_window Duration in days to destroy the key after it is deleted. Must be between 7 and 30 days. number 7 no
kms_key_policy The policy that will define the permissions for the kms key. string "" no
kms_key_policy_administrator_arn The principal that will be allowed to manage the kms key. string "" no
kubernetes_alb_internet_facing Indicates whether or not the ALB controlled by the Amazon ALB ingress controller is internet-facing or internal. bool true no
kubernetes_alb_subnets List of subnet ID's the ALB will use for ingress traffic. list(string) [] no
kubernetes_instance_types EC2 Instance type for primary node group. list(string) ["m5.large"] no
kubernetes_map_accounts Additional AWS account numbers to add to the aws-auth configmap. list(string) [] no
kubernetes_map_roles Additional IAM roles to add to the aws-auth configmap. list(object({ rolearn = string, username = string, groups = list(string) })) [] no
kubernetes_map_users Additional IAM users to add to the aws-auth configmap. list(object({ userarn = string, username = string, groups = list(string) })) [] no
kubernetes_node_count Number of nodes number 2 no
kubernetes_public_access Indicates whether or not the Amazon EKS public API server endpoint is enabled. bool false no
kubernetes_public_access_cidrs List of CIDR blocks which can access the Amazon EKS public API server endpoint. list(string) [] no
license Weights & Biases license key. string n/a yes
namespace String used for prefix resources. string n/a yes
network_cidr CIDR block for VPC. string "10.10.0.0/16" no
network_database_subnet_cidrs List of private subnet CIDR ranges to create in VPC. list(string) ["10.10.20.0/24", "10.10.21.0/24"] no
network_database_subnets A list of the identities of the database subnetworks in which resources will be deployed. list(string) [] no
network_elasticache_subnet_cidrs List of private subnet CIDR ranges to create in VPC. list(string) ["10.10.30.0/24", "10.10.31.0/24"] no
network_elasticache_subnets A list of the identities of the subnetworks in which elasticache resources will be deployed. list(string) [] no
network_id The identity of the VPC in which resources will be deployed. string "" no
network_private_subnet_cidrs List of private subnet CIDR ranges to create in VPC. list(string) ["10.10.10.0/24", "10.10.11.0/24"] no
network_private_subnets A list of the identities of the private subnetworks in which resources will be deployed. list(string) [] no
network_public_subnet_cidrs List of public subnet CIDR ranges to create in VPC. list(string) ["10.10.0.0/24", "10.10.1.0/24"] no
network_public_subnets A list of the identities of the public subnetworks in which resources will be deployed. list(string) [] no
operator_chart_version Version of the operator chart to deploy string "1.3.4" no
other_wandb_env Extra environment variables for W&B map(any) {} no
parquet_wandb_env Extra environment variables for W&B map(string) {} no
private_link_allowed_account_ids List of AWS account IDs allowed to access the VPC Endpoint Service list(string) [] no
private_only_traffic Enable private only traffic from customer private network bool false no
public_access Whether this instance is accessible from a public domain. bool false no
size Deployment size string null no
ssl_policy SSL policy to use on ALB listener string "ELBSecurityPolicy-FS-1-2-Res-2020-10" no
subdomain Subdomain for accessing the Weights & Biases UI. Default creates record at Route53 Route. string null no
system_reserved_cpu_millicores (Optional) The amount of 'system-reserved' CPU millicores to pass to the kubelet. For example: 100. A value of -1 disables the flag. number 70 no
system_reserved_ephemeral_megabytes (Optional) The amount of 'system-reserved' ephemeral storage in megabytes to pass to the kubelet. For example: 1000. A value of -1 disables the flag. number 750 no
system_reserved_memory_megabytes (Optional) The amount of 'system-reserved' memory in megabytes to pass to the kubelet. For example: 100. A value of -1 disables the flag. number 100 no
system_reserved_pid (Optional) The amount of 'system-reserved' process ids [pid] to pass to the kubelet. For example: 1000. A value of -1 disables the flag. number 500 no
use_internal_queue n/a bool false no
weave_wandb_env Extra environment variables for W&B map(string) {} no
yace_sa_name n/a string "wandb-yace" no
zone_id Domain for creating the Weights & Biases subdomain on. string n/a yes

Outputs

Name Description
bucket_name n/a
bucket_path n/a
bucket_queue_name n/a
bucket_region n/a
cluster_id n/a
cluster_node_role n/a
database_connection_string n/a
database_instance_type n/a
database_password n/a
database_username n/a
eks_node_count n/a
eks_node_instance_type n/a
elasticache_connection_string n/a
kms_clickhouse_key_arn The Amazon Resource Name of the KMS key used to encrypt Weave data at rest in Clickhouse.
kms_key_arn The Amazon Resource Name of the KMS key used to encrypt data at rest.
network_id The identity of the VPC in which resources are deployed.
network_private_subnets The identities of the private subnetworks deployed within the VPC.
network_public_subnets The identities of the public subnetworks deployed within the VPC.
redis_instance_type n/a
standardized_size n/a
url The URL to the W&B application

Migrations

Upgrading to Operator

See our upgrade guide here

Upgrading from 4.x -> 5.x

5.0.0 introduced autoscaling to the EKS cluster and made the size variable the preferred way to set the cluster size. Previously, unless the size variable was set explicitly, there were default values for the following variables:

  • kubernetes_instance_types
  • kubernetes_node_count
  • elasticache_node_type
  • database_instance_class

The size variable now defaults to small, and the following variables can be used to partially override the values set by the size variable:

  • kubernetes_instance_types
  • kubernetes_min_nodes_per_az
  • kubernetes_max_nodes_per_az
  • elasticache_node_type
  • database_instance_class

For more information on the available sizes, see the Cluster Sizing section.

If having the cluster scale nodes in and out is not desired, the kubernetes_min_nodes_per_az and kubernetes_max_nodes_per_az can be set to the same value to prevent the cluster from scaling.
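
A minimal sketch of pinning the node count this way, assuming both inputs take a plain number; the value 2 is an arbitrary example.

```hcl
module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming AWS resources>"

  # Equal minimum and maximum leave the autoscaler no room to add or remove nodes.
  kubernetes_min_nodes_per_az = 2
  kubernetes_max_nodes_per_az = 2

  # ... remaining required inputs omitted for brevity ...
}
```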

This upgrade is also intended to be used when upgrading EKS to 1.29.

We have upgraded the following dependencies and Kubernetes addons:

  • MySQL Aurora (8.0.mysql_aurora.3.07.1)
  • redis (7.1)
  • external-dns helm chart (v1.15.0)
  • aws-efs-csi-driver (v2.0.7-eksbuild.1)
  • aws-ebs-csi-driver (v1.35.0-eksbuild.1)
  • coredns (v1.11.3-eksbuild.1)
  • kube-proxy (v1.29.7-eksbuild.9)
  • vpc-cni (v1.18.3-eksbuild.3)

⚠️ Please remove the enable_dummy_dns and enable_operator_alb variables as they are no longer valid flags. They were provided to support older versions of the module that relied on an ALB not created by the ingress controller.

Upgrading from 3.x -> 4.x

  • If egress access for retrieving the wandb/controller image is not available, Terraform apply may experience failures.
  • It's necessary to supply a license variable within the module, as shown:
module "wandb" {
  version = "4.x"

  # ...
  license = "<your license key>"
  # ...
}

Allow customer-specific customer-managed keys for S3 and RDS

  • You can provide an external KMS key to encrypt the database, Redis, and S3 buckets.
  • To provide KMS keys, set the KMS key ARN values in:
database_kms_key_arn
bucket_kms_key_arn

To use cross-account KMS keys, the keys must be accessible to the WandB account.

This can be done by adding the following statement to the key policy.

{
  "Sid": "Allow use of the key",
  "Effect": "Allow",
  "Principal": {
    "AWS": [
      "arn:aws:iam::<Account_id>:root"
    ]
  },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey",
    "kms:CreateGrant"
  ],
  "Resource": "*"
}
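
If the key policy is managed in Terraform, the same statement could be written with the aws_iam_policy_document data source, as sketched below. The data source label is illustrative, the ARN values are placeholders, and a complete key policy would normally also contain statements for key administration.

```hcl
# HCL equivalent of the JSON statement above.
data "aws_iam_policy_document" "allow_wandb_account" {
  statement {
    sid    = "Allow use of the key"
    effect = "Allow"

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::<Account_id>:root"]
    }

    actions = [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:ReEncrypt*",
      "kms:GenerateDataKey*",
      "kms:DescribeKey",
      "kms:CreateGrant",
    ]

    resources = ["*"]
  }
}

module "wandb" {
  source    = "<filepath to cloned module directory>"
  namespace = "<prefix for naming AWS resources>"

  # Externally managed keys for the database and the bucket.
  database_kms_key_arn = "<ARN of the KMS key for the database>"
  bucket_kms_key_arn   = "<ARN of the KMS key for the bucket>"

  # ... remaining required inputs omitted for brevity ...
}
```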

Upgrading from 6.x -> 7.x

v7 changes how the module references storage: instead of using Terraform's count, it always creates a "defaultBucket", which can be overridden later or by providing an initial bucket.

We consider this a major change because of the Terraform moved block that migrates the resource. After moving to v7, applying an earlier version of the module may result in Terraform deleting your bucket.

The create_bucket variable was removed as a result of this change.
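
For readers unfamiliar with moved blocks, the general shape of a migration from a counted resource to an always-created one is sketched below. The resource addresses are hypothetical and are not the module's actual names.

```hcl
# Hypothetical addresses for illustration only.
# Tells Terraform that the resource formerly tracked at index 0 of a counted
# resource is the same object as the new, uncounted resource, so the state
# entry is renamed instead of the bucket being destroyed and recreated.
moved {
  from = aws_s3_bucket.example[0]
  to   = aws_s3_bucket.example
}
```

Older module versions know nothing about the new address, which is why applying one after the move can lead Terraform to plan a deletion.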

Upgrading from 2.x -> 3.x

  • No changes required by you

Upgrading from 1.x -> 2.x

  • AWS provider version ~> 4.0 is required