From bfac441c57870c8c8c9b99d10950d87570d2b609 Mon Sep 17 00:00:00 2001 From: Konrad Kaim Date: Thu, 9 Oct 2025 14:22:31 +0000 Subject: [PATCH 1/9] docs: refactor README.md --- README.md | 1591 +------------------------------- docs/installation.md | 89 ++ docs/local_testing.md | 46 + docs/permissions.md | 12 + docs/troubleshooting.md | 149 +++ docs/usage/advanced.md | 21 + docs/usage/autoprovisioning.md | 173 ++++ docs/usage/clusters.md | 261 ++++++ docs/usage/cpu.md | 30 + docs/usage/docker.md | 56 ++ docs/usage/gpu.md | 104 +++ docs/usage/inspector.md | 42 + docs/usage/job.md | 26 + docs/usage/run.md | 29 + docs/usage/storage.md | 175 ++++ docs/usage/workloads.md | 252 +++++ src/xpk/README.md | 10 - 17 files changed, 1500 insertions(+), 1566 deletions(-) create mode 100644 docs/installation.md create mode 100644 docs/local_testing.md create mode 100644 docs/permissions.md create mode 100644 docs/troubleshooting.md create mode 100644 docs/usage/advanced.md create mode 100644 docs/usage/autoprovisioning.md create mode 100644 docs/usage/clusters.md create mode 100644 docs/usage/cpu.md create mode 100644 docs/usage/docker.md create mode 100644 docs/usage/gpu.md create mode 100644 docs/usage/inspector.md create mode 100644 docs/usage/job.md create mode 100644 docs/usage/run.md create mode 100644 docs/usage/storage.md create mode 100644 docs/usage/workloads.md delete mode 100644 src/xpk/README.md diff --git a/README.md b/README.md index 6a90e8809..d68924ee9 100644 --- a/README.md +++ b/README.md @@ -46,1564 +46,43 @@ XPK supports the following TPU types: and the following GPU types: * A100 * A3-Highgpu (h100) -* A3-Mega (h100-mega) - [Create cluster](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) -* A3-Ultra (h200) - [Create cluster](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) -* A4 (b200) - [Create cluster](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) +* A3-Mega (h100-mega) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) +* A3-Ultra (h200) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) +* A4 (b200) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) * A4X (gb200) and the following CPU types: * n2-standard-32 -XPK also supports [Google Cloud Storage solutions](#storage): -* [Cloud Storage FUSE](#fuse) -* [Filestore](#filestore) -* [Parallelstore](#parallelstore) -* [Block storage (Persistent Disk, Hyperdisk)](#block-storage-persistent-disk-hyperdisk) - -# Permissions needed on Cloud Console: - -* Artifact Registry Writer -* Compute Admin -* Kubernetes Engine Admin -* Logging Admin -* Monitoring Admin -* Service Account User -* Storage Admin -* Vertex AI Administrator -* Filestore Editor (This role is neccessary if you want to run `storage create` command with `--type=gcpfilestore`) - -# Installation - -There are 2 ways to install 
XPK: - -- via Python package installer (`pip`), -- clone from git and build from source. - -## Prerequisites - -The following tools must be installed: - -- python >= 3.10: download from [here](https://www.python.org/downloads/) -- pip: [installation instructions](https://pip.pypa.io/en/stable/installation/) -- python venv: [installation instructions](https://virtualenv.pypa.io/en/latest/installation.html) -(all three of above can be installed at once from [here](https://packaging.python.org/en/latest/guides/installing-using-linux-tools/#installing-pip-setuptools-wheel-with-linux-package-managers)) -- gcloud: install from [here](https://cloud.google.com/sdk/gcloud#download_and_install_the) and then: - - Run `gcloud init` - - [Authenticate](https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login) to Google Cloud -- kubectl: install from [here](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_kubectl) and then: - - Install `gke-gcloud-auth-plugin` from [here](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin) -- docker: [installation instructions](https://docs.docker.com/engine/install/) and then: - - Configure sudoless docker: [guide](https://docs.docker.com/engine/install/linux-postinstall/) - - Run `gcloud auth configure-docker` to ensure images can be uploaded to registry - -### Additional prerequisites when installing from pip - -- kueuectl: install from [here](https://kueue.sigs.k8s.io/docs/reference/kubectl-kueue/installation/) -- kjob: installation instructions [here](https://github.com/kubernetes-sigs/kjob/blob/main/docs/installation.md) - -### Additional prerequisites when installing from source - -- git: [installation instructions](https://git-scm.com/downloads/linux) -- make: install by running `apt-get -y install make` (`sudo` might be required) - -## Installation via pip - -To install XPK using pip, first install required tools mentioned in [prerequisites](#prerequisites) and [additional prerequisites](#additional-prerequisites-when-installing-from-pip). Then you can install XPK simply by running: - -```shell -pip install xpk -``` - -If you see an error saying: `This environment is externally managed`, please use a virtual environment. For example: - -```shell -# One time step of creating the virtual environment -VENV_DIR=~/venvp3 -python3 -m venv $VENV_DIR - -# Activate your virtual environment -source $VENV_DIR/bin/activate - -# Install XPK in virtual environment using pip -pip install xpk -``` - -## Installation from source - -To install XPK from source, first install required tools mentioned in [prerequisites](#prerequisites) and [additional prerequisites](#additional-prerequisites-when-installing-from-source). Afterwards you can install XPK from source using `make` - -```shell -# Clone the XPK repository -git clone https://github.com/google/xpk.git -cd xpk - -# Install required dependencies and build XPK with make -make install && export PATH=$PATH:$PWD/bin -``` - -If you want the dependecies to be available in your PATH please run: `echo $PWD/bin` and add its value to `PATH` in .bashrc or .zshrc file. - -If you see an error saying: `This environment is externally managed`, please use a virtual environment. 
For example: - -```shell -# One time step of creating the virtual environment -VENV_DIR=~/venvp3 -python3 -m venv $VENV_DIR - -# Activate your virtual environment -source $VENV_DIR/bin/activate - -# Clone the XPK repository -git clone https://github.com/google/xpk.git -cd xpk - -# Install required dependencies and build XPK with make -make install && export PATH=$PATH:$PWD/bin -``` - -# XPK for Large Scale (>1k VMs) - -Follow user instructions in [xpk-large-scale-guide.sh](xpk-large-scale-guide.sh) -to use xpk for a GKE cluster greater than 1000 VMs. Run these steps to set up a -GKE cluster with large scale training and high throughput support with XPK, and -run jobs with XPK. We recommend you manually copy commands per step and verify -the outputs of each step. - -# Example usages: - -To get started, be sure to set your GCP Project and Zone as usual via `gcloud -config set`. - -Below are reference commands. A typical journey starts with a `Cluster Create` -followed by many `Workload Create`s. To understand the state of the system you -might want to use `Cluster List` or `Workload List` commands. Finally, you can -cleanup with a `Cluster Delete`. - -If you have failures with workloads not running, use `xpk inspector` to investigate -more. - -If you need your Workloads to have persistent storage, use `xpk storage` to find out more. - -## Cluster Create - -First set the project and zone through gcloud config or xpk arguments. - -```shell -PROJECT_ID=my-project-id -ZONE=us-east5-b -# gcloud config: -gcloud config set project $PROJECT_ID -gcloud config set compute/zone $ZONE -# xpk arguments -xpk .. --zone $ZONE --project $PROJECT_ID -``` - -The cluster created is a regional cluster to enable the GKE control plane across -all zones. - -* Cluster Create (provision reserved capacity): - - ```shell - # Find your reservations - gcloud compute reservations list --project=$PROJECT_ID - # Run cluster create with reservation. - python3 xpk.py cluster create \ - --cluster xpk-test --tpu-type=v5litepod-256 \ - --num-slices=2 \ - --reservation=$RESERVATION_ID - ``` - -* Cluster Create (provision on-demand capacity): - - ```shell - python3 xpk.py cluster create \ - --cluster xpk-test --tpu-type=v5litepod-16 \ - --num-slices=4 --on-demand - ``` - -* Cluster Create (provision spot / preemptable capacity): - - ```shell - python3 xpk.py cluster create \ - --cluster xpk-test --tpu-type=v5litepod-16 \ - --num-slices=4 --spot - ``` - -* Cluster Create (DWS flex queued capacity): - ```shell - python3 xpk.py cluster create \ - --cluster xpk-test --tpu-type=v5litepod-16 \ - --num-slices=4 --flex - ``` - -* Cluster Create for Pathways: - Pathways compatible cluster can be created using `cluster create-pathways`. - ```shell - python3 xpk.py cluster create-pathways \ - --cluster xpk-pw-test \ - --num-slices=4 --on-demand \ - --tpu-type=v5litepod-16 - ``` - Note that Pathways clusters need a CPU nodepool of n2-standard-64 or higher. - -* Cluster Create for Ray: - A cluster with KubeRay enabled and a RayCluster can be created using `cluster create-ray`. - ```shell - python3 xpk.py cluster create-ray \ - --cluster xpk-rc-test \ - --ray-version=2.39.0 \ - --num-slices=4 --on-demand \ - --tpu-type=v5litepod-8 - ``` - -* Cluster Create can be called again with the same `--cluster name` to modify - the number of slices or retry failed steps. 
- - For example, if a user creates a cluster with 4 slices: - - ```shell - python3 xpk.py cluster create \ - --cluster xpk-test --tpu-type=v5litepod-16 \ - --num-slices=4 --reservation=$RESERVATION_ID - ``` - - and recreates the cluster with 8 slices. The command will rerun to create 4 - new slices: - - ```shell - python3 xpk.py cluster create \ - --cluster xpk-test --tpu-type=v5litepod-16 \ - --num-slices=8 --reservation=$RESERVATION_ID - ``` - - and recreates the cluster with 6 slices. The command will rerun to delete 2 - slices. The command will warn the user when deleting slices. - Use `--force` to skip prompts. - - ```shell - python3 xpk.py cluster create \ - --cluster xpk-test --tpu-type=v5litepod-16 \ - --num-slices=6 --reservation=$RESERVATION_ID - - # Skip delete prompts using --force. - - python3 xpk.py cluster create --force \ - --cluster xpk-test --tpu-type=v5litepod-16 \ - --num-slices=6 --reservation=$RESERVATION_ID - ``` - - and recreates the cluster with 4 slices of v4-8. The command will rerun to delete - 6 slices of v5litepod-16 and create 4 slices of v4-8. The command will warn the - user when deleting slices. Use `--force` to skip prompts. - - ```shell - python3 xpk.py cluster create \ - --cluster xpk-test --tpu-type=v4-8 \ - --num-slices=4 --reservation=$RESERVATION_ID - - # Skip delete prompts using --force. - - python3 xpk.py cluster create --force \ - --cluster xpk-test --tpu-type=v4-8 \ - --num-slices=4 --reservation=$RESERVATION_ID - ``` - -### Create Private Cluster - -XPK allows you to create a private GKE cluster for enhanced security. In a private cluster, nodes and pods are isolated from the public internet, providing an additional layer of protection for your workloads. - -To create a private cluster, use the following arguments: - -**`--private`** - -This flag enables the creation of a private GKE cluster. When this flag is set: - -* Nodes and pods are isolated from the direct internet access. -* `master_authorized_networks` is automatically enabled. -* Access to the cluster's control plane is restricted to your current machine's IP address by default. - -**`--authorized-networks`** - -This argument allows you to specify additional IP ranges (in CIDR notation) that are authorized to access the private cluster's control plane and perform `kubectl` commands. - -* Even if this argument is not set when you have `--private`, your current machine's IP address will always be given access to the control plane. -* If this argument is used with an existing private cluster, it will replace the existing authorized networks. - -**Example Usage:** - -* To create a private cluster and allow access to Control Plane only to your current machine: - - ```shell - python3 xpk.py cluster create \ - --cluster=xpk-private-cluster \ - --tpu-type=v4-8 --num-slices=2 \ - --private - ``` - -* To create a private cluster and allow access to Control Plane only to your current machine and the IP ranges `1.2.3.0/24` and `1.2.4.5/32`: - - ```shell - python3 xpk.py cluster create \ - --cluster=xpk-private-cluster \ - --tpu-type=v4-8 --num-slices=2 \ - --authorized-networks 1.2.3.0/24 1.2.4.5/32 - - # --private is optional when you set --authorized-networks - ``` - -> **Important Notes:** -> * The argument `--private` is only applicable when creating new clusters. You cannot convert an existing public cluster to a private cluster using these flags. -> * The argument `--authorized-networks` is applicable when creating new clusters or using an existing _*private*_ cluster. 
You cannot convert an existing public cluster to a private cluster using these flags.
-> * You need to [set up Cloud NAT for your VPC network](https://cloud.google.com/nat/docs/set-up-manage-network-address-translation#creating_nat) so that the Nodes and Pods have outbound access to the internet. This is required because XPK installs and configures components such as Kueue that need access to external sources like `registry.k8s.io`.
-
-
-### Create Vertex AI Tensorboard
-*Note: This feature is available in XPK >= 0.4.0. Enable the [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have the
-[Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
-assigned to your user account.*
-
-Vertex AI Tensorboard is a fully managed version of open-source Tensorboard. To learn more about Vertex AI Tensorboard, visit [this page](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction). Note that Vertex AI Tensorboard is only available in [these regions](https://cloud.google.com/vertex-ai/docs/general/locations#available-regions).
-
-You can create a Vertex AI Tensorboard for your cluster with the `Cluster Create` command. XPK will create a single Vertex AI Tensorboard instance per cluster.
-
-* Create Vertex AI Tensorboard in the default region with the default Tensorboard name:
-
-```shell
-python3 xpk.py cluster create \
---cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
---create-vertex-tensorboard
-```
-
-will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<cluster-name>-tb-instance*) in `us-central1` (*default region*).
-
-* Create Vertex AI Tensorboard in a user-specified region with the default Tensorboard name:
-
-```shell
-python3 xpk.py cluster create \
---cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
---create-vertex-tensorboard --tensorboard-region=us-west1
-```
-
-will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*<cluster-name>-tb-instance*) in `us-west1`.
-
-* Create Vertex AI Tensorboard in the default region with a user-specified Tensorboard name:
-
-```shell
-python3 xpk.py cluster create \
---cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
---create-vertex-tensorboard --tensorboard-name=tb-testing
-```
-
-will create a Vertex AI Tensorboard with the name `tb-testing` in `us-central1`.
-
-* Create Vertex AI Tensorboard in a user-specified region with a user-specified Tensorboard name:
-
-```shell
-python3 xpk.py cluster create \
---cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
---create-vertex-tensorboard --tensorboard-region=us-west1 --tensorboard-name=tb-testing
-```
-
-will create a Vertex AI Tensorboard instance with the name `tb-testing` in `us-west1`.
-
-* Create Vertex AI Tensorboard in an unsupported region:
-
-```shell
-python3 xpk.py cluster create \
---cluster xpk-test --num-slices=1 --tpu-type=v4-8 \
---create-vertex-tensorboard --tensorboard-region=us-central2
-```
-
-will fail the cluster creation process because Vertex AI Tensorboard is not supported in `us-central2`.
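To confirm that the Tensorboard instance was created alongside the cluster, you can list the Vertex AI Tensorboard instances in the region you chose. This is a minimal sketch assuming your installed gcloud SDK ships the `gcloud ai tensorboards` command group; the display name shown is the default one from the examples above.

```shell
# List Vertex AI Tensorboard instances in the region and look for the
# instance created by `cluster create` (e.g. xpk-test-tb-instance).
gcloud ai tensorboards list \
  --project=$PROJECT_ID --region=us-central1 \
  --filter="displayName:xpk-test-tb-instance"
```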
- -## Cluster Delete -* Cluster Delete (deprovision capacity): - - ```shell - python3 xpk.py cluster delete \ - --cluster xpk-test - ``` -## Cluster List -* Cluster List (see provisioned capacity): - - ```shell - python3 xpk.py cluster list - ``` -## Cluster Describe -* Cluster Describe (see capacity): - - ```shell - python3 xpk.py cluster describe \ - --cluster xpk-test - ``` - -## Cluster Cacheimage -* Cluster Cacheimage (enables faster start times): - - ```shell - python3 xpk.py cluster cacheimage \ - --cluster xpk-test --docker-image gcr.io/your_docker_image \ - --tpu-type=v5litepod-16 - ``` - -## Provisioning A3 Ultra, A3 Mega and A4 clusters (GPU machines) -To create a cluster with A3 or A4 machines, run the command below with selected device type. To create workloads on these clusters see [here](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines). - -**Note:** Creating A3 Ultra, A3 Mega and A4 clusters is currently supported **only** on linux/amd64 architecture. - -Machine | Device type -:- | :- -A3 Mega | `h100-mega-80gb-8` -A3 Ultra | `h200-141gb-8` -A4 | `b200-8` - - -```shell -python3 xpk.py cluster create \ - --cluster CLUSTER_NAME --device-type DEVICE_TYPE \ - --zone=$COMPUTE_ZONE --project=$PROJECT_ID \ - --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID -``` - -Currently, the below flags/arguments are supported for A3 Mega, A3 Ultra and A4 machines: - * `--num-nodes` - * `--default-pool-cpu-machine-type` - * `--default-pool-cpu-num-nodes` - * `--reservation` - * `--spot` - * `--on-demand` (A3 Mega only) - * `--flex` - -## Running XPK on existing clusters - -In order to run XPK commands on a cluster it needs to be set up correctly. This is done automatically when creating a cluster using `xpk cluster create`. For clusters created differently (e.g.: with 'gcloud' or a Cluster Toolkit blueprint) there is a dedicated command: `xpk cluster adapt`. This command installs required config maps, kueue, jobset, CSI drivers etc. - -Currently `xpk cluster adapt` supports only the following device types: - -- `h200-141gb-8` (A3 Ultra) - -Example usage: -```shell -python3 xpk.py cluster adapt \ - --cluster=$CLUSTER_NAME --device-type=$DEVICE_TYPE \ - --zone=$COMPUTE_ZONE --project=$PROJECT_ID \ - --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID -``` - -## Storage -Currently XPK supports the below types of storages: -- [Cloud Storage FUSE](#fuse) -- [Google Cloud Filestore](#filestore) -- [Google Cloud Parallelstore](#parallelstore) -- [Google Cloud Block storages (Persistent Disk, Hyperdisk)](#block-storage-persistent-disk-hyperdisk) -- [Google Cloud Managed Lustre](#managed-lustre) - -### FUSE -A FUSE adapter lets you mount and access Cloud Storage buckets as local file systems, so workloads can read and write objects in your bucket using standard file system semantics. - -To use the GCS FUSE with XPK you need to create a [Storage Bucket](https://console.cloud.google.com/storage/). - -Once it's ready you can use `xpk storage attach` with `--type=gcsfuse` command to attach a FUSE storage instance to your cluster: - -```shell -python3 xpk.py storage attach test-fuse-storage --type=gcsfuse \ - --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE - --mount-point='/test-mount-point' --readonly=false \ - --bucket=test-bucket --size=1 --auto-mount=false -``` - -Parameters: - -- `--type` - type of the storage, currently xpk supports `gcsfuse` and `gcpfilestore` only. -- `--auto-mount` - if set to true all workloads will have this storage mounted by default. 
-- `--mount-point` - the path on which this storage should be mounted for a workload.
-- `--readonly` - if set to true, the workload can only read from the storage.
-- `--size` - size of the storage in GB.
-- `--bucket` - name of the storage bucket. If not set, the name of the storage is used as the bucket name.
-- `--mount-options` - comma-separated list of additional mount options for the PersistentVolume ([reference](https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf#mount-options)).
-- `--prefetch-metadata` - enables metadata pre-population when mounting the volume by setting the parameter `gcsfuseMetadataPrefetchOnMount` to `true` ([reference](https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf#metadata-prefetch)).
-- `--manifest` - path to the manifest file containing the PersistentVolume and PersistentVolumeClaim definitions. If set, values from the manifest override the following parameters: `--size` and `--bucket`.
-
-### Filestore
-
-A Filestore adapter lets you mount and access [Filestore instances](https://cloud.google.com/filestore/) as local file systems, so workloads can read and write files in your volumes using standard file system semantics.
-
-To create and attach a GCP Filestore instance to your cluster, use the `xpk storage create` command with `--type=gcpfilestore`:
-
-```shell
-python3 xpk.py storage create test-fs-storage --type=gcpfilestore \
-  --auto-mount=false --mount-point=/data-fs --readonly=false \
-  --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
-  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
-```
-
-You can also attach an existing Filestore instance to your cluster using the `xpk storage attach` command:
-
-```shell
-python3 xpk.py storage attach test-fs-storage --type=gcpfilestore \
-  --auto-mount=false --mount-point=/data-fs --readonly=false \
-  --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
-  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
-```
-
-The command above is also useful when attaching multiple volumes from the same Filestore instance.
-
-The `xpk storage create` and `xpk storage attach` commands with `--type=gcpfilestore` accept the following arguments:
-- `--type` - type of the storage.
-- `--auto-mount` - if set to true, all workloads will have this storage mounted by default.
-- `--mount-point` - the path on which this storage should be mounted for a workload.
-- `--readonly` - if set to true, the workload can only read from the storage.
-- `--size` - size of the Filestore instance that will be created, in GB.
-- `--tier` - tier of the Filestore instance that will be created. Possible options are: `[BASIC_HDD, BASIC_SSD, ZONAL, REGIONAL, ENTERPRISE]`
-- `--access-mode` - access mode of the Filestore instance that will be created. Possible values are: `[ReadWriteOnce, ReadOnlyMany, ReadWriteMany]`
-- `--vol` - file share name of the Filestore instance that will be created.
-- `--instance` - the name of the Filestore instance. If not set, the name parameter is used as the instance name. Useful when connecting multiple volumes from the same Filestore instance.
-- `--manifest` - path to the manifest file containing the PersistentVolume, PersistentVolumeClaim and StorageClass definitions. If set, values from the manifest override the following parameters: `--access-mode`, `--size` and `--vol`.
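As noted for `--instance` above, several file shares from one Filestore instance can be attached as separate XPK storages. The sketch below is illustrative and assumes an already-created instance named `test-fs-storage` that exposes an additional file share called `logs`; flag values mirror the examples above.

```shell
# Attach a second file share from the existing Filestore instance under its
# own mount point, instead of creating a new instance.
python3 xpk.py storage attach test-fs-logs --type=gcpfilestore \
  --instance=test-fs-storage --vol=logs \
  --auto-mount=false --mount-point=/logs-fs --readonly=false \
  --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany \
  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
```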
- -### Parallelstore - -A Parallelstore adapter lets you mount and access [Parallelstore instances](https://cloud.google.com/parallelstore/) as local file systems, so workloads can read and write files in your volumes using standard file system semantics. - -To use the GCS Parallelstore with XPK you need to create a [Parallelstore Instance](https://console.cloud.google.com/parallelstore/). - -Once it's ready you can use `xpk storage attach` with `--type=parallelstore` command to attach a Parallelstore instance to your cluster. Currently, attaching a Parallelstore is supported only by providing a manifest file. - -```shell -python3 xpk.py storage attach test-parallelstore-storage --type=parallelstore \ - --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \ - --mount-point='/test-mount-point' --readonly=false \ - --auto-mount=true \ - --manifest='./examples/storage/parallelstore-manifest-attach.yaml' -``` - -Parameters: - -- `--type` - type of the storage `parallelstore` -- `--auto-mount` - if set to true all workloads will have this storage mounted by default. -- `--mount-point` - the path on which this storage should be mounted for a workload. -- `--readonly` - if set to true, workload can only read from storage. -- `--manifest` - path to the manifest file containing PersistentVolume and PresistentVolumeClaim definitions. - -### Block storage (Persistent Disk, Hyperdisk) - -A PersistentDisk adapter lets you mount and access Google Cloud Block storage solutions ([Persistent Disk](https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview#pd), [Hyperdisk](https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview#hyperdisk)) as local file systems, so workloads can read and write files in your volumes using standard file system semantics. - -To use the GCE PersistentDisk with XPK you need to create a [disk in GCE](https://cloud.google.com/compute/docs/disks). Please consider that the disk type you are creating is [compatible with the VMs](https://cloud.google.com/compute/docs/machine-resource#machine_type_comparison) in the default and accelerator nodepools. - -Once it's ready you can use `xpk storage attach` with `--type=pd` command to attach a PersistentDisk instance to your cluster. Currently, attaching a PersistentDisk is supported only by providing a manifest file. - -```shell -python3 xpk.py storage attach test-pd-storage --type=pd \ - --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \ - --mount-point='/test-mount-point' --readonly=false \ - --auto-mount=true \ - --manifest='./examples/storage/pd-manifest-attach.yaml' -``` - -Parameters: - -- `--type` - type of the storage `pd` -- `--auto-mount` - if set to true all workloads will have this storage mounted by default. -- `--mount-point` - the path on which this storage should be mounted for a workload. -- `--readonly` - if set to true, workload can only read from storage. -- `--manifest` - path to the manifest file containing PersistentVolume and PresistentVolumeClaim definitions. - -### Managed Lustre - -A Managed Lustre adaptor lets you mount and access [Google Cloud Managed Lustre instances](https://cloud.google.com/kubernetes-engine/docs/concepts/managed-lustre) as local file systems, so workloads can read and write files in your volumes using standard file system semantics. - -To use the GCP Managed Lustre with XPK you need to create [an instance](https://cloud.google.com/managed-lustre/docs/create-instance). Please make sure you enable GKE support when creating the instance (gcloud ex. 
`--gke-support-enabled`). - -Once it's ready you can use `xpk storage attach` with `--type=lustre` command to attach a Managed Lustre instance to your cluster. Currently, attaching a Managed Lustre instance is supported only by providing a manifest file. - -```shell -python3 xpk.py storage attach test-lustre-storage --type=lustre \ - --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \ - --mount-point='/test-mount-point' --readonly=false \ - --auto-mount=true \ - --manifest='./examples/storage/lustre-manifest-attach.yaml' -``` - -Parameters: - -- `--type` - type of the storage `lustre` -- `--auto-mount` - if set to true all workloads will have this storage mounted by default. -- `--mount-point` - the path on which this storage should be mounted for a workload. -- `--readonly` - if set to true, workload can only read from storage. -- `--manifest` - path to the manifest file containing PersistentVolume and PresistentVolumeClaim definitions. - -### List attached storages - -```shell -python3 xpk.py storage list \ - --project=$PROJECT --cluster $CLUSTER --zone=$ZONE -``` - -### Running workloads with storage - -If you specified `--auto-mount=true` when creating or attaching a storage, then all workloads deployed on the cluster will have the volume attached by default. Otherwise, in order to have the storage attached, you have to add `--storage` parameter to `workload create` command: - -```shell -python3 xpk.py workload create \ - --workload xpk-test-workload --command "echo goodbye" \ - --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \ - --tpu-type=v5litepod-16 --storage=test-storage -``` - -### Detaching storage - -```shell -python3 xpk.py storage detach $STORAGE_NAME \ - --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE -``` - -### Deleting storage - -XPK allows you to remove Filestore instances easily with `xpk storage delete` command. **Warning:** this deletes all data contained in the Filestore! 
- -```shell -python3 xpk.py storage delete test-fs-instance \ - --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE -``` - -## Workload Create -* Workload Create (submit training job): - - ```shell - python3 xpk.py workload create \ - --workload xpk-test-workload --command "echo goodbye" \ - --cluster xpk-test \ - --tpu-type=v5litepod-16 --project=$PROJECT - ``` -* Workload create(DWS flex with queued provisioning): - ```shell - python3 xpk.py workload create \ - --workload xpk-test-workload --command "echo goodbye" \ - --cluster xpk-test --flex \ - --tpu-type=v5litepod-16 --project=$PROJECT - -* Workload Create for Pathways: - Pathways workload can be submitted using `workload create-pathways` on a Pathways enabled cluster (created with `cluster create-pathways`) - - Pathways workload example: - ```shell - python3 xpk.py workload create-pathways \ - --workload xpk-pw-test \ - --num-slices=1 \ - --tpu-type=v5litepod-16 \ - --cluster xpk-pw-test \ - --docker-name='user-workload' \ - --docker-image= \ - --command='python3 -m MaxText.train MaxText/configs/base.yml base_output_directory= dataset_path= per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1 enable_single_controller=True' - ``` - - Regular workload can also be submitted on a Pathways enabled cluster (created with `cluster create-pathways`) - - Pathways workload example: - ```shell - python3 xpk.py workload create-pathways \ - --workload xpk-regular-test \ - --num-slices=1 \ - --tpu-type=v5litepod-16 \ - --cluster xpk-pw-test \ - --docker-name='user-workload' \ - --docker-image= \ - --command='python3 -m MaxText.train MaxText/configs/base.yml base_output_directory= dataset_path= per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1' - ``` - - Pathways in headless mode - Pathways now offers the capability to run JAX workloads in Vertex AI notebooks or in GCE VMs! - Specify `--headless` with `workload create-pathways` when the user workload is not provided in a docker container. - ```shell - python3 xpk.py workload create-pathways --headless \ - --workload xpk-pw-headless \ - --num-slices=1 \ - --tpu-type=v5litepod-16 \ - --cluster xpk-pw-test - ``` - Executing the command above would provide the address of the proxy that the user job should connect to. - ```shell - kubectl get pods - kubectl port-forward pod/ 29000:29000 - ``` - ```shell - JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 python -c 'import pathwaysutils; import jax; print(jax.devices())' - ``` - Specify `JAX_PLATFORMS=proxy` and `JAX_BACKEND_TARGET=` and `import pathwaysutils` to establish this connection between the user's JAX code and the Pathways proxy. Execute Pathways workloads interactively on Vertex AI notebooks! - -### Set `max-restarts` for production jobs - -* `--max-restarts `: By default, this is 0. This will restart the job "" -times when the job terminates. For production jobs, it is recommended to -increase this to a large number, say 50. Real jobs can be interrupted due to -hardware failures and software updates. 
We assume your job has implemented
-checkpointing so the job restarts near where it was interrupted.
-
-### Workloads for A3 Ultra, A3 Mega and A4 clusters (GPU machines)
-To submit jobs on a cluster with A3 or A4 machines, run the command below with the selected device type. To create a cluster with A3 or A4 machines, see [here](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines).
-
-
-Machine | Device type
-:- | :-
-A3 Mega | `h100-mega-80gb-8`
-A3 Ultra | `h200-141gb-8`
-A4 | `b200-8`
-
-```shell
-python3 xpk.py workload create \
-  --workload=$WORKLOAD_NAME --command="echo goodbye" \
-  --cluster=$CLUSTER_NAME --device-type DEVICE_TYPE \
-  --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
-  --num-nodes=$WORKLOAD_NUM_NODES
-```
-
-> The docker image flags/arguments introduced in the [workloads section](#workload-create) can be used with A3 or A4 machines as well.
-
-To run an NCCL test on A3 machines, check out [this guide](/examples/nccl/nccl.md).
-
-### Workload Priority and Preemption
-* Set the priority level of your workload with `--priority=LEVEL`
-
-  We have five priorities defined: [`very-low`, `low`, `medium`, `high`, `very-high`].
-  The default priority is `medium`.
-
-  Priority determines:
-
-  1. Order of queued jobs.
-
-      Queued jobs are ordered by
-      `very-low` < `low` < `medium` < `high` < `very-high`
-
-  2. Preemption of lower priority workloads.
-
-      A higher priority job will `evict` lower priority jobs.
-      Evicted jobs are brought back to the queue and will re-hydrate appropriately.
-
-  #### General Example:
-  ```shell
-  python3 xpk.py workload create \
-  --workload xpk-test-medium-workload --command "echo goodbye" --cluster \
-  xpk-test --tpu-type=v5litepod-16 --priority=medium
-  ```
-
-### Create Vertex AI Experiment to upload data to Vertex AI Tensorboard
-*Note: This feature is available in XPK >= 0.4.0. Enable the [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have the
-[Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
-assigned to your user account and to the [Compute Engine Service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account) attached to the node pools in the cluster.*
-
-Vertex AI Experiment is a tool that helps you track and analyze an experiment run on Vertex AI Tensorboard. To learn more about Vertex AI Experiments, visit [this page](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments).
-
-XPK will create a Vertex AI Experiment during the `workload create` command and attach the Vertex AI Tensorboard created for the cluster during `cluster create`. If the cluster was created before this feature was released, it will have no Vertex AI Tensorboard and `workload create` will fail. Re-run `cluster create` to create a Vertex AI Tensorboard and then run `workload create` again to schedule your workload.
-
-* Create Vertex AI Experiment with the default Experiment name:
-
-```shell
-python3 xpk.py workload create \
---cluster xpk-test --workload xpk-workload \
---use-vertex-tensorboard
-```
-
-will create a Vertex AI Experiment with the name `xpk-test-xpk-workload` (*<cluster-name>-<workload-name>*).
- -* Create Vertex AI Experiment with user-specified Experiment name: - -```shell -python3 xpk.py workload create \ ---cluster xpk-test --workload xpk-workload \ ---use-vertex-tensorboard --experiment-name=test-experiment -``` - -will create a Vertex AI Experiment with the name `test-experiment`. - -Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how to update your workload to automatically upload logs collected in your Tensorboard directory to the Vertex AI Experiment created by `workload create`. - -## Workload Delete -* Workload Delete (delete training job): - - ```shell - python3 xpk.py workload delete \ - --workload xpk-test-workload --cluster xpk-test - ``` - - This will only delete `xpk-test-workload` workload in `xpk-test` cluster. - -* Workload Delete (delete all training jobs in the cluster): - - ```shell - python3 xpk.py workload delete \ - --cluster xpk-test - ``` - - This will delete all the workloads in `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt. Multiple workload deletions are processed in batches for optimized processing. - -* Workload Delete supports filtering. Delete a portion of jobs that match user criteria. Multiple workload deletions are processed in batches for optimized processing. - * Filter by Job: `filter-by-job` - - ```shell - python3 xpk.py workload delete \ - --cluster xpk-test --filter-by-job=$USER - ``` - - This will delete all the workloads in `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt. - - * Filter by Status: `filter-by-status` - - ```shell - python3 xpk.py workload delete \ - --cluster xpk-test --filter-by-status=QUEUED - ``` - - This will delete all the workloads in `xpk-test` cluster that have the status as Admitted or Evicted, and the number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`,`FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`. - -## Workload List -* Workload List (see training jobs): - - ```shell - python3 xpk.py workload list \ - --cluster xpk-test - ``` - -* Example Workload List Output: - - The below example shows four jobs of different statuses: - - * `user-first-job-failed`: **filter-status** is `FINISHED` and `FAILED`. - * `user-second-job-success`: **filter-status** is `FINISHED` and `SUCCESSFUL`. - * `user-third-job-running`: **filter-status** is `RUNNING`. - * `user-forth-job-in-queue`: **filter-status** is `QUEUED`. - * `user-fifth-job-in-queue-preempted`: **filter-status** is `QUEUED`. - - ``` - Jobset Name Created Time Priority TPU VMs Needed TPU VMs Running/Ran TPU VMs Done Status Status Message Status Time - user-first-job-failed 2023-1-1T1:00:00Z medium 4 4 Finished JobSet failed 2023-1-1T1:05:00Z - user-second-job-success 2023-1-1T1:10:00Z medium 4 4 4 Finished JobSet finished successfully 2023-1-1T1:14:00Z - user-third-job-running 2023-1-1T1:15:00Z medium 4 4 Admitted Admitted by ClusterQueue cluster-queue 2023-1-1T1:16:00Z - user-forth-job-in-queue 2023-1-1T1:16:05Z medium 4 Admitted couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need 2023-1-1T1:16:10Z - user-fifth-job-preempted 2023-1-1T1:10:05Z low 4 Evicted Preempted to accommodate a higher priority Workload 2023-1-1T1:10:00Z - ``` - -* Workload List supports filtering. Observe a portion of jobs that match user criteria. 
- - * Filter by Status: `filter-by-status` - - Filter the workload list by the status of respective jobs. - Status can be: `EVERYTHING`,`FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL` - - * Filter by Job: `filter-by-job` - - Filter the workload list by the name of a job. - - ```shell - python3 xpk.py workload list \ - --cluster xpk-test --filter-by-job=$USER - ``` - -* Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once. - - Wait for a job to complete. - - ```shell - python3 xpk.py workload list \ - --cluster xpk-test --wait-for-job-completion=xpk-test-workload - ``` - - Wait for a job to complete with a timeout of 300 seconds. - - ```shell - python3 xpk.py workload list \ - --cluster xpk-test --wait-for-job-completion=xpk-test-workload \ - --timeout=300 - ``` - - Return codes - `0`: Workload finished and completed successfully. - `124`: Timeout was reached before workload finished. - `125`: Workload finished but did not complete successfully. - `1`: Other failure. - -## Job List - -* Job List (see jobs submitted via batch command): - - ```shell - python3 xpk.py job ls --cluster xpk-test - ``` - -* Example Job List Output: - - ``` - NAME PROFILE LOCAL QUEUE COMPLETIONS DURATION AGE - xpk-def-app-profile-slurm-74kbv xpk-def-app-profile 1/1 15s 17h - xpk-def-app-profile-slurm-brcsg xpk-def-app-profile 1/1 9s 3h56m - xpk-def-app-profile-slurm-kw99l xpk-def-app-profile 1/1 5s 3h54m - xpk-def-app-profile-slurm-x99nx xpk-def-app-profile 3/3 29s 17h - ``` - -## Job Cancel - -* Job Cancel (delete job submitted via batch command): - - ```shell - python3 xpk.py job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test - ``` - -## Inspector -* Inspector provides debug info to understand cluster health, and why workloads are not running. -Inspector output is saved to a file. - - ```shell - python3 xpk.py inspector \ - --cluster $CLUSTER_NAME \ - --project $PROJECT_ID \ - --zone $ZONE - ``` - -* Optional Arguments - * `--print-to-terminal`: - Print command output to terminal as well as a file. - * `--workload $WORKLOAD_NAME` - Inspector will write debug info related to the workload:`$WORKLOAD_NAME` - -* Example Output: - - The output of xpk inspector is in `/tmp/tmp0pd6_k1o` in this example. - ```shell - [XPK] Starting xpk - [XPK] Task: `Set Cluster` succeeded. - [XPK] Task: `Local Setup: gcloud version` is implemented by `gcloud version`, hiding output unless there is an error. - [XPK] Task: `Local Setup: Project / Zone / Region` is implemented by `gcloud config get project; gcloud config get compute/zone; gcloud config get compute/region`, hiding output unless there is an error. - [XPK] Task: `GKE: Cluster Details` is implemented by `gcloud beta container clusters list --project $PROJECT --region $REGION | grep -e NAME -e $CLUSTER_NAME`, hiding output unless there is an error. - [XPK] Task: `GKE: Node pool Details` is implemented by `gcloud beta container node-pools list --cluster $CLUSTER_NAME --project=$PROJECT --region=$REGION`, hiding output unless there is an error. 
- [XPK] Task: `Kubectl: All Nodes` is implemented by `kubectl get node -o custom-columns='NODE_NAME:metadata.name, READY_STATUS:.status.conditions[?(@.type=="Ready")].status, NODEPOOL:metadata.labels.cloud\.google\.com/gke-nodepool'`, hiding output unless there is an error. - [XPK] Task: `Kubectl: Number of Nodes per Node Pool` is implemented by `kubectl get node -o custom-columns=':metadata.labels.cloud\.google\.com/gke-nodepool' | sort | uniq -c`, hiding output unless there is an error. - [XPK] Task: `Kubectl: Healthy Node Count Per Node Pool` is implemented by `kubectl get node -o custom-columns='NODE_NAME:metadata.name, READY_STATUS:.status.conditions[?(@.type=="Ready")].status, NODEPOOL:metadata.labels.cloud\.google\.com/gke-nodepool' | grep -w True | awk {'print $3'} | sort | uniq -c`, hiding output unless there is an error. - [XPK] Task: `Kueue: ClusterQueue Details` is implemented by `kubectl describe ClusterQueue cluster-queue`, hiding output unless there is an error. - [XPK] Task: `Kueue: LocalQueue Details` is implemented by `kubectl describe LocalQueue multislice-queue`, hiding output unless there is an error. - [XPK] Task: `Kueue: Kueue Deployment Details` is implemented by `kubectl describe Deployment kueue-controller-manager -n kueue-system`, hiding output unless there is an error. - [XPK] Task: `Jobset: Deployment Details` is implemented by `kubectl describe Deployment jobset-controller-manager -n jobset-system`, hiding output unless there is an error. - [XPK] Task: `Kueue Manager Logs` is implemented by `kubectl logs deployment/kueue-controller-manager -n kueue-system --tail=100 --prefix=True`, hiding output unless there is an error. - [XPK] Task: `Jobset Manager Logs` is implemented by `kubectl logs deployment/jobset-controller-manager -n jobset-system --tail=100 --prefix=True`, hiding output unless there is an error. - [XPK] Task: `List Jobs with filter-by-status=EVERYTHING with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" `, hiding output unless there is an error. - [XPK] Task: `List Jobs with filter-by-status=QUEUED with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" | awk -e 'NR == 1 || ($7 ~ "Admitted|Evicted|QuotaReserved" && ($5 ~ "" || $5 == 0)) {print $0}' `, hiding output unless there is an error. 
- [XPK] Task: `List Jobs with filter-by-status=RUNNING with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" | awk -e 'NR == 1 || ($7 ~ "Admitted|Evicted" && $5 ~ /^[0-9]+$/ && $5 > 0) {print $0}' `, hiding output unless there is an error. - [XPK] Find xpk inspector output file: /tmp/tmp0pd6_k1o - [XPK] Exiting XPK cleanly - ``` - -## Run -* `xpk run` lets you execute scripts on a cluster with ease. It automates task execution, handles interruptions, and streams job output to your console. - - ```shell - python xpk.py run --kind-cluster -n 2 -t 0-2 examples/job.sh - ``` - -* Example Output: - - ```shell - [XPK] Starting xpk - [XPK] Task: `get current-context` is implemented by `kubectl config current-context`, hiding output unless there is an error. - [XPK] No local cluster name specified. Using current-context `kind-kind` - [XPK] Task: `run task` is implemented by `kubectl kjob create slurm --profile xpk-def-app-profile --localqueue multislice-queue --wait --rm -- examples/job.sh --partition multislice-queue --ntasks 2 --time 0-2`. Streaming output and input live. - job.batch/xpk-def-app-profile-slurm-g4vr6 created - configmap/xpk-def-app-profile-slurm-g4vr6 created - service/xpk-def-app-profile-slurm-g4vr6 created - Starting log streaming for pod xpk-def-app-profile-slurm-g4vr6-1-4rmgk... - Now processing task ID: 3 - Starting log streaming for pod xpk-def-app-profile-slurm-g4vr6-0-bg6dm... - Now processing task ID: 1 - exit - exit - Now processing task ID: 2 - exit - Job logs streaming finished.[XPK] Task: `run task` terminated with code `0` - [XPK] XPK Done. - ``` - -## GPU usage - -In order to use XPK for GPU, you can do so by using `device-type` flag. - -* Cluster Create (provision reserved capacity): - - ```shell - # Find your reservations - gcloud compute reservations list --project=$PROJECT_ID - - # Run cluster create with reservation. 
- python3 xpk.py cluster create \ - --cluster xpk-test --device-type=h100-80gb-8 \ - --num-nodes=2 \ - --reservation=$RESERVATION_ID - ``` - -* Cluster Delete (deprovision capacity): - - ```shell - python3 xpk.py cluster delete \ - --cluster xpk-test - ``` - -* Cluster List (see provisioned capacity): - - ```shell - python3 xpk.py cluster list - ``` - -* Cluster Describe (see capacity): - - ```shell - python3 xpk.py cluster describe \ - --cluster xpk-test - ``` - - -* Cluster Cacheimage (enables faster start times): - - ```shell - python3 xpk.py cluster cacheimage \ - --cluster xpk-test --docker-image gcr.io/your_docker_image \ - --device-type=h100-80gb-8 - ``` - - -* [Install NVIDIA GPU device drivers](https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install) - ```shell - # List available driver versions - gcloud compute ssh $NODE_NAME --command "sudo cos-extensions list" - - # Install the default driver - gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu" - # OR install a specific version of the driver - gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu -- -version=DRIVER_VERSION" - ``` - -* Run a workload: - - ```shell - # Submit a workload - python3 xpk.py workload create \ - --cluster xpk-test --device-type h100-80gb-8 \ - --workload xpk-test-workload \ - --command="echo hello world" - ``` - -* Workload Delete (delete training job): - - ```shell - python3 xpk.py workload delete \ - --workload xpk-test-workload --cluster xpk-test - ``` - - This will only delete `xpk-test-workload` workload in `xpk-test` cluster. - -* Workload Delete (delete all training jobs in the cluster): - - ```shell - python3 xpk.py workload delete \ - --cluster xpk-test - ``` - - This will delete all the workloads in `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt. - -* Workload Delete supports filtering. Delete a portion of jobs that match user criteria. - * Filter by Job: `filter-by-job` - - ```shell - python3 xpk.py workload delete \ - --cluster xpk-test --filter-by-job=$USER - ``` - - This will delete all the workloads in `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt. - - * Filter by Status: `filter-by-status` - - ```shell - python3 xpk.py workload delete \ - --cluster xpk-test --filter-by-status=QUEUED - ``` - - This will delete all the workloads in `xpk-test` cluster that have the status as Admitted or Evicted, and the number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`,`FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`. - -## CPU usage - -In order to use XPK for CPU, you can do so by using `device-type` flag. - -* Cluster Create (provision on-demand capacity): - - ```shell - # Run cluster create with on demand capacity. - python3 xpk.py cluster create \ - --cluster xpk-test \ - --device-type=n2-standard-32-256 \ - --num-slices=1 \ - --default-pool-cpu-machine-type=n2-standard-32 \ - --on-demand - ``` - Note that `device-type` for CPUs is of the format -, thus in the above example, user requests for 256 VMs of type n2-standard-32. - Currently workloads using < 1000 VMs are supported. 
-* Run a workload:
-
-  ```shell
-  # Submit a workload
-  python3 xpk.py workload create \
-  --cluster xpk-test \
-  --num-slices=1 \
-  --device-type=n2-standard-32-256 \
-  --workload xpk-test-workload \
-  --command="echo hello world"
-  ```
-
-# Autoprovisioning with XPK
-XPK can dynamically allocate cluster capacity using [Node Auto Provisioning (NAP)](https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#use_accelerators_for_new_auto-provisioned_node_pools) support.
-
-This allows several topology sizes to be supported from one XPK cluster and provisioned dynamically based on incoming workload requests, so XPK users do not need to re-provision the cluster manually.
-
-Enabling autoprovisioning will initially take the cluster up to **30 minutes** to upgrade.
-
-## Create a cluster with autoprovisioning:
-
-Autoprovisioning will be enabled on the cluster below, with [0, 8] v4 TPU chips (up to 1x v4-16) to scale between.
-
-XPK doesn't currently support different generations of accelerators in the same cluster (like v4 and v5p TPUs).
-
-```shell
-CLUSTER_NAME=my_cluster
-NUM_SLICES=2
-DEVICE_TYPE=v4-8
-RESERVATION=reservation_id
-PROJECT=my_project
-ZONE=us-east5-b
-
-python3 xpk.py cluster create \
-  --cluster $CLUSTER_NAME \
-  --num-slices=$NUM_SLICES \
-  --device-type=$DEVICE_TYPE \
-  --zone=$ZONE \
-  --project=$PROJECT \
-  --reservation=$RESERVATION \
-  --enable-autoprovisioning
-```
-
-1. Define the starting accelerator configuration and capacity type.
-
-    ```shell
-    --device-type=$DEVICE_TYPE \
-    --num-slices=$NUM_SLICES
-    ```
-2. Optionally set custom `minimum` / `maximum` chips. NAP will rescale the cluster with `maximum` - `minimum` chips. By default, `maximum` is set to the current cluster configuration size, and `minimum` is set to 0. This allows NAP to rescale with all the resources.
-
-    ```shell
-    --autoprovisioning-min-chips=$MIN_CHIPS \
-    --autoprovisioning-max-chips=$MAX_CHIPS
-    ```
-
-3. `FEATURE TO COME SOON:` Set the timeout period for when node pools will automatically be deleted
-if no incoming workloads are run. This is currently 10 minutes.
-
-4. `FEATURE TO COME:` Set the timeout period to infinity. This will keep the idle node pool configuration always running until updated by new workloads.
-
-### Update a cluster with autoprovisioning:
-```shell
-CLUSTER_NAME=my_cluster
-NUM_SLICES=2
-DEVICE_TYPE=v4-8
-RESERVATION=reservation_id
-PROJECT=my_project
-ZONE=us-east5-b
-
-python3 xpk.py cluster create \
-  --cluster $CLUSTER_NAME \
-  --num-slices=$NUM_SLICES \
-  --device-type=$DEVICE_TYPE \
-  --zone=$ZONE \
-  --project=$PROJECT \
-  --reservation=$RESERVATION \
-  --enable-autoprovisioning
-```
-
-### Update a previously autoprovisioned cluster with a different number of chips:
-
-* Option 1: By creating a new cluster nodepool configuration.
-
-```shell
-CLUSTER_NAME=my_cluster
-NUM_SLICES=2
-DEVICE_TYPE=v4-16
-RESERVATION=reservation_id
-PROJECT=my_project
-ZONE=us-east5-b
-
-# This will create 2x v4-16 node pools and set the max autoprovisioned chips to 16.
-python3 xpk.py cluster create \
-  --cluster $CLUSTER_NAME \
-  --num-slices=$NUM_SLICES \
-  --device-type=$DEVICE_TYPE \
-  --zone=$ZONE \
-  --project=$PROJECT \
-  --reservation=$RESERVATION \
-  --enable-autoprovisioning
-```
-
-* Option 2: By increasing the `--autoprovisioning-max-chips`.
-```shell -CLUSTER_NAME=my_cluster -NUM_SLICES=0 -DEVICE_TYPE=v4-16 -RESERVATION=reservation_id -PROJECT=my_project -ZONE=us-east5-b - -# This will clear the node pools if they exist in the cluster and set the max autoprovisioned chips to 16 -python3 xpk.py cluster create \ - --cluster $CLUSTER_NAME \ - --num-slices=$NUM_SLICES \ - --device-type=$DEVICE_TYPE \ - --zone=$ZONE \ - --project=$PROJECT \ - --reservation=$RESERVATION \ - --enable-autoprovisioning \ - --autoprovisioning-max-chips 16 -``` - -## Run workloads on the cluster with autoprovisioning: -Reconfigure the `--device-type` and `--num-slices` - ```shell - CLUSTER_NAME=my_cluster - NUM_SLICES=2 - DEVICE_TYPE=v4-8 - NEW_RESERVATION=new_reservation_id - PROJECT=my_project - ZONE=us-east5-b - # Create a 2x v4-8 TPU workload. - python3 xpk.py workload create \ - --cluster $CLUSTER \ - --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \ - --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \ - --device-type=$DEVICE_TYPE \ - --num-slices=$NUM_SLICES \ - --zone=$ZONE \ - --project=$PROJECT - - NUM_SLICES=1 - DEVICE_TYPE=v4-16 - - # Create a 1x v4-16 TPU workload. - python3 xpk.py workload create \ - --cluster $CLUSTER \ - --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \ - --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \ - --device-type=$DEVICE_TYPE \ - --num-slices=$NUM_SLICES \ - --zone=$ZONE \ - --project=$PROJECT - - # Use a different reservation from what the cluster was created with. - python3 xpk.py workload create \ - --cluster $CLUSTER \ - --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \ - --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \ - --device-type=$DEVICE_TYPE \ - --num-slices=$NUM_SLICES \ - --zone=$ZONE \ - --project=$PROJECT \ - --reservation=$NEW_RESERVATION - ``` - -1. (Optional) Define the capacity type. By default, the capacity type will -match with what the cluster was created with. - - ```shell - --reservation=my-reservation-id | --on-demand | --spot - ``` - -2. Set the topology of your workload using --device-type. - - ```shell - NUM_SLICES=1 - DEVICE_TYPE=v4-8 - --device-type=$DEVICE_TYPE \ - --num-slices=$NUM_SLICES \ - ``` - - -# How to add docker images to a xpk workload - -The default behavior is `xpk workload create` will layer the local directory (`--script-dir`) into -the base docker image (`--base-docker-image`) and run the workload command. -If you don't want this layering behavior, you can directly use `--docker-image`. Do not mix arguments from the two flows in the same command. - -## Recommended / Default Docker Flow: `--base-docker-image` and `--script-dir` -This flow pulls the `--script-dir` into the `--base-docker-image` and runs the new docker image. - -* The below arguments are optional by default. xpk will pull the local - directory with a generic base docker image. - - - `--base-docker-image` sets the base image that xpk will start with. - - - `--script-dir` sets which directory to pull into the image. This defaults to the current working directory. - - See `python3 xpk.py workload create --help` for more info. 
- -* Example with defaults which pulls the local directory into the base image: - ```shell - echo -e '#!/bin/bash \n echo "Hello world from a test script!"' > test.sh - python3 xpk.py workload create --cluster xpk-test \ - --workload xpk-test-workload-base-image --command "bash test.sh" \ - --tpu-type=v5litepod-16 --num-slices=1 - ``` - -* Recommended Flow For Normal Sized Jobs (fewer than 10k accelerators): - ```shell - python3 xpk.py workload create --cluster xpk-test \ - --workload xpk-test-workload-base-image --command "bash custom_script.sh" \ - --base-docker-image=gcr.io/your_dependencies_docker_image \ - --tpu-type=v5litepod-16 --num-slices=1 - ``` - -## Optional Direct Docker Image Configuration: `--docker-image` -If a user wants to directly set the docker image used and not layer in the -current working directory, set `--docker-image` to the image to be use in the -workload. - -* Running with `--docker-image`: - ```shell - python3 xpk.py workload create --cluster xpk-test \ - --workload xpk-test-workload-base-image --command "bash test.sh" \ - --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image - ``` - -* Recommended Flow For Large Sized Jobs (more than 10k accelerators): - ```shell - python3 xpk.py cluster cacheimage \ - --cluster xpk-test --docker-image gcr.io/your_docker_image - # Run workload create with the same image. - python3 xpk.py workload create --cluster xpk-test \ - --workload xpk-test-workload-base-image --command "bash test.sh" \ - --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image - ``` - -# More advanced facts: - -* Workload create has two mutually exclusive ways to override the environment of a workload: - * a `--env` flag to specify each environment variable separately. The format is: - - `--env VARIABLE1=value --env VARIABLE2=value` - - * a `--env-file` flag to allow specifying the container's -environment from a file. Usage is the same as Docker's -[--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env) - - Example Env File: - ```shell - LIBTPU_INIT_ARGS=--my-flag=true --performance=high - MY_ENV_VAR=hello - ``` - -* Workload create accepts a --debug-dump-gcs flag which is a path to GCS bucket. -Passing this flag sets the XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/' and uploads -hlo dumps to the specified GCS bucket for each worker. - -# Integration Test Workflows -The repository code is tested through Github Workflows and Actions. Currently three kinds of tests are performed: -* A nightly build that runs every 24 hours -* A build that runs on push to `main` branch -* A build that runs for every PR approval - -More information is documented [here](https://github.com/google/xpk/tree/main/.github/workflows) - -# Troubleshooting - -## `Invalid machine type` for CPUs. -XPK will create a regional GKE cluster. If you see issues like - -```shell -Invalid machine type e2-standard-32 in zone $ZONE_NAME -``` - -Please select a CPU type that exists in all zones in the region. - -```shell -# Find CPU Types supported in zones. -gcloud compute machine-types list --zones=$ZONE_LIST -# Adjust default cpu machine type. -python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ... -``` - -## Workload creation fails - -Some XPK cluster configuration might be missing, if workload creation fails with the below error. 
- -`[XPK] b'error: the server doesn\'t have a resource type "workloads"\n'` - -Mitigate this error by re-running your `xpk.py cluster create ...` command, to refresh the cluster configurations. - -## Permission Issues: `requires one of ["permission_name"] permission(s)`. - -1) Determine the role needed based on the permission error: - - ```shell - # For example: `requires one of ["container.*"] permission(s)` - # Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user. - ``` - -2) Add the role to the user in your project. - - Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli: - ```shell - PROJECT_ID=my-project-id - CURRENT_GKE_USER=$(gcloud config get account) - ROLE=roles/container.admin # container.admin is the role needed for Kubernetes Engine Admin - gcloud projects add-iam-policy-binding $PROJECT_ID --member user:$CURRENT_GKE_USER --role=$ROLE - ``` - -3) Check the permissions are correct for the users. - - Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli: - - ```shell - PROJECT_ID=my-project-id - CURRENT_GKE_USER=$(gcloud config get account) - gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$CURRENT_GKE_USER" --flatten="bindings[].members" - ``` - -4) Confirm you have logged in locally with the correct user. - - ```shell - gcloud auth login - ``` - -### Roles needed based on permission errors: - -* `requires one of ["container.*"] permission(s)` - - Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user. - -* `ERROR: (gcloud.monitoring.dashboards.list) User does not have permission to access projects instance (or it may not exist)` - - Add [Monitoring Viewer](https://cloud.google.com/iam/docs/understanding-roles#monitoring.viewer) to your user. - - -## Reservation Troubleshooting: - -### How to determine your reservation and its size / utilization: - -```shell -PROJECT_ID=my-project -ZONE=us-east5-b -RESERVATION=my-reservation-name -# Find the reservations in your project -gcloud beta compute reservations list --project=$PROJECT_ID -# Find the tpu machine type and current utilization of a reservation. -gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE -``` - -## 403 error on workload create when using `--base-docker-image` flag -You need authority to push to the registry from your local machine. Try running `gcloud auth configure-docker`. -## `Kubernetes API exception` - 404 error -If error of this kind appeared after updating xpk version it's possible that you need to rerun `cluster create` command in order to update resource definitions. - -# TPU Workload Debugging - -## Verbose Logging -If you are having trouble with your workload, try setting the `--enable-debug-logs` when you schedule it. This will give you more detailed logs to help pinpoint the issue. For example: -```shell -python3 xpk.py workload create \ ---cluster --workload xpk-test-workload \ ---command="echo hello world" --enable-debug-logs -``` -Please check [libtpu logging](https://cloud.google.com/tpu/docs/troubleshooting/trouble-tf#debug_logs) and [Tensorflow logging](https://deepreg.readthedocs.io/en/latest/docs/logging.html#tensorflow-logging) for more information about the flags that are enabled to get the logs. 
- -## Collect Stack Traces -[cloud-tpu-diagnostics](https://pypi.org/project/cloud-tpu-diagnostics/) PyPI package can be used to generate stack traces for workloads running in GKE. This package dumps the Python traces when a fault such as segmentation fault, floating-point exception, or illegal operation exception occurs in the program. Additionally, it will also periodically collect stack traces to help you debug situations when the program is unresponsive. You must make the following changes in the docker image running in a Kubernetes main container to enable periodic stack trace collection. -```shell -# main.py - -from cloud_tpu_diagnostics import diagnostic -from cloud_tpu_diagnostics.configuration import debug_configuration -from cloud_tpu_diagnostics.configuration import diagnostic_configuration -from cloud_tpu_diagnostics.configuration import stack_trace_configuration - -stack_trace_config = stack_trace_configuration.StackTraceConfig( - collect_stack_trace = True, - stack_trace_to_cloud = True) -debug_config = debug_configuration.DebugConfig( - stack_trace_config = stack_trace_config) -diagnostic_config = diagnostic_configuration.DiagnosticConfig( - debug_config = debug_config) - -with diagnostic.diagnose(diagnostic_config): - main_method() # this is the main method to run -``` -This configuration will start collecting stack traces inside the `/tmp/debugging` directory on each Kubernetes Pod. - -### Explore Stack Traces -To explore the stack traces collected in a temporary directory in Kubernetes Pod, you can run the following command to configure a sidecar container that will read the traces from `/tmp/debugging` directory. - ```shell - python3 xpk.py workload create \ - --workload xpk-test-workload --command "python3 main.py" --cluster \ - xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar - ``` - -### Get information about jobs, queues and resources. - -To list available resources and queues use ```xpk info``` command. It allows to see localqueues and clusterqueues and check for available resources. - -To see queues with usage and workload info use: -```shell -python3 xpk.py info --cluster my-cluster -``` - -You can specify what kind of resources(clusterqueue or localqueue) you want to see using flags --clusterqueue or --localqueue. -```shell -python3 xpk.py info --cluster my-cluster --localqueue -``` - -# Local testing with Kind - -To facilitate development and testing locally, we have integrated support for testing with `kind`. This enables you to simulate a Kubernetes environment on your local machine. - -## Prerequisites - -- Install kind on your local machine. Follow the official documentation here: [Kind Installation Guide.](https://kind.sigs.k8s.io/docs/user/quick-start#installation) - -## Usage - -xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facilitating the orchestration and management of workloads. Below are the commands for managing clusters: - -### Cluster Create -* Cluster create: - - ```shell - python3 xpk.py kind create \ - --cluster xpk-test - ``` - -### Cluster Delete -* Cluster Delete: - - ```shell - python3 xpk.py kind delete \ - --cluster xpk-test - ``` - -### Cluster List -* Cluster List: - - ```shell - python3 xpk.py kind list - ``` - -## Local Testing Basics - -Local testing is available exclusively through the `batch` and `job` commands of xpk with the `--kind-cluster` flag. 
This allows you to simulate training jobs locally: - -```shell -python xpk.py batch [other-options] --kind-cluster script -``` - -Please note that all other xpk subcommands are intended for use with cloud systems on Google Cloud Engine (GCE) and don't support local testing. This includes commands like cluster, info, inspector, etc. - -# Other advanced usage -[Use a Jupyter notebook to interact with a Cloud TPU cluster](xpk-notebooks.md) \ -[Use Slurm like commands in XPK to execute workloads on top of GKE](xpk-slurm-commands.md) +XPK also supports [Google Cloud Storage solutions](./docs/usage/storage.md): +* [Cloud Storage FUSE](./docs/usage/storage.md#fuse) +* [Filestore](./docs/usage/storage.md#filestore) +* [Parallelstore](./docs/usage/storage.md#parallelstore) +* [Block storage (Persistent Disk, Hyperdisk)](./docs/usage/storage.md#block-storage-persistent-disk-hyperdisk) + +# Documentation + +* [Permissions](./docs/permissions.md) +* [Installation](./docs/installation.md) +* [Usage](./docs/usage/) + * [Clusters](./docs/usage/clusters.md) + * [Workloads](./docs/usage/workloads.md) + * [Storage](./docs/usage/storage.md) + * [GPU](./docs/usage/gpu.md) + * [CPU](./docs/usage/cpu.md) + * [Autoprovisioning](./docs/usage/autoprovisioning.md) + * [Docker](./docs/usage/docker.md) + * [Advanced](./docs/usage/advanced.md) + * [Inspector](./docs/usage/inspector.md) + * [Run](./docs/usage/run.md) + * [Job](./docs/usage/job.md) +* [Troubleshooting](./docs/troubleshooting.md) +* [Local Testing](./docs/local_testing.md) + +# Contributing + +Please read [`contributing.md`](./docs/contributing.md) for details on our code of conduct, and the process for submitting pull requests to us. + +# License + +This project is licensed under the Apache License 2.0 - see the [`LICENSE`](./LICENSE) file for details \ No newline at end of file diff --git a/docs/installation.md b/docs/installation.md new file mode 100644 index 000000000..f40f5de3a --- /dev/null +++ b/docs/installation.md @@ -0,0 +1,89 @@ + +# Installation + +There are 2 ways to install XPK: + +- via Python package installer (`pip`), +- clone from git and build from source. 
+ +## Prerequisites + +The following tools must be installed: + +- python >= 3.10: download from [here](https://www.python.org/downloads/) +- pip: [installation instructions](https://pip.pypa.io/en/stable/installation/) +- python venv: [installation instructions](https://virtualenv.pypa.io/en/latest/installation.html) +(all three of above can be installed at once from [here](https://packaging.python.org/en/latest/guides/installing-using-linux-tools/#installing-pip-setuptools-wheel-with-linux-package-managers)) +- gcloud: install from [here](https://cloud.google.com/sdk/gcloud#download_and_install_the) and then: + - Run `gcloud init` + - [Authenticate](https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login) to Google Cloud +- kubectl: install from [here](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_kubectl) and then: + - Install `gke-gcloud-auth-plugin` from [here](https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-access-for-kubectl#install_plugin) +- docker: [installation instructions](https://docs.docker.com/engine/install/) and then: + - Configure sudoless docker: [guide](https://docs.docker.com/engine/install/linux-postinstall/) + - Run `gcloud auth configure-docker` to ensure images can be uploaded to registry + +### Additional prerequisites when installing from pip + +- kueuectl: install from [here](https://kueue.sigs.k8s.io/docs/reference/kubectl-kueue/installation/) +- kjob: installation instructions [here](https://github.com/kubernetes-sigs/kjob/blob/main/docs/installation.md) + +### Additional prerequisites when installing from source + +- git: [installation instructions](https://git-scm.com/downloads/linux) +- make: install by running `apt-get -y install make` (`sudo` might be required) + +## Installation via pip + +To install XPK using pip, first install required tools mentioned in [prerequisites](#prerequisites) and [additional prerequisites](#additional-prerequisites-when-installing-from-pip). Then you can install XPK simply by running: + +```shell +pip install xpk +``` + +If you see an error saying: `This environment is externally managed`, please use a virtual environment. For example: + +```shell +# One time step of creating the virtual environment +VENV_DIR=~/venvp3 +python3 -m venv $VENV_DIR + +# Activate your virtual environment +source $VENV_DIR/bin/activate + +# Install XPK in virtual environment using pip +pip install xpk +``` + +## Installation from source + +To install XPK from source, first install required tools mentioned in [prerequisites](#prerequisites) and [additional prerequisites](#additional-prerequisites-when-installing-from-source). Afterwards you can install XPK from source using `make` + +```shell +# Clone the XPK repository +git clone https://github.com/google/xpk.git +cd xpk + +# Install required dependencies and build XPK with make +make install && export PATH=$PATH:$PWD/bin +``` + +If you want the dependecies to be available in your PATH please run: `echo $PWD/bin` and add its value to `PATH` in .bashrc or .zshrc file. + +If you see an error saying: `This environment is externally managed`, please use a virtual environment. 
For example: + +```shell +# One time step of creating the virtual environment +VENV_DIR=~/venvp3 +python3 -m venv $VENV_DIR + +# Activate your virtual environment +source $VENV_DIR/bin/activate + +# Clone the XPK repository +git clone https://github.com/google/xpk.git +cd xpk + +# Install required dependencies and build XPK with make +make install && export PATH=$PATH:$PWD/bin +``` diff --git a/docs/local_testing.md b/docs/local_testing.md new file mode 100644 index 000000000..e88af0f12 --- /dev/null +++ b/docs/local_testing.md @@ -0,0 +1,46 @@ + +# Local testing with Kind + +To facilitate development and testing locally, we have integrated support for testing with `kind`. This enables you to simulate a Kubernetes environment on your local machine. + +## Prerequisites + +- Install kind on your local machine. Follow the official documentation here: [Kind Installation Guide.](https://kind.sigs.k8s.io/docs/user/quick-start#installation) + +## Usage + +xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facilitating the orchestration and management of workloads. Below are the commands for managing clusters: + +### Cluster Create +* Cluster create: + + ```shell + python3 xpk.py kind create \ + --cluster xpk-test + ``` + +### Cluster Delete +* Cluster Delete: + + ```shell + python3 xpk.py kind delete \ + --cluster xpk-test + ``` + +### Cluster List +* Cluster List: + + ```shell + python3 xpk.py kind list + ``` + +## Local Testing Basics + +Local testing is available exclusively through the `batch` and `job` commands of xpk with the `--kind-cluster` flag. This allows you to simulate training jobs locally: + +```shell +python xpk.py batch [other-options] --kind-cluster script +``` + +Please note that all other xpk subcommands are intended for use with cloud systems on Google Cloud Engine (GCE) and don't support local testing. This includes commands like cluster, info, inspector, etc. + diff --git a/docs/permissions.md b/docs/permissions.md new file mode 100644 index 000000000..b26d4da43 --- /dev/null +++ b/docs/permissions.md @@ -0,0 +1,12 @@ + +# Permissions needed on Cloud Console: + +* Artifact Registry Writer +* Compute Admin +* Kubernetes Engine Admin +* Logging Admin +* Monitoring Admin +* Service Account User +* Storage Admin +* Vertex AI Administrator +* Filestore Editor (This role is neccessary if you want to run `storage create` command with `--type=gcpfilestore`) diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md new file mode 100644 index 000000000..50ac70766 --- /dev/null +++ b/docs/troubleshooting.md @@ -0,0 +1,149 @@ + +# Troubleshooting + +## `Invalid machine type` for CPUs. +XPK will create a regional GKE cluster. If you see issues like + +```shell +Invalid machine type e2-standard-32 in zone $ZONE_NAME +``` + +Please select a CPU type that exists in all zones in the region. + +```shell +# Find CPU Types supported in zones. +gcloud compute machine-types list --zones=$ZONE_LIST +# Adjust default cpu machine type. +python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ... +``` + +## Workload creation fails + +Some XPK cluster configuration might be missing, if workload creation fails with the below error. + +`[XPK] b'error: the server doesn\'t have a resource type "workloads"\n'` + +Mitigate this error by re-running your `xpk.py cluster create ...` command, to refresh the cluster configurations. + +## Permission Issues: `requires one of ["permission_name"] permission(s)`. 
+ +1) Determine the role needed based on the permission error: + + ```shell + # For example: `requires one of ["container.*"] permission(s)` + # Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user. + ``` + +2) Add the role to the user in your project. + + Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli: + ```shell + PROJECT_ID=my-project-id + CURRENT_GKE_USER=$(gcloud config get account) + ROLE=roles/container.admin # container.admin is the role needed for Kubernetes Engine Admin + gcloud projects add-iam-policy-binding $PROJECT_ID --member user:$CURRENT_GKE_USER --role=$ROLE + ``` + +3) Check the permissions are correct for the users. + + Go to [iam-admin](https://console.cloud.google.com/iam-admin/) or use gcloud cli: + + ```shell + PROJECT_ID=my-project-id + CURRENT_GKE_USER=$(gcloud config get account) + gcloud projects get-iam-policy $PROJECT_ID --filter="bindings.members:$CURRENT_GKE_USER" --flatten="bindings[].members" + ``` + +4) Confirm you have logged in locally with the correct user. + + ```shell + gcloud auth login + ``` + +### Roles needed based on permission errors: + +* `requires one of ["container.*"] permission(s)` + + Add [Kubernetes Engine Admin](https://cloud.google.com/iam/docs/understanding-roles#kubernetes-engine-roles) to your user. + +* `ERROR: (gcloud.monitoring.dashboards.list) User does not have permission to access projects instance (or it may not exist)` + + Add [Monitoring Viewer](https://cloud.google.com/iam/docs/understanding-roles#monitoring.viewer) to your user. + + +## Reservation Troubleshooting: + +### How to determine your reservation and its size / utilization: + +```shell +PROJECT_ID=my-project +ZONE=us-east5-b +RESERVATION=my-reservation-name +# Find the reservations in your project +gcloud beta compute reservations list --project=$PROJECT_ID +# Find the tpu machine type and current utilization of a reservation. +gcloud beta compute reservations describe $RESERVATION --project=$PROJECT_ID --zone=$ZONE +``` + +## 403 error on workload create when using `--base-docker-image` flag +You need authority to push to the registry from your local machine. Try running `gcloud auth configure-docker`. +## `Kubernetes API exception` - 404 error +If error of this kind appeared after updating xpk version it's possible that you need to rerun `cluster create` command in order to update resource definitions. + +# TPU Workload Debugging + +## Verbose Logging +If you are having trouble with your workload, try setting the `--enable-debug-logs` when you schedule it. This will give you more detailed logs to help pinpoint the issue. For example: +```shell +python3 xpk.py workload create \ +--cluster --workload xpk-test-workload \ +--command="echo hello world" --enable-debug-logs +``` +Please check [libtpu logging](https://cloud.google.com/tpu/docs/troubleshooting/trouble-tf#debug_logs) and [Tensorflow logging](https://deepreg.readthedocs.io/en/latest/logging.html#tensorflow-logging) for more information about the flags that are enabled to get the logs. + +## Collect Stack Traces +[cloud-tpu-diagnostics](https://pypi.org/project/cloud-tpu-diagnostics/) PyPI package can be used to generate stack traces for workloads running in GKE. This package dumps the Python traces when a fault such as segmentation fault, floating-point exception, or illegal operation exception occurs in the program. 
Additionally, it will also periodically collect stack traces to help you debug situations when the program is unresponsive. You must make the following changes in the docker image running in a Kubernetes main container to enable periodic stack trace collection. +```shell +# main.py + +from cloud_tpu_diagnostics import diagnostic +from cloud_tpu_diagnostics.configuration import debug_configuration +from cloud_tpu_diagnostics.configuration import diagnostic_configuration +from cloud_tpu_diagnostics.configuration import stack_trace_configuration + +stack_trace_config = stack_trace_configuration.StackTraceConfig( + collect_stack_trace = True, + stack_trace_to_cloud = True) +debug_config = debug_configuration.DebugConfig( + stack_trace_config = stack_trace_config) +diagnostic_config = diagnostic_configuration.DiagnosticConfig( + debug_config = debug_config) + +with diagnostic.diagnose(diagnostic_config): + main_method() # this is the main method to run +``` +This configuration will start collecting stack traces inside the `/tmp/debugging` directory on each Kubernetes Pod. + +### Explore Stack Traces +To explore the stack traces collected in a temporary directory in Kubernetes Pod, you can run the following command to configure a sidecar container that will read the traces from `/tmp/debugging` directory. + ```shell +python3 xpk.py workload create \ + --workload xpk-test-workload --command "python3 main.py" --cluster \ + xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar + ``` + +### Get information about jobs, queues and resources. + +To list available resources and queues use ```xpk info``` command. It allows to see localqueues and clusterqueues and check for available resources. + +To see queues with usage and workload info use: +```shell +python3 xpk.py info --cluster my-cluster +``` + +You can specify what kind of resources(clusterqueue or localqueue) you want to see using flags --clusterqueue or --localqueue. +```shell +python3 xpk.py info --cluster my-cluster --localqueue +``` + +``` diff --git a/docs/usage/advanced.md b/docs/usage/advanced.md new file mode 100644 index 000000000..9ae900701 --- /dev/null +++ b/docs/usage/advanced.md @@ -0,0 +1,21 @@ + +# More advanced facts: + +* Workload create has two mutually exclusive ways to override the environment of a workload: + * a `--env` flag to specify each environment variable separately. The format is: + + `--env VARIABLE1=value --env VARIABLE2=value` + + * a `--env-file` flag to allow specifying the container's +environment from a file. Usage is the same as Docker's +[--env-file flag](https://docs.docker.com/engine/reference/commandline/run/#env) + + Example Env File: + ```shell + LIBTPU_INIT_ARGS=--my-flag=true --performance=high + MY_ENV_VAR=hello + ``` + +* Workload create accepts a --debug-dump-gcs flag which is a path to GCS bucket. +Passing this flag sets the XLA_FLAGS='--xla_dump_to=/tmp/xla_dump/' and uploads +hlo dumps to the specified GCS bucket for each worker. diff --git a/docs/usage/autoprovisioning.md b/docs/usage/autoprovisioning.md new file mode 100644 index 000000000..1d5b8af1d --- /dev/null +++ b/docs/usage/autoprovisioning.md @@ -0,0 +1,173 @@ +# Autoprovisioning with XPK +XPK can dynamically allocate cluster capacity using [Node Auto Provisioning, (NAP)](https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#use_accelerators_for_new_auto-provisioned_node_pools) support. 
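
If you are not sure whether Node Auto Provisioning is already enabled on an existing cluster, a quick check is possible with `gcloud` (a sketch; it assumes your `gcloud` account can describe the cluster and that `$REGION` is the region containing the zone used at cluster creation):

```shell
# Prints True when Node Auto Provisioning is enabled on the cluster, empty or False otherwise.
gcloud container clusters describe $CLUSTER_NAME \
  --project=$PROJECT --region=$REGION \
  --format="value(autoscaling.enableNodeAutoprovisioning)"
```
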
+ +Allow several topology sizes to be supported from one XPK cluster, and be dynamically provisioned based on incoming workload requests. XPK users will not need to re-provision the cluster manually. + +Enabling autoprovisioning will take the cluster around initially up to **30 minutes to upgrade**. + +## Create a cluster with autoprovisioning: + +Autoprovisioning will be enabled on the below cluster with [0, 8] chips of v4 TPU (up to 1xv4-16) to scale +between. + +XPK doesn't currently support different generations of accelerators in the same cluster (like v4 and v5p TPUs). + +```shell +CLUSTER_NAME=my_cluster +NUM_SLICES=2 +DEVICE_TYPE=v4-8 +RESERVATION=reservation_id +PROJECT=my_project +ZONE=us-east5-b + +python3 xpk.py cluster create \ + --cluster $CLUSTER_NAME \ + --num-slices=$NUM_SLICES \ + --device-type=$DEVICE_TYPE \ + --zone=$ZONE \ + --project=$PROJECT \ + --reservation=$RESERVATION \ + --enable-autoprovisioning +``` + +1. Define the starting accelerator configuration and capacity type. + + ```shell + --device-type=$DEVICE_TYPE \ + --num-slice=$NUM_SLICES + ``` +2. Optionally set custom `minimum` / `maximum` chips. NAP will rescale the cluster with `maximum` - `minimum` chips. By default, `maximum` is set to the current cluster configuration size, and `minimum` is set to 0. This allows NAP to rescale with all the resources. + + ```shell + --autoprovisioning-min-chips=$MIN_CHIPS \ + --autoprovisioning-max-chips=$MAX_CHIPS + ``` + +3. `FEATURE TO COME SOON:` Set the timeout period for when node pools will automatically be deleted +if no incoming workloads are run. This is 10 minutes currently. + +4. `FEATURE TO COME:` Set the timeout period to infinity. This will keep the idle node pool configuration always running until updated by new workloads. + +### Update a cluster with autoprovisioning: +```shell +CLUSTER_NAME=my_cluster +NUM_SLICES=2 +DEVICE_TYPE=v4-8 +RESERVATION=reservation_id +PROJECT=my_project +ZONE=us-east5-b + +python3 xpk.py cluster create \ + --cluster $CLUSTER_NAME \ + --num-slices=$NUM_SLICES \ + --device-type=$DEVICE_TYPE \ + --zone=$ZONE \ + --project=$PROJECT \ + --reservation=$RESERVATION \ + --enable-autoprovisioning +``` + +### Update a previously autoprovisioned cluster with different amount of chips: + +* Option 1: By creating a new cluster nodepool configuration. + +```shell +CLUSTER_NAME=my_cluster +NUM_SLICES=2 +DEVICE_TYPE=v4-16 +RESERVATION=reservation_id +PROJECT=my_project +ZONE=us-east5-b + +# This will create 2x v4-16 node pools and set the max autoprovisioned chips to 16. +python3 xpk.py cluster create \ + --cluster $CLUSTER_NAME \ + --num-slices=$NUM_SLICES \ + --device-type=$DEVICE_TYPE \ + --zone=$ZONE \ + --project=$PROJECT \ + --reservation=$RESERVATION \ + --enable-autoprovisioning +``` + +* Option 2: By increasing the `--autoprovisioning-max-chips`. 
+```shell +CLUSTER_NAME=my_cluster +NUM_SLICES=0 +DEVICE_TYPE=v4-16 +RESERVATION=reservation_id +PROJECT=my_project +ZONE=us-east5-b + +# This will clear the node pools if they exist in the cluster and set the max autoprovisioned chips to 16 +python3 xpk.py cluster create \ + --cluster $CLUSTER_NAME \ + --num-slices=$NUM_SLICES \ + --device-type=$DEVICE_TYPE \ + --zone=$ZONE \ + --project=$PROJECT \ + --reservation=$RESERVATION \ + --enable-autoprovisioning \ + --autoprovisioning-max-chips 16 +``` + +## Run workloads on the cluster with autoprovisioning: +Reconfigure the `--device-type` and `--num-slices` + ```shell + CLUSTER_NAME=my_cluster +NUM_SLICES=2 +DEVICE_TYPE=v4-8 +NEW_RESERVATION=new_reservation_id +PROJECT=my_project +ZONE=us-east5-b +# Create a 2x v4-8 TPU workload. +python3 xpk.py workload create \ + --cluster $CLUSTER \ + --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \ + --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \ + --device-type=$DEVICE_TYPE \ + --num-slices=$NUM_SLICES \ + --zone=$ZONE \ + --project=$PROJECT + +NUM_SLICES=1 +DEVICE_TYPE=v4-16 + +# Create a 1x v4-16 TPU workload. +python3 xpk.py workload create \ + --cluster $CLUSTER \ + --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \ + --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \ + --device-type=$DEVICE_TYPE \ + --num-slices=$NUM_SLICES \ + --zone=$ZONE \ + --project=$PROJECT + +# Use a different reservation from what the cluster was created with. +python3 xpk.py workload create \ + --cluster $CLUSTER \ + --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \ + --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \ + --device-type=$DEVICE_TYPE \ + --num-slices=$NUM_SLICES \ + --zone=$ZONE \ + --project=$PROJECT \ + --reservation=$NEW_RESERVATION + ``` + +1. (Optional) Define the capacity type. By default, the capacity type will +match with what the cluster was created with. + + ```shell + --reservation=my-reservation-id | --on-demand | --spot + ``` + +2. Set the topology of your workload using --device-type. + + ```shell + NUM_SLICES=1 +DEVICE_TYPE=v4-8 + --device-type=$DEVICE_TYPE \ + --num-slices=$NUM_SLICES \ + ``` + diff --git a/docs/usage/clusters.md b/docs/usage/clusters.md new file mode 100644 index 000000000..54962fecd --- /dev/null +++ b/docs/usage/clusters.md @@ -0,0 +1,261 @@ +## Cluster Create + +First set the project and zone through gcloud config or xpk arguments. + +```shell +PROJECT_ID=my-project-id +ZONE=us-east5-b +# gcloud config: +gcloud config set project $PROJECT_ID +gcloud config set compute/zone $ZONE +# xpk arguments +xpk .. --zone $ZONE --project $PROJECT_ID +``` + +The cluster created is a regional cluster to enable the GKE control plane across +all zones. + +* Cluster Create (provision reserved capacity): + + ```shell + # Find your reservations + gcloud compute reservations list --project=$PROJECT_ID + # Run cluster create with reservation. 
+ python3 xpk.py cluster create \ + --cluster xpk-test --tpu-type=v5litepod-256 \ + --num-slices=2 \ + --reservation=$RESERVATION_ID + ``` + +* Cluster Create (provision on-demand capacity): + + ```shell + python3 xpk.py cluster create \ + --cluster xpk-test --tpu-type=v5litepod-16 \ + --num-slices=4 --on-demand + ``` + +* Cluster Create (provision spot / preemptable capacity): + + ```shell + python3 xpk.py cluster create \ + --cluster xpk-test --tpu-type=v5litepod-16 \ + --num-slices=4 --spot + ``` + +* Cluster Create (DWS flex queued capacity): + ```shell + python3 xpk.py cluster create \ + --cluster xpk-test --tpu-type=v5litepod-16 \ + --num-slices=4 --flex + ``` + +* Cluster Create for Pathways: +Pathways compatible cluster can be created using `cluster create-pathways`. + ```shell + python3 xpk.py cluster create-pathways \ + --cluster xpk-pw-test \ + --num-slices=4 --on-demand \ + --tpu-type=v5litepod-16 + ``` + Note that Pathways clusters need a CPU nodepool of n2-standard-64 or higher. + +* Cluster Create for Ray: + A cluster with KubeRay enabled and a RayCluster can be created using `cluster create-ray`. + ```shell + python3 xpk.py cluster create-ray \ + --cluster xpk-rc-test \ + --ray-version=2.39.0 \ + --num-slices=4 --on-demand \ + --tpu-type=v5litepod-8 + ``` + +* Cluster Create can be called again with the same `--cluster name` to modify + the number of slices or retry failed steps. + + For example, if a user creates a cluster with 4 slices: + + ```shell + python3 xpk.py cluster create \ + --cluster xpk-test --tpu-type=v5litepod-16 \ + --num-slices=4 --reservation=$RESERVATION_ID + ``` + + and recreates the cluster with 8 slices. The command will rerun to create 4 + new slices: + + ```shell + python3 xpk.py cluster create \ + --cluster xpk-test --tpu-type=v5litepod-16 \ + --num-slices=8 --reservation=$RESERVATION_ID + ``` + + and recreates the cluster with 6 slices. The command will rerun to delete 2 + slices. The command will warn the user when deleting slices. + Use `--force` to skip prompts. + + ```shell + python3 xpk.py cluster create \ + --cluster xpk-test --tpu-type=v5litepod-16 \ + --num-slices=6 --reservation=$RESERVATION_ID + + # Skip delete prompts using --force. + + python3 xpk.py cluster create --force \ + --cluster xpk-test --tpu-type=v5litepod-16 \ + --num-slices=6 --reservation=$RESERVATION_ID + ``` + + and recreates the cluster with 4 slices of v4-8. The command will rerun to delete + 6 slices of v5litepod-16 and create 4 slices of v4-8. The command will warn the + user when deleting slices. Use `--force` to skip prompts. + + ```shell + python3 xpk.py cluster create \ + --cluster xpk-test --tpu-type=v4-8 \ + --num-slices=4 --reservation=$RESERVATION_ID + + # Skip delete prompts using --force. + + python3 xpk.py cluster create --force \ + --cluster xpk-test --tpu-type=v4-8 \ + --num-slices=4 --reservation=$RESERVATION_ID + ``` + +### Create Private Cluster + +XPK allows you to create a private GKE cluster for enhanced security. In a private cluster, nodes and pods are isolated from the public internet, providing an additional layer of protection for your workloads. + +To create a private cluster, use the following arguments: + +**`--private`** + +This flag enables the creation of a private GKE cluster. When this flag is set: + +* Nodes and pods are isolated from the direct internet access. +* `master_authorized_networks` is automatically enabled. +* Access to the cluster's control plane is restricted to your current machine's IP address by default. 
+ +**`--authorized-networks`** + +This argument allows you to specify additional IP ranges (in CIDR notation) that are authorized to access the private cluster's control plane and perform `kubectl` commands. + +* Even if this argument is not set when you have `--private`, your current machine's IP address will always be given access to the control plane. +* If this argument is used with an existing private cluster, it will replace the existing authorized networks. + +**Example Usage:** + +* To create a private cluster and allow access to Control Plane only to your current machine: + + ```shell + python3 xpk.py cluster create \ + --cluster=xpk-private-cluster \ + --tpu-type=v4-8 --num-slices=2 \ + --private + ``` + +* To create a private cluster and allow access to Control Plane only to your current machine and the IP ranges `1.2.3.0/24` and `1.2.4.5/32`: + + ```shell + python3 xpk.py cluster create \ + --cluster=xpk-private-cluster \ + --tpu-type=v4-8 --num-slices=2 \ + --authorized-networks 1.2.3.0/24 1.2.4.5/32 + + # --private is optional when you set --authorized-networks + ``` + +> **Important Notes:** +> * The argument `--private` is only applicable when creating new clusters. You cannot convert an existing public cluster to a private cluster using these flags. +> * The argument `--authorized-networks` is applicable when creating new clusters or using an existing _*private*_ cluster. You cannot convert an existing public cluster to a private cluster using these flags. +> * You need to [set up a Cluster NAT for your VPC network](https://cloud.google.com/nat/docs/set-up-manage-network-address-translation#creating_nat) so that the Nodes and Pods have outbound access to the internet. This is required because XPK installs and configures components such as kueue that need access to external sources like `registry.k8.io`. + + +### Create Vertex AI Tensorboard +*Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have +[Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role +assigned to your user account.* + +Vertex AI Tensorboard is a fully managed version of open-source Tensorboard. To learn more about Vertex AI Tensorboard, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction). Note that Vertex AI Tensorboard is only available in [these](https://cloud.google.com/vertex-ai/docs/general/locations#available-regions) regions. + +You can create a Vertex AI Tensorboard for your cluster with `Cluster Create` command. XPK will create a single Vertex AI Tensorboard instance per cluster. + +* Create Vertex AI Tensorboard in default region with default Tensorboard name: + +```shell +python3 xpk.py cluster create \ +--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \ +--create-vertex-tensorboard +``` + +will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*-tb-instance*) in `us-central1` (*default region*). + +* Create Vertex AI Tensorboard in user-specified region with default Tensorboard name: + +```shell +python3 xpk.py cluster create \ +--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \ +--create-vertex-tensorboard --tensorboard-region=us-west1 +``` + +will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (*-tb-instance*) in `us-west1`. 
+ +* Create Vertex AI Tensorboard in default region with user-specified Tensorboard name: + +```shell +python3 xpk.py cluster create \ +--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \ +--create-vertex-tensorboard --tensorboard-name=tb-testing +``` + +will create a Vertex AI Tensorboard with the name `tb-testing` in `us-central1`. + +* Create Vertex AI Tensorboard in user-specified region with user-specified Tensorboard name: + +```shell +python3 xpk.py cluster create \ +--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \ +--create-vertex-tensorboard --tensorboard-region=us-west1 --tensorboard-name=tb-testing +``` + +will create a Vertex AI Tensorboard instance with the name `tb-testing` in `us-west1`. + +* Create Vertex AI Tensorboard in an unsupported region: + +```shell +python3 xpk.py cluster create \ +--cluster xpk-test --num-slices=1 --tpu-type=v4-8 \ +--create-vertex-tensorboard --tensorboard-region=us-central2 +``` + +will fail the cluster creation process because Vertex AI Tensorboard is not supported in `us-central2`. + +## Cluster Delete +* Cluster Delete (deprovision capacity): + + ```shell + python3 xpk.py cluster delete \ + --cluster xpk-test + ``` +## Cluster List +* Cluster List (see provisioned capacity): + + ```shell + python3 xpk.py cluster list + ``` +## Cluster Describe +* Cluster Describe (see capacity): + + ```shell + python3 xpk.py cluster describe \ + --cluster xpk-test + ``` + +## Cluster Cacheimage +* Cluster Cacheimage (enables faster start times): + + ```shell + python3 xpk.py cluster cacheimage \ + --cluster xpk-test --docker-image gcr.io/your_docker_image \ + --tpu-type=v5litepod-16 + ``` diff --git a/docs/usage/cpu.md b/docs/usage/cpu.md new file mode 100644 index 000000000..9197bdf95 --- /dev/null +++ b/docs/usage/cpu.md @@ -0,0 +1,30 @@ +## CPU usage + +In order to use XPK for CPU, you can do so by using `device-type` flag. + +* Cluster Create (provision on-demand capacity): + + ```shell + # Run cluster create with on demand capacity. + python3 xpk.py cluster create \ + --cluster xpk-test \ + --device-type=n2-standard-32-256 \ + --num-slices=1 \ + --default-pool-cpu-machine-type=n2-standard-32 \ + --on-demand + ``` + Note that `device-type` for CPUs is of the format -, thus in the above example, user requests for 256 VMs of type n2-standard-32. + Currently workloads using < 1000 VMs are supported. + +* Run a workload: + + ```shell + # Submit a workload + python3 xpk.py workload create \ + --cluster xpk-test \ + --num-slices=1 \ + --device-type=n2-standard-32-256 \ + --workload xpk-test-workload \ + --command="echo hello world" + ``` + diff --git a/docs/usage/docker.md b/docs/usage/docker.md new file mode 100644 index 000000000..0c193cb9a --- /dev/null +++ b/docs/usage/docker.md @@ -0,0 +1,56 @@ +# How to add docker images to a xpk workload + +The default behavior is `xpk workload create` will layer the local directory (`--script-dir`) into +the base docker image (`--base-docker-image`) and run the workload command. +If you don't want this layering behavior, you can directly use `--docker-image`. Do not mix arguments from the two flows in the same command. + +## Recommended / Default Docker Flow: `--base-docker-image` and `--script-dir` +This flow pulls the `--script-dir` into the `--base-docker-image` and runs the new docker image. + +* The below arguments are optional by default. xpk will pull the local + directory with a generic base docker image. + + - `--base-docker-image` sets the base image that xpk will start with. 
+ + - `--script-dir` sets which directory to pull into the image. This defaults to the current working directory. + + See `python3 xpk.py workload create --help` for more info. + +* Example with defaults which pulls the local directory into the base image: + ```shell + echo -e '#!/bin/bash + echo "Hello world from a test script!"' > test.sh +python3 xpk.py workload create --cluster xpk-test \ +--workload xpk-test-workload-base-image --command "bash test.sh" \ +--tpu-type=v5litepod-16 --num-slices=1 + ``` + +* Recommended Flow For Normal Sized Jobs (fewer than 10k accelerators): + ```shell + python3 xpk.py workload create --cluster xpk-test \ +--workload xpk-test-workload-base-image --command "bash custom_script.sh" \ +--base-docker-image=gcr.io/your_dependencies_docker_image \ +--tpu-type=v5litepod-16 --num-slices=1 + ``` + +## Optional Direct Docker Image Configuration: `--docker-image` +If a user wants to directly set the docker image used and not layer in the +current working directory, set `--docker-image` to the image to be use in the +workload. + +* Running with `--docker-image`: + ```shell + python3 xpk.py workload create --cluster xpk-test \ +--workload xpk-test-workload-base-image --command "bash test.sh" \ +--tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image + ``` + +* Recommended Flow For Large Sized Jobs (more than 10k accelerators): + ```shell + python3 xpk.py cluster cacheimage \ +--cluster xpk-test --docker-image gcr.io/your_docker_image +# Run workload create with the same image. +python3 xpk.py workload create --cluster xpk-test \ +--workload xpk-test-workload-base-image --command "bash test.sh" \ +--tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image + ``` diff --git a/docs/usage/gpu.md b/docs/usage/gpu.md new file mode 100644 index 000000000..0e6072074 --- /dev/null +++ b/docs/usage/gpu.md @@ -0,0 +1,104 @@ +## GPU usage + +In order to use XPK for GPU, you can do so by using `device-type` flag. + +* Cluster Create (provision reserved capacity): + + ```shell + # Find your reservations + gcloud compute reservations list --project=$PROJECT_ID + + # Run cluster create with reservation. 
+ python3 xpk.py cluster create \ + --cluster xpk-test --device-type=h100-80gb-8 \ + --num-nodes=2 \ + --reservation=$RESERVATION_ID + ``` + +* Cluster Delete (deprovision capacity): + + ```shell + python3 xpk.py cluster delete \ + --cluster xpk-test + ``` + +* Cluster List (see provisioned capacity): + + ```shell + python3 xpk.py cluster list + ``` + +* Cluster Describe (see capacity): + + ```shell + python3 xpk.py cluster describe \ + --cluster xpk-test + ``` + + +* Cluster Cacheimage (enables faster start times): + + ```shell + python3 xpk.py cluster cacheimage \ + --cluster xpk-test --docker-image gcr.io/your_docker_image \ + --device-type=h100-80gb-8 + ``` + + +* [Install NVIDIA GPU device drivers](https://cloud.google.com/container-optimized-os/docs/how-to/run-gpus#install) + ```shell + # List available driver versions + gcloud compute ssh $NODE_NAME --command "sudo cos-extensions list" + + # Install the default driver + gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu" + # OR install a specific version of the driver + gcloud compute ssh $NODE_NAME --command "sudo cos-extensions install gpu -- -version=DRIVER_VERSION" + ``` + +* Run a workload: + + ```shell + # Submit a workload + python3 xpk.py workload create \ + --cluster xpk-test --device-type h100-80gb-8 \ + --workload xpk-test-workload \ + --command="echo hello world" + ``` + +* Workload Delete (delete training job): + + ```shell + python3 xpk.py workload delete \ + --workload xpk-test-workload --cluster xpk-test + ``` + + This will only delete `xpk-test-workload` workload in `xpk-test` cluster. + +* Workload Delete (delete all training jobs in the cluster): + + ```shell + python3 xpk.py workload delete \ + --cluster xpk-test + ``` + + This will delete all the workloads in `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt. + +* Workload Delete supports filtering. Delete a portion of jobs that match user criteria. + * Filter by Job: `filter-by-job` + + ```shell + python3 xpk.py workload delete \ + --cluster xpk-test --filter-by-job=$USER + ``` + + This will delete all the workloads in `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt. + + * Filter by Status: `filter-by-status` + + ```shell + python3 xpk.py workload delete \ + --cluster xpk-test --filter-by-status=QUEUED + ``` + + This will delete all the workloads in `xpk-test` cluster that have the status as Admitted or Evicted, and the number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`,`FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`. diff --git a/docs/usage/inspector.md b/docs/usage/inspector.md new file mode 100644 index 000000000..cc2e79122 --- /dev/null +++ b/docs/usage/inspector.md @@ -0,0 +1,42 @@ +## Inspector +* Inspector provides debug info to understand cluster health, and why workloads are not running. +Inspector output is saved to a file. + + ```shell + python3 xpk.py inspector \ + --cluster $CLUSTER_NAME \ + --project $PROJECT_ID \ + --zone $ZONE + ``` + +* Optional Arguments + * `--print-to-terminal`: + Print command output to terminal as well as a file. + * `--workload $WORKLOAD_NAME` + Inspector will write debug info related to the workload:`$WORKLOAD_NAME` + +* Example Output: + + The output of xpk inspector is in `/tmp/tmp0pd6_k1o` in this example. + ```shell + [XPK] Starting xpk + [XPK] Task: `Set Cluster` succeeded. 
+ [XPK] Task: `Local Setup: gcloud version` is implemented by `gcloud version`, hiding output unless there is an error. + [XPK] Task: `Local Setup: Project / Zone / Region` is implemented by `gcloud config get project; gcloud config get compute/zone; gcloud config get compute/region`, hiding output unless there is an error. + [XPK] Task: `GKE: Cluster Details` is implemented by `gcloud beta container clusters list --project $PROJECT --region $REGION | grep -e NAME -e $CLUSTER_NAME`, hiding output unless there is an error. + [XPK] Task: `GKE: Node pool Details` is implemented by `gcloud beta container node-pools list --cluster $CLUSTER_NAME --project=$PROJECT --region=$REGION`, hiding output unless there is an error. + [XPK] Task: `Kubectl: All Nodes` is implemented by `kubectl get node -o custom-columns='NODE_NAME:metadata.name, READY_STATUS:.status.conditions[?(@.type=="Ready")].status, NODEPOOL:metadata.labels.cloud.google.com/gke-nodepool'`, hiding output unless there is an error. + [XPK] Task: `Kubectl: Number of Nodes per Node Pool` is implemented by `kubectl get node -o custom-columns=':metadata.labels.cloud.google.com/gke-nodepool' | sort | uniq -c`, hiding output unless there is an error. + [XPK] Task: `Kubectl: Healthy Node Count Per Node Pool` is implemented by `kubectl get node -o custom-columns='NODE_NAME:metadata.name, READY_STATUS:.status.conditions[?(@.type=="Ready")].status, NODEPOOL:metadata.labels.cloud.google.com/gke-nodepool' | grep -w True | awk {'print $3'} | sort | uniq -c`, hiding output unless there is an error. + [XPK] Task: `Kueue: ClusterQueue Details` is implemented by `kubectl describe ClusterQueue cluster-queue`, hiding output unless there is an error. + [XPK] Task: `Kueue: LocalQueue Details` is implemented by `kubectl describe LocalQueue multislice-queue`, hiding output unless there is an error. + [XPK] Task: `Kueue: Kueue Deployment Details` is implemented by `kubectl describe Deployment kueue-controller-manager -n kueue-system`, hiding output unless there is an error. + [XPK] Task: `Jobset: Deployment Details` is implemented by `kubectl describe Deployment jobset-controller-manager -n jobset-system`, hiding output unless there is an error. + [XPK] Task: `Kueue Manager Logs` is implemented by `kubectl logs deployment/kueue-controller-manager -n kueue-system --tail=100 --prefix=True`, hiding output unless there is an error. + [XPK] Task: `Jobset Manager Logs` is implemented by `kubectl logs deployment/jobset-controller-manager -n jobset-system --tail=100 --prefix=True`, hiding output unless there is an error. + [XPK] Task: `List Jobs with filter-by-status=EVERYTHING with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" `, hiding output unless there is an error. 
+ [XPK] Task: `List Jobs with filter-by-status=QUEUED with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" | awk -e 'NR == 1 || ($7 ~ "Admitted|Evicted|QuotaReserved" && ($5 ~ "" || $5 == 0)) {print $0}' `, hiding output unless there is an error. + [XPK] Task: `List Jobs with filter-by-status=RUNNING with filter-by-jobs=None` is implemented by `kubectl get workloads -o=custom-columns="Jobset Name:.metadata.ownerReferences[0].name,Created Time:.metadata.creationTimestamp,Priority:.spec.priorityClassName,TPU VMs Needed:.spec.podSets[0].count,TPU VMs Running/Ran:.status.admission.podSetAssignments[-1].count,TPU VMs Done:.status.reclaimablePods[0].count,Status:.status.conditions[-1].type,Status Message:.status.conditions[-1].message,Status Time:.status.conditions[-1].lastTransitionTime" | awk -e 'NR == 1 || ($7 ~ "Admitted|Evicted" && $5 ~ /^[0-9]+$/ && $5 > 0) {print $0}' `, hiding output unless there is an error. + [XPK] Find xpk inspector output file: /tmp/tmp0pd6_k1o + [XPK] Exiting XPK cleanly + ``` diff --git a/docs/usage/job.md b/docs/usage/job.md new file mode 100644 index 000000000..57d326e8a --- /dev/null +++ b/docs/usage/job.md @@ -0,0 +1,26 @@ + +## Job List + +* Job List (see jobs submitted via batch command): + + ```shell + python3 xpk.py job ls --cluster xpk-test + ``` + +* Example Job List Output: + + ``` + NAME PROFILE LOCAL QUEUE COMPLETIONS DURATION AGE + xpk-def-app-profile-slurm-74kbv xpk-def-app-profile 1/1 15s 17h + xpk-def-app-profile-slurm-brcsg xpk-def-app-profile 1/1 9s 3h56m + xpk-def-app-profile-slurm-kw99l xpk-def-app-profile 1/1 5s 3h54m + xpk-def-app-profile-slurm-x99nx xpk-def-app-profile 3/3 29s 17h + ``` + +## Job Cancel + +* Job Cancel (delete job submitted via batch command): + + ```shell + python3 xpk.py job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test + ``` diff --git a/docs/usage/run.md b/docs/usage/run.md new file mode 100644 index 000000000..dc25a5a62 --- /dev/null +++ b/docs/usage/run.md @@ -0,0 +1,29 @@ + +## Run +* `xpk run` lets you execute scripts on a cluster with ease. It automates task execution, handles interruptions, and streams job output to your console. + + ```shell + python xpk.py run --kind-cluster -n 2 -t 0-2 examples/job.sh + ``` + +* Example Output: + + ```shell + [XPK] Starting xpk + [XPK] Task: `get current-context` is implemented by `kubectl config current-context`, hiding output unless there is an error. + [XPK] No local cluster name specified. Using current-context `kind-kind` + [XPK] Task: `run task` is implemented by `kubectl kjob create slurm --profile xpk-def-app-profile --localqueue multislice-queue --wait --rm -- examples/job.sh --partition multislice-queue --ntasks 2 --time 0-2`. Streaming output and input live. + job.batch/xpk-def-app-profile-slurm-g4vr6 created + configmap/xpk-def-app-profile-slurm-g4vr6 created + service/xpk-def-app-profile-slurm-g4vr6 created + Starting log streaming for pod xpk-def-app-profile-slurm-g4vr6-1-4rmgk... + Now processing task ID: 3 + Starting log streaming for pod xpk-def-app-profile-slurm-g4vr6-0-bg6dm... 
+ Now processing task ID: 1 + exit + exit + Now processing task ID: 2 + exit + Job logs streaming finished.[XPK] Task: `run task` terminated with code `0` + [XPK] XPK Done. + ``` diff --git a/docs/usage/storage.md b/docs/usage/storage.md new file mode 100644 index 000000000..1895f1cc0 --- /dev/null +++ b/docs/usage/storage.md @@ -0,0 +1,175 @@ +## Storage +Currently XPK supports the below types of storages: +- [Cloud Storage FUSE](#fuse) +- [Google Cloud Filestore](#filestore) +- [Google Cloud Parallelstore](#parallelstore) +- [Google Cloud Block storages (Persistent Disk, Hyperdisk)](#block-storage-persistent-disk-hyperdisk) +- [Google Cloud Managed Lustre](#managed-lustre) + +### FUSE +A FUSE adapter lets you mount and access Cloud Storage buckets as local file systems, so workloads can read and write objects in your bucket using standard file system semantics. + +To use the GCS FUSE with XPK you need to create a [Storage Bucket](https://console.cloud.google.com/storage/). + +Once it's ready you can use `xpk storage attach` with `--type=gcsfuse` command to attach a FUSE storage instance to your cluster: + +```shell +python3 xpk.py storage attach test-fuse-storage --type=gcsfuse \ + --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE + --mount-point='/test-mount-point' --readonly=false \ + --bucket=test-bucket --size=1 --auto-mount=false +``` + +Parameters: + +- `--type` - type of the storage, currently xpk supports `gcsfuse` and `gcpfilestore` only. +- `--auto-mount` - if set to true all workloads will have this storage mounted by default. +- `--mount-point` - the path on which this storage should be mounted for a workload. +- `--readonly` - if set to true, workload can only read from storage. +- `--size` - size of the storage in Gb. +- `--bucket` - name of the storage bucket. If not set then the name of the storage is used as a bucket name. +- `--mount-options` - comma-separated list of additional mount options for PersistentVolume ([reference](https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf#mount-options)). +- `--prefetch-metadata` - enables metadata pre-population when mounting the volume by setting parameter `gcsfuseMetadataPrefetchOnMount` to `true` ([reference](https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-storage-fuse-csi-driver-perf#metadata-prefetch)). +- `--manifest` - path to the manifest file containing PersistentVolume and PresistentVolumeClaim definitions. If set, then values from manifest override the following parameters: `--size` and `--bucket`. + +### Filestore + +A Filestore adapter lets you mount and access [Filestore instances](https://cloud.google.com/filestore/) as local file systems, so workloads can read and write files in your volumes using standard file system semantics. 
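
If you are unsure whether a suitable Filestore instance already exists in your project, you can list the existing instances first (a sketch; it assumes the Filestore API is enabled and `gcloud` is authenticated for `$PROJECT`). An existing instance can then be attached with `xpk storage attach` as shown below, while `xpk storage create` provisions a new one:

```shell
# List Filestore instances in the project to decide between `storage create` and `storage attach`.
gcloud filestore instances list --project=$PROJECT
```
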
+
+To create and attach a GCP Filestore instance to your cluster use `xpk storage create` command with `--type=gcpfilestore`:
+
+```shell
+python3 xpk.py storage create test-fs-storage --type=gcpfilestore \
+  --auto-mount=false --mount-point=/data-fs --readonly=false \
+  --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
+  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
+```
+
+You can also attach an existing Filestore instance to your cluster using `xpk storage attach` command:
+
+```shell
+python3 xpk.py storage attach test-fs-storage --type=gcpfilestore \
+  --auto-mount=false --mount-point=/data-fs --readonly=false \
+  --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \
+  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
+```
+
+The command above is also useful when attaching multiple volumes from the same Filestore instance.
+
+Commands `xpk storage create` and `xpk storage attach` with `--type=gcpfilestore` accept the following arguments:
+- `--type` - type of the storage.
+- `--auto-mount` - if set to true all workloads will have this storage mounted by default.
+- `--mount-point` - the path on which this storage should be mounted for a workload.
+- `--readonly` - if set to true, workload can only read from storage.
+- `--size` - size of the Filestore instance that will be created, in GB.
+- `--tier` - tier of the Filestore instance that will be created. Possible options are: `[BASIC_HDD, BASIC_SSD, ZONAL, REGIONAL, ENTERPRISE]`
+- `--access-mode` - access mode of the Filestore instance that will be created. Possible values are: `[ReadWriteOnce, ReadOnlyMany, ReadWriteMany]`
+- `--vol` - file share name of the Filestore instance that will be created.
+- `--instance` - the name of the Filestore instance. If not set, the name parameter is used as the instance name. Useful when connecting multiple volumes from the same Filestore instance.
+- `--manifest` - path to the manifest file containing PersistentVolume, PersistentVolumeClaim and StorageClass definitions. If set, values from the manifest override the following parameters: `--access-mode`, `--size` and `--vol`.
+
+### Parallelstore
+
+A Parallelstore adapter lets you mount and access [Parallelstore instances](https://cloud.google.com/parallelstore/) as local file systems, so workloads can read and write files in your volumes using standard file system semantics.
+
+To use Google Cloud Parallelstore with XPK you need to create a [Parallelstore Instance](https://console.cloud.google.com/parallelstore/).
+
+Once it's ready you can use `xpk storage attach` with `--type=parallelstore` command to attach a Parallelstore instance to your cluster. Currently, attaching a Parallelstore is supported only by providing a manifest file.
+
+```shell
+python3 xpk.py storage attach test-parallelstore-storage --type=parallelstore \
+  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
+  --mount-point='/test-mount-point' --readonly=false \
+  --auto-mount=true \
+  --manifest='./examples/storage/parallelstore-manifest-attach.yaml'
+```
+
+Parameters:
+
+- `--type` - type of the storage (`parallelstore`).
+- `--auto-mount` - if set to true all workloads will have this storage mounted by default.
+- `--mount-point` - the path on which this storage should be mounted for a workload.
+- `--readonly` - if set to true, workload can only read from storage.
+- `--manifest` - path to the manifest file containing PersistentVolume and PersistentVolumeClaim definitions.
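+
+For a quick end-to-end check after attaching, here is a minimal sketch that verifies the attachment and then mounts it in a workload. It only reuses the `storage list` and `workload create --storage` commands described later in this document; the workload name and command are illustrative, and the storage name `test-parallelstore-storage` matches the attach example above.
+
+```shell
+# Confirm the storage is attached to the cluster.
+python3 xpk.py storage list \
+  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE
+
+# Run a workload that reads the mount point. The --storage flag is only
+# needed when the storage was attached with --auto-mount=false.
+python3 xpk.py workload create \
+  --workload storage-smoke-test --command "ls /test-mount-point" \
+  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
+  --tpu-type=v5litepod-16 --storage=test-parallelstore-storage
+```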
+
+### Block storage (Persistent Disk, Hyperdisk)
+
+A PersistentDisk adapter lets you mount and access Google Cloud Block storage solutions ([Persistent Disk](https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview#pd), [Hyperdisk](https://cloud.google.com/kubernetes-engine/docs/concepts/storage-overview#hyperdisk)) as local file systems, so workloads can read and write files in your volumes using standard file system semantics.
+
+To use the GCE PersistentDisk with XPK you need to create a [disk in GCE](https://cloud.google.com/compute/docs/disks). Please make sure that the disk type you create is [compatible with the VMs](https://cloud.google.com/compute/docs/machine-resource#machine_type_comparison) in the default and accelerator nodepools.
+
+Once it's ready you can use `xpk storage attach` with `--type=pd` command to attach a PersistentDisk instance to your cluster. Currently, attaching a PersistentDisk is supported only by providing a manifest file.
+
+```shell
+python3 xpk.py storage attach test-pd-storage --type=pd \
+  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
+  --mount-point='/test-mount-point' --readonly=false \
+  --auto-mount=true \
+  --manifest='./examples/storage/pd-manifest-attach.yaml'
+```
+
+Parameters:
+
+- `--type` - type of the storage (`pd`).
+- `--auto-mount` - if set to true all workloads will have this storage mounted by default.
+- `--mount-point` - the path on which this storage should be mounted for a workload.
+- `--readonly` - if set to true, workload can only read from storage.
+- `--manifest` - path to the manifest file containing PersistentVolume and PersistentVolumeClaim definitions.
+
+### Managed Lustre
+
+A Managed Lustre adapter lets you mount and access [Google Cloud Managed Lustre instances](https://cloud.google.com/managed-lustre) as local file systems, so workloads can read and write files in your volumes using standard file system semantics.
+
+To use GCP Managed Lustre with XPK you need to create [an instance](https://cloud.google.com/managed-lustre/docs/create-instance). Please make sure you enable GKE support when creating the instance (for example, by passing `--gke-support-enabled` to gcloud).
+
+Once it's ready you can use `xpk storage attach` with `--type=lustre` command to attach a Managed Lustre instance to your cluster. Currently, attaching a Managed Lustre instance is supported only by providing a manifest file.
+
+```shell
+python3 xpk.py storage attach test-lustre-storage --type=lustre \
+  --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \
+  --mount-point='/test-mount-point' --readonly=false \
+  --auto-mount=true \
+  --manifest='./examples/storage/lustre-manifest-attach.yaml'
+```
+
+Parameters:
+
+- `--type` - type of the storage (`lustre`).
+- `--auto-mount` - if set to true all workloads will have this storage mounted by default.
+- `--mount-point` - the path on which this storage should be mounted for a workload.
+- `--readonly` - if set to true, workload can only read from storage.
+- `--manifest` - path to the manifest file containing PersistentVolume and PersistentVolumeClaim definitions.
+
+### List attached storages
+
+```shell
+python3 xpk.py storage list \
+  --project=$PROJECT --cluster $CLUSTER --zone=$ZONE
+```
+
+### Running workloads with storage
+
+If you specified `--auto-mount=true` when creating or attaching a storage, then all workloads deployed on the cluster will have the volume attached by default.
Otherwise, in order to have the storage attached, you have to add `--storage` parameter to `workload create` command: + +```shell +python3 xpk.py workload create \ + --workload xpk-test-workload --command "echo goodbye" \ + --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \ + --tpu-type=v5litepod-16 --storage=test-storage +``` + +### Detaching storage + +```shell +python3 xpk.py storage detach $STORAGE_NAME \ + --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE +``` + +### Deleting storage + +XPK allows you to remove Filestore instances easily with `xpk storage delete` command. **Warning:** this deletes all data contained in the Filestore! + +```shell +python3 xpk.py storage delete test-fs-instance \ + --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE +``` diff --git a/docs/usage/workloads.md b/docs/usage/workloads.md new file mode 100644 index 000000000..768b57786 --- /dev/null +++ b/docs/usage/workloads.md @@ -0,0 +1,252 @@ +## Workload Create +* Workload Create (submit training job): + + ```shell + python3 xpk.py workload create \ + --workload xpk-test-workload --command "echo goodbye" \ + --cluster xpk-test \ + --tpu-type=v5litepod-16 --project=$PROJECT + ``` +* Workload create(DWS flex with queued provisioning): + ```shell + python3 xpk.py workload create \ + --workload xpk-test-workload --command "echo goodbye" \ + --cluster xpk-test --flex \ + --tpu-type=v5litepod-16 --project=$PROJECT + +* Workload Create for Pathways: + Pathways workload can be submitted using `workload create-pathways` on a Pathways enabled cluster (created with `cluster create-pathways`) + + Pathways workload example: + ```shell + python3 xpk.py workload create-pathways \ + --workload xpk-pw-test \ + --num-slices=1 \ + --tpu-type=v5litepod-16 \ + --cluster xpk-pw-test \ + --docker-name='user-workload' \ + --docker-image= \ + --command='python3 -m MaxText.train MaxText/configs/base.yml base_output_directory= dataset_path= per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1 enable_single_controller=True' + ``` + + Regular workload can also be submitted on a Pathways enabled cluster (created with `cluster create-pathways`) + + Pathways workload example: + ```shell + python3 xpk.py workload create-pathways \ + --workload xpk-regular-test \ + --num-slices=1 \ + --tpu-type=v5litepod-16 \ + --cluster xpk-pw-test \ + --docker-name='user-workload' \ + --docker-image= \ + --command='python3 -m MaxText.train MaxText/configs/base.yml base_output_directory= dataset_path= per_device_batch_size=1 enable_checkpointing=false enable_profiler=false remat_policy=full global_parameter_scale=4 steps=300 max_target_length=2048 use_iota_embed=true reuse_example_batch=1 dataset_type=synthetic attention=flash gcs_metrics=True run_name=$(USER)-pw-xpk-test-1' + ``` + + Pathways in headless mode - Pathways now offers the capability to run JAX workloads in Vertex AI notebooks or in GCE VMs! + Specify `--headless` with `workload create-pathways` when the user workload is not provided in a docker container. + ```shell + python3 xpk.py workload create-pathways --headless \ + --workload xpk-pw-headless \ + --num-slices=1 \ + --tpu-type=v5litepod-16 \ + --cluster xpk-pw-test + ``` + Executing the command above would provide the address of the proxy that the user job should connect to. 
+  ```shell
+  kubectl get pods
+  kubectl port-forward pod/ 29000:29000
+  ```
+  ```shell
+  JAX_PLATFORMS=proxy JAX_BACKEND_TARGET=grpc://127.0.0.1:29000 python -c 'import pathwaysutils; import jax; print(jax.devices())'
+  ```
+  Specify `JAX_PLATFORMS=proxy` and `JAX_BACKEND_TARGET=` and `import pathwaysutils` to establish this connection between the user's JAX code and the Pathways proxy. Execute Pathways workloads interactively on Vertex AI notebooks!
+
+### Set `max-restarts` for production jobs
+
+* `--max-restarts <value>`: By default, this is 0. This will restart the job `<value>` times when the job terminates. For production jobs, it is recommended to
+increase this to a large number, say 50. Real jobs can be interrupted due to
+hardware failures and software updates. We assume your job has implemented
+checkpointing so the job restarts near where it was interrupted.
+
+### Workloads for A3 Ultra, A3 Mega and A4 clusters (GPU machines)
+To submit jobs on a cluster with A3 or A4 machines, run the command with selected device type. To create a cluster with A3 or A4 machines see [here](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines).
+
+
+Machine | Device type
+:- | :-
+A3 Mega | `h100-mega-80gb-8`
+A3 Ultra | `h200-141gb-8`
+A4 | `b200-8`
+
+```shell
+python3 xpk.py workload create \
+  --workload=$WORKLOAD_NAME --command="echo goodbye" \
+  --cluster=$CLUSTER_NAME --device-type DEVICE_TYPE \
+  --zone=$COMPUTE_ZONE --project=$PROJECT_ID \
+  --num-nodes=$WORKLOAD_NUM_NODES
+```
+
+> The docker image flags/arguments introduced in [workloads section](#workload-create) can be used with A3 or A4 machines as well.
+
+In order to run NCCL test on A3 machines check out [this guide](/examples/nccl/nccl.md).
+
+### Workload Priority and Preemption
+* Set the priority level of your workload with `--priority=LEVEL`
+
+  We have five priorities defined: [`very-low`, `low`, `medium`, `high`, `very-high`].
+  The default priority is `medium`.
+
+  Priority determines:
+
+  1. Order of queued jobs.
+
+     Queued jobs are ordered by
+     `very-low` < `low` < `medium` < `high` < `very-high`
+
+  2. Preemption of lower priority workloads.
+
+     A higher priority job will `evict` lower priority jobs.
+     Evicted jobs are brought back to the queue and will re-hydrate appropriately.
+
+  #### General Example:
+  ```shell
+  python3 xpk.py workload create \
+  --workload xpk-test-medium-workload --command "echo goodbye" --cluster \
+  xpk-test --tpu-type=v5litepod-16 --priority=medium
+  ```
+
+### Create Vertex AI Experiment to upload data to Vertex AI Tensorboard
+*Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have
+[Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
+assigned to your user account and to the [Compute Engine Service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account) attached to the node pools in the cluster.*
+
+Vertex AI Experiment is a tool that helps to track and analyze an experiment run on Vertex AI Tensorboard. To learn more about Vertex AI Experiments, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments).
+
+XPK will create a Vertex AI Experiment in `workload create` command and attach the Vertex AI Tensorboard created for the cluster during `cluster create`.
If there is a cluster created before this feature is released, there will be no Vertex AI Tensorboard created for the cluster and `workload create` will fail. Re-run `cluster create` to create a Vertex AI Tensorboard and then run `workload create` again to schedule your workload. + +* Create Vertex AI Experiment with default Experiment name: + +```shell +python3 xpk.py workload create \ +--cluster xpk-test --workload xpk-workload \ +--use-vertex-tensorboard +``` + +will create a Vertex AI Experiment with the name `xpk-test-xpk-workload` (*-*). + +* Create Vertex AI Experiment with user-specified Experiment name: + +```shell +python3 xpk.py workload create \ +--cluster xpk-test --workload xpk-workload \ +--use-vertex-tensorboard --experiment-name=test-experiment +``` + +will create a Vertex AI Experiment with the name `test-experiment`. + +Check out [MaxText example](https://github.com/google/maxtext/pull/570) on how to update your workload to automatically upload logs collected in your Tensorboard directory to the Vertex AI Experiment created by `workload create`. + +## Workload Delete +* Workload Delete (delete training job): + + ```shell + python3 xpk.py workload delete \ + --workload xpk-test-workload --cluster xpk-test + ``` + + This will only delete `xpk-test-workload` workload in `xpk-test` cluster. + +* Workload Delete (delete all training jobs in the cluster): + + ```shell + python3 xpk.py workload delete \ + --cluster xpk-test + ``` + + This will delete all the workloads in `xpk-test` cluster. Deletion will only begin if you type `y` or `yes` at the prompt. Multiple workload deletions are processed in batches for optimized processing. + +* Workload Delete supports filtering. Delete a portion of jobs that match user criteria. + * Filter by Job: `filter-by-job` + + ```shell + python3 xpk.py workload delete \ + --cluster xpk-test --filter-by-job=$USER + ``` + + This will delete all the workloads in `xpk-test` cluster whose names start with `$USER`. Deletion will only begin if you type `y` or `yes` at the prompt. + + * Filter by Status: `filter-by-status` + + ```shell + python3 xpk.py workload delete \ + --cluster xpk-test --filter-by-status=QUEUED + ``` + + This will delete all the workloads in `xpk-test` cluster that have the status as Admitted or Evicted, and the number of running VMs is 0. Deletion will only begin if you type `y` or `yes` at the prompt. Status can be: `EVERYTHING`,`FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL`. + +## Workload List +* Workload List (see training jobs): + + ```shell + python3 xpk.py workload list \ + --cluster xpk-test + ``` + +* Example Workload List Output: + + The below example shows four jobs of different statuses: + + * `user-first-job-failed`: **filter-status** is `FINISHED` and `FAILED`. + * `user-second-job-success`: **filter-status** is `FINISHED` and `SUCCESSFUL`. + * `user-third-job-running`: **filter-status** is `RUNNING`. + * `user-forth-job-in-queue`: **filter-status** is `QUEUED`. + * `user-fifth-job-in-queue-preempted`: **filter-status** is `QUEUED`. 
+ + ``` + Jobset Name Created Time Priority TPU VMs Needed TPU VMs Running/Ran TPU VMs Done Status Status Message Status Time + user-first-job-failed 2023-1-1T1:00:00Z medium 4 4 Finished JobSet failed 2023-1-1T1:05:00Z + user-second-job-success 2023-1-1T1:10:00Z medium 4 4 4 Finished JobSet finished successfully 2023-1-1T1:14:00Z + user-third-job-running 2023-1-1T1:15:00Z medium 4 4 Admitted Admitted by ClusterQueue cluster-queue 2023-1-1T1:16:00Z + user-forth-job-in-queue 2023-1-1T1:16:05Z medium 4 Admitted couldn't assign flavors to pod set slice-job: insufficient unused quota for google.com/tpu in flavor 2xv4-8, 4 more need 2023-1-1T1:16:10Z + user-fifth-job-preempted 2023-1-1T1:10:05Z low 4 Evicted Preempted to accommodate a higher priority Workload 2023-1-1T1:10:00Z + ``` + +* Workload List supports filtering. Observe a portion of jobs that match user criteria. + + * Filter by Status: `filter-by-status` + + Filter the workload list by the status of respective jobs. + Status can be: `EVERYTHING`,`FINISHED`, `RUNNING`, `QUEUED`, `FAILED`, `SUCCESSFUL` + + * Filter by Job: `filter-by-job` + + Filter the workload list by the name of a job. + + ```shell + python3 xpk.py workload list \ + --cluster xpk-test --filter-by-job=$USER + ``` + +* Workload List supports waiting for the completion of a specific job. XPK will follow an existing job until it has finished or the `timeout`, if provided, has been reached and then list the job. If no `timeout` is specified, the default value is set to the max value, 1 week. You may also set `timeout=0` to poll the job once. + + Wait for a job to complete. + + ```shell + python3 xpk.py workload list \ + --cluster xpk-test --wait-for-job-completion=xpk-test-workload + ``` + + Wait for a job to complete with a timeout of 300 seconds. + + ```shell + python3 xpk.py workload list \ + --cluster xpk-test --wait-for-job-completion=xpk-test-workload \ + --timeout=300 + ``` + + Return codes + `0`: Workload finished and completed successfully. + `124`: Timeout was reached before workload finished. + `125`: Workload finished but did not complete successfully. + `1`: Other failure. 
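+
+  As a rough sketch, these return codes can drive a script that blocks on a workload and reacts to the outcome. The cluster and workload names are the examples used throughout this page, and the echoed messages are only illustrative:
+
+  ```shell
+  # Block until the workload finishes or the timeout is hit, then branch on the exit code.
+  python3 xpk.py workload list \
+    --cluster xpk-test --wait-for-job-completion=xpk-test-workload \
+    --timeout=300
+  case $? in
+    0)   echo "workload completed successfully" ;;
+    124) echo "timed out waiting for the workload" ;;
+    125) echo "workload finished but did not complete successfully" ;;
+    *)   echo "xpk reported an error" ;;
+  esac
+  ```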
diff --git a/src/xpk/README.md b/src/xpk/README.md deleted file mode 100644 index c603098eb..000000000 --- a/src/xpk/README.md +++ /dev/null @@ -1,10 +0,0 @@ -## Code structure - -`xpk` package consists of three packages -- `parsers` - user-facing code that parses commands: `xpk cluster`, `xpk workload`, `xpk inspector` -- `commands` - code responsible for handling parsed commands -- `core` - building blocks for `commands` package with all of the `gcloud` invocations -- `utils` - contains utility modules shared across the whole codebase - -Additionally there are modules -- `main.py` - serves as an entrypoint to the xpk \ No newline at end of file From a69444e835d29730fba875109bbe5ef581cd141b Mon Sep 17 00:00:00 2001 From: Konrad Kaim Date: Fri, 10 Oct 2025 11:26:58 +0000 Subject: [PATCH 2/9] docs: fix links and reorganize index --- README.md | 16 ++++++++-------- docs/usage/workloads.md | 4 ++-- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/README.md b/README.md index d68924ee9..dd1022e50 100644 --- a/README.md +++ b/README.md @@ -46,9 +46,9 @@ XPK supports the following TPU types: and the following GPU types: * A100 * A3-Highgpu (h100) -* A3-Mega (h100-mega) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) -* A3-Ultra (h200) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) -* A4 (b200) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) +* A3-Mega (h100-mega) - [Create cluster](./docs/usage/clusters.md), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) +* A3-Ultra (h200) - [Create cluster](./docs/usage/clusters.md), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) +* A4 (b200) - [Create cluster](./docs/usage/clusters.md), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) * A4X (gb200) and the following CPU types: @@ -64,14 +64,14 @@ XPK also supports [Google Cloud Storage solutions](./docs/usage/storage.md): * [Permissions](./docs/permissions.md) * [Installation](./docs/installation.md) -* [Usage](./docs/usage/) +* Usage: * [Clusters](./docs/usage/clusters.md) + * [GPU](./docs/usage/gpu.md) + * [CPU](./docs/usage/cpu.md) + * [Autoprovisioning](./docs/usage/autoprovisioning.md) * [Workloads](./docs/usage/workloads.md) + * [Docker](./docs/usage/docker.md) * [Storage](./docs/usage/storage.md) - * [GPU](./docs/usage/gpu.md) - * [CPU](./docs/usage/cpu.md) - * [Autoprovisioning](./docs/usage/autoprovisioning.md) - * [Docker](./docs/usage/docker.md) * [Advanced](./docs/usage/advanced.md) * [Inspector](./docs/usage/inspector.md) * [Run](./docs/usage/run.md) diff --git a/docs/usage/workloads.md b/docs/usage/workloads.md index 768b57786..b900f0519 100644 --- a/docs/usage/workloads.md +++ b/docs/usage/workloads.md @@ -70,7 +70,7 @@ hardware failures and software updates. We assume your job has implemented checkpointing so the job restarts near where it was interrupted. 
### Workloads for A3 Ultra, A3 Mega and A4 clusters (GPU machines) -To submit jobs on a cluster with A3 or A4 machines, run the command with selected device type. To create a cluster with A3 or A4 machines see [here](#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines). +To submit jobs on a cluster with A3 or A4 machines, run the command with selected device type. To create a cluster with A3 or A4 machines see [here](../clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines). Machine | Device type @@ -89,7 +89,7 @@ python3 xpk.py workload create \ > The docker image flags/arguments introduced in [workloads section](#workload-create) can be used with A3 or A4 machines as well. -In order to run NCCL test on A3 machines check out [this guide](/examples/nccl/nccl.md). +In order to run NCCL test on A3 machines check out [this guide](../../examples/nccl/nccl.md). ### Workload Priority and Preemption * Set the priority level of your workload with `--priority=LEVEL` From 4d074c07d25dacf3d349feb63f0d2445513d23bb Mon Sep 17 00:00:00 2001 From: Konrad Kaim Date: Fri, 10 Oct 2025 12:27:07 +0000 Subject: [PATCH 3/9] docs: fix links --- README.md | 6 +++--- docs/usage/clusters.md | 44 +++++++++++++++++++++++++++++++++++++++++ docs/usage/workloads.md | 2 +- 3 files changed, 48 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index dd1022e50..64f10d8e0 100644 --- a/README.md +++ b/README.md @@ -46,9 +46,9 @@ XPK supports the following TPU types: and the following GPU types: * A100 * A3-Highgpu (h100) -* A3-Mega (h100-mega) - [Create cluster](./docs/usage/clusters.md), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) -* A3-Ultra (h200) - [Create cluster](./docs/usage/clusters.md), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) -* A4 (b200) - [Create cluster](./docs/usage/clusters.md), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) +* A3-Mega (h100-mega) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) +* A3-Ultra (h200) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) +* A4 (b200) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) * A4X (gb200) and the following CPU types: diff --git a/docs/usage/clusters.md b/docs/usage/clusters.md index 54962fecd..2ba976b95 100644 --- a/docs/usage/clusters.md +++ b/docs/usage/clusters.md @@ -259,3 +259,47 @@ will fail the cluster creation process because Vertex AI Tensorboard is not supp --cluster xpk-test --docker-image gcr.io/your_docker_image \ --tpu-type=v5litepod-16 ``` + +## Provisioning A3 Ultra, A3 Mega and A4 clusters (GPU machines) +To create a cluster with A3 or A4 machines, run the command below with selected device type. To create workloads on these clusters see [here](#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines). + +**Note:** Creating A3 Ultra, A3 Mega and A4 clusters is currently supported **only** on linux/amd64 architecture. 
+ +Machine | Device type +:- | :- +A3 Mega | `h100-mega-80gb-8` +A3 Ultra | `h200-141gb-8` +A4 | `b200-8` + + +```shell +python3 xpk.py cluster create \ + --cluster CLUSTER_NAME --device-type DEVICE_TYPE \ + --zone=$COMPUTE_ZONE --project=$PROJECT_ID \ + --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID +``` + +Currently, the below flags/arguments are supported for A3 Mega, A3 Ultra and A4 machines: + * `--num-nodes` + * `--default-pool-cpu-machine-type` + * `--default-pool-cpu-num-nodes` + * `--reservation` + * `--spot` + * `--on-demand` (A3 Mega only) + * `--flex` + +## Running XPK on existing clusters + +In order to run XPK commands on a cluster it needs to be set up correctly. This is done automatically when creating a cluster using `xpk cluster create`. For clusters created differently (e.g.: with 'gcloud' or a Cluster Toolkit blueprint) there is a dedicated command: `xpk cluster adapt`. This command installs required config maps, kueue, jobset, CSI drivers etc. + +Currently `xpk cluster adapt` supports only the following device types: + +- `h200-141gb-8` (A3 Ultra) + +Example usage: +```shell +python3 xpk.py cluster adapt \ + --cluster=$CLUSTER_NAME --device-type=$DEVICE_TYPE \ + --zone=$COMPUTE_ZONE --project=$PROJECT_ID \ + --num-nodes=$NUM_NODES --reservation=$RESERVATION_ID +``` diff --git a/docs/usage/workloads.md b/docs/usage/workloads.md index b900f0519..15f92fc98 100644 --- a/docs/usage/workloads.md +++ b/docs/usage/workloads.md @@ -70,7 +70,7 @@ hardware failures and software updates. We assume your job has implemented checkpointing so the job restarts near where it was interrupted. ### Workloads for A3 Ultra, A3 Mega and A4 clusters (GPU machines) -To submit jobs on a cluster with A3 or A4 machines, run the command with selected device type. To create a cluster with A3 or A4 machines see [here](../clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines). +To submit jobs on a cluster with A3 or A4 machines, run the command with selected device type. To create a cluster with A3 or A4 machines see [here](./clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines). Machine | Device type From c4ea14c933cab7c8f94538671d11456793972052 Mon Sep 17 00:00:00 2001 From: Konrad Kaim Date: Fri, 10 Oct 2025 12:32:21 +0000 Subject: [PATCH 4/9] docs: refactor README.md to use tables --- README.md | 24 +++++++++++++++++++++++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 64f10d8e0..56f07e95b 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@ + # Code of Conduct ## Our Pledge diff --git a/docs/contributing.md b/docs/contributing.md index 28a29b52b..1916348e6 100644 --- a/docs/contributing.md +++ b/docs/contributing.md @@ -1,3 +1,19 @@ + + # How to Contribute We would love to accept your patches and contributions to this project. diff --git a/docs/installation.md b/docs/installation.md index f40f5de3a..ad33b8340 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -1,4 +1,19 @@ - + + # Installation There are 2 ways to install XPK: diff --git a/docs/local_testing.md b/docs/local_testing.md index e88af0f12..113a6682a 100644 --- a/docs/local_testing.md +++ b/docs/local_testing.md @@ -1,4 +1,19 @@ - + + # Local testing with Kind To facilitate development and testing locally, we have integrated support for testing with `kind`. This enables you to simulate a Kubernetes environment on your local machine. 
diff --git a/docs/permissions.md b/docs/permissions.md index b26d4da43..65808f68e 100644 --- a/docs/permissions.md +++ b/docs/permissions.md @@ -1,4 +1,19 @@ + + # Permissions needed on Cloud Console: * Artifact Registry Writer diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index 50ac70766..d6964bce6 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -1,4 +1,19 @@ - + + # Troubleshooting ## `Invalid machine type` for CPUs. diff --git a/docs/usage/advanced.md b/docs/usage/advanced.md index 9ae900701..6c80b36f6 100644 --- a/docs/usage/advanced.md +++ b/docs/usage/advanced.md @@ -1,3 +1,18 @@ + # More advanced facts: diff --git a/docs/usage/autoprovisioning.md b/docs/usage/autoprovisioning.md index 1d5b8af1d..0901f7184 100644 --- a/docs/usage/autoprovisioning.md +++ b/docs/usage/autoprovisioning.md @@ -1,3 +1,19 @@ + + # Autoprovisioning with XPK XPK can dynamically allocate cluster capacity using [Node Auto Provisioning, (NAP)](https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#use_accelerators_for_new_auto-provisioned_node_pools) support. diff --git a/docs/usage/clusters.md b/docs/usage/clusters.md index 2ba976b95..649d9e000 100644 --- a/docs/usage/clusters.md +++ b/docs/usage/clusters.md @@ -1,3 +1,19 @@ + + ## Cluster Create First set the project and zone through gcloud config or xpk arguments. diff --git a/docs/usage/cpu.md b/docs/usage/cpu.md index 9197bdf95..640fa8eb9 100644 --- a/docs/usage/cpu.md +++ b/docs/usage/cpu.md @@ -1,3 +1,19 @@ + + ## CPU usage In order to use XPK for CPU, you can do so by using `device-type` flag. diff --git a/docs/usage/docker.md b/docs/usage/docker.md index 0c193cb9a..2a9e5a753 100644 --- a/docs/usage/docker.md +++ b/docs/usage/docker.md @@ -1,3 +1,19 @@ + + # How to add docker images to a xpk workload The default behavior is `xpk workload create` will layer the local directory (`--script-dir`) into diff --git a/docs/usage/gpu.md b/docs/usage/gpu.md index 0e6072074..20f9be41c 100644 --- a/docs/usage/gpu.md +++ b/docs/usage/gpu.md @@ -1,3 +1,19 @@ + + ## GPU usage In order to use XPK for GPU, you can do so by using `device-type` flag. diff --git a/docs/usage/inspector.md b/docs/usage/inspector.md index cc2e79122..dd9b1f3dc 100644 --- a/docs/usage/inspector.md +++ b/docs/usage/inspector.md @@ -1,3 +1,19 @@ + + ## Inspector * Inspector provides debug info to understand cluster health, and why workloads are not running. Inspector output is saved to a file. diff --git a/docs/usage/job.md b/docs/usage/job.md index 57d326e8a..4f724ee9d 100644 --- a/docs/usage/job.md +++ b/docs/usage/job.md @@ -1,4 +1,19 @@ - + + ## Job List * Job List (see jobs submitted via batch command): diff --git a/docs/usage/run.md b/docs/usage/run.md index dc25a5a62..beb771e34 100644 --- a/docs/usage/run.md +++ b/docs/usage/run.md @@ -1,3 +1,18 @@ + ## Run * `xpk run` lets you execute scripts on a cluster with ease. It automates task execution, handles interruptions, and streams job output to your console. 
diff --git a/docs/usage/storage.md b/docs/usage/storage.md index 1895f1cc0..6f34f8036 100644 --- a/docs/usage/storage.md +++ b/docs/usage/storage.md @@ -1,3 +1,19 @@ + + ## Storage Currently XPK supports the below types of storages: - [Cloud Storage FUSE](#fuse) diff --git a/docs/usage/workloads.md b/docs/usage/workloads.md index 15f92fc98..84cf36b61 100644 --- a/docs/usage/workloads.md +++ b/docs/usage/workloads.md @@ -1,3 +1,19 @@ + + ## Workload Create * Workload Create (submit training job): From 65bf39d8efd3370673d4ede197d96e21452fc53f Mon Sep 17 00:00:00 2001 From: Konrad Kaim Date: Fri, 10 Oct 2025 12:35:54 +0000 Subject: [PATCH 6/9] docs: fix formatting --- README.md | 21 --------------------- 1 file changed, 21 deletions(-) diff --git a/README.md b/README.md index 56f07e95b..cde7c57cb 100644 --- a/README.md +++ b/README.md @@ -36,12 +36,6 @@ return the hardware back to the shared pool when they complete, developers can achieve better use of finite hardware resources. And automated tests can run overnight while resources tend to be underutilized. -XPK supports the following TPU types: -* v4 -* v5e -* v5p -* Trillium (v6e) -* Ironwood (tpu7x) XPK supports a variety of hardware accelerators. | Accelerator | Type | Create Cluster | Create Workload | |-------------|--------------------|---------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------| @@ -58,17 +52,8 @@ XPK supports a variety of hardware accelerators. | GPU | A4X (gb200) | [docs](./docs/usage/gpu.md) | [docs](./docs/usage/workloads.md) | | CPU | n2-standard-32 | [docs](./docs/usage/cpu.md) | [docs](./docs/usage/workloads.md) | -and the following GPU types: -* A100 -* A3-Highgpu (h100) -* A3-Mega (h100-mega) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) -* A3-Ultra (h200) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) -* A4 (b200) - [Create cluster](./docs/usage/clusters.md#provisioning-a3-ultra-a3-mega-and-a4-clusters-gpu-machines), [Create workloads](./docs/usage/workloads.md#workloads-for-a3-ultra-a3-mega-and-a4-clusters-gpu-machines) -* A4X (gb200) XPK also supports the following [Google Cloud Storage solutions](./docs/usage/storage.md): -and the following CPU types: -* n2-standard-32 | Storage Type | Documentation | |--------------------------------------------|------------------------------------------------------------------------------------------| | Cloud Storage FUSE | [docs](./docs/usage/storage.md#fuse) | @@ -76,12 +61,6 @@ and the following CPU types: | Parallelstore | [docs](./docs/usage/storage.md#parallelstore) | | Block storage (Persistent Disk, Hyperdisk) | [docs](./docs/usage/storage.md#block-storage-persistent-disk-hyperdisk) | -XPK also supports [Google Cloud Storage solutions](./docs/usage/storage.md): -* [Cloud Storage FUSE](./docs/usage/storage.md#fuse) -* [Filestore](./docs/usage/storage.md#filestore) -* [Parallelstore](./docs/usage/storage.md#parallelstore) -* [Block storage (Persistent Disk, Hyperdisk)](./docs/usage/storage.md#block-storage-persistent-disk-hyperdisk) - # Documentation * 
[Permissions](./docs/permissions.md) From e586e1deb6f3803fb0374f15f503571de741326e Mon Sep 17 00:00:00 2001 From: Konrad Kaim Date: Fri, 10 Oct 2025 12:38:36 +0000 Subject: [PATCH 7/9] docs: use xpk instead of python3 xpk.py --- docs/local_testing.md | 6 ++-- docs/troubleshooting.md | 10 +++---- docs/usage/autoprovisioning.md | 14 +++++----- docs/usage/clusters.md | 50 +++++++++++++++++----------------- docs/usage/cpu.md | 4 +-- docs/usage/docker.md | 12 ++++---- docs/usage/gpu.md | 20 +++++++------- docs/usage/inspector.md | 2 +- docs/usage/job.md | 4 +-- docs/usage/storage.md | 20 +++++++------- docs/usage/workloads.md | 34 +++++++++++------------ examples/batch.md | 2 +- examples/nccl/nccl.md | 6 ++-- xpk-slurm-commands.md | 28 +++++++++---------- 14 files changed, 106 insertions(+), 106 deletions(-) diff --git a/docs/local_testing.md b/docs/local_testing.md index 113a6682a..32bd27005 100644 --- a/docs/local_testing.md +++ b/docs/local_testing.md @@ -30,7 +30,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil * Cluster create: ```shell - python3 xpk.py kind create \ + xpk kind create \ --cluster xpk-test ``` @@ -38,7 +38,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil * Cluster Delete: ```shell - python3 xpk.py kind delete \ + xpk kind delete \ --cluster xpk-test ``` @@ -46,7 +46,7 @@ xpk interfaces seamlessly with kind to manage Kubernetes clusters locally, facil * Cluster List: ```shell - python3 xpk.py kind list + xpk kind list ``` ## Local Testing Basics diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md index d6964bce6..da62a71c3 100644 --- a/docs/troubleshooting.md +++ b/docs/troubleshooting.md @@ -29,7 +29,7 @@ Please select a CPU type that exists in all zones in the region. # Find CPU Types supported in zones. gcloud compute machine-types list --zones=$ZONE_LIST # Adjust default cpu machine type. -python3 xpk.py cluster create --default-pool-cpu-machine-type=CPU_TYPE ... +xpk cluster create --default-pool-cpu-machine-type=CPU_TYPE ... ``` ## Workload creation fails @@ -110,7 +110,7 @@ If error of this kind appeared after updating xpk version it's possible that you ## Verbose Logging If you are having trouble with your workload, try setting the `--enable-debug-logs` when you schedule it. This will give you more detailed logs to help pinpoint the issue. For example: ```shell -python3 xpk.py workload create \ +xpk workload create \ --cluster --workload xpk-test-workload \ --command="echo hello world" --enable-debug-logs ``` @@ -142,7 +142,7 @@ This configuration will start collecting stack traces inside the `/tmp/debugging ### Explore Stack Traces To explore the stack traces collected in a temporary directory in Kubernetes Pod, you can run the following command to configure a sidecar container that will read the traces from `/tmp/debugging` directory. ```shell -python3 xpk.py workload create \ +xpk workload create \ --workload xpk-test-workload --command "python3 main.py" --cluster \ xpk-test --tpu-type=v5litepod-16 --deploy-stacktrace-sidecar ``` @@ -153,12 +153,12 @@ To list available resources and queues use ```xpk info``` command. It allows to To see queues with usage and workload info use: ```shell -python3 xpk.py info --cluster my-cluster +xpk info --cluster my-cluster ``` You can specify what kind of resources(clusterqueue or localqueue) you want to see using flags --clusterqueue or --localqueue. 
```shell -python3 xpk.py info --cluster my-cluster --localqueue +xpk info --cluster my-cluster --localqueue ``` ``` diff --git a/docs/usage/autoprovisioning.md b/docs/usage/autoprovisioning.md index 0901f7184..888d94c87 100644 --- a/docs/usage/autoprovisioning.md +++ b/docs/usage/autoprovisioning.md @@ -36,7 +36,7 @@ RESERVATION=reservation_id PROJECT=my_project ZONE=us-east5-b -python3 xpk.py cluster create \ +xpk cluster create \ --cluster $CLUSTER_NAME \ --num-slices=$NUM_SLICES \ --device-type=$DEVICE_TYPE \ @@ -73,7 +73,7 @@ RESERVATION=reservation_id PROJECT=my_project ZONE=us-east5-b -python3 xpk.py cluster create \ +xpk cluster create \ --cluster $CLUSTER_NAME \ --num-slices=$NUM_SLICES \ --device-type=$DEVICE_TYPE \ @@ -96,7 +96,7 @@ PROJECT=my_project ZONE=us-east5-b # This will create 2x v4-16 node pools and set the max autoprovisioned chips to 16. -python3 xpk.py cluster create \ +xpk cluster create \ --cluster $CLUSTER_NAME \ --num-slices=$NUM_SLICES \ --device-type=$DEVICE_TYPE \ @@ -116,7 +116,7 @@ PROJECT=my_project ZONE=us-east5-b # This will clear the node pools if they exist in the cluster and set the max autoprovisioned chips to 16 -python3 xpk.py cluster create \ +xpk cluster create \ --cluster $CLUSTER_NAME \ --num-slices=$NUM_SLICES \ --device-type=$DEVICE_TYPE \ @@ -137,7 +137,7 @@ NEW_RESERVATION=new_reservation_id PROJECT=my_project ZONE=us-east5-b # Create a 2x v4-8 TPU workload. -python3 xpk.py workload create \ +xpk workload create \ --cluster $CLUSTER \ --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \ --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \ @@ -150,7 +150,7 @@ NUM_SLICES=1 DEVICE_TYPE=v4-16 # Create a 1x v4-16 TPU workload. -python3 xpk.py workload create \ +xpk workload create \ --cluster $CLUSTER \ --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \ --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \ @@ -160,7 +160,7 @@ python3 xpk.py workload create \ --project=$PROJECT # Use a different reservation from what the cluster was created with. -python3 xpk.py workload create \ +xpk workload create \ --cluster $CLUSTER \ --workload ${USER}-nap-${NUM_SLICES}x${DEVICE_TYPE}_$(date +%H-%M-%S) \ --command "echo hello world from $NUM_SLICES $DEVICE_TYPE" \ diff --git a/docs/usage/clusters.md b/docs/usage/clusters.md index 649d9e000..96bb13093 100644 --- a/docs/usage/clusters.md +++ b/docs/usage/clusters.md @@ -37,7 +37,7 @@ all zones. # Find your reservations gcloud compute reservations list --project=$PROJECT_ID # Run cluster create with reservation. - python3 xpk.py cluster create \ + xpk cluster create \ --cluster xpk-test --tpu-type=v5litepod-256 \ --num-slices=2 \ --reservation=$RESERVATION_ID @@ -46,7 +46,7 @@ all zones. * Cluster Create (provision on-demand capacity): ```shell - python3 xpk.py cluster create \ + xpk cluster create \ --cluster xpk-test --tpu-type=v5litepod-16 \ --num-slices=4 --on-demand ``` @@ -54,14 +54,14 @@ all zones. * Cluster Create (provision spot / preemptable capacity): ```shell - python3 xpk.py cluster create \ + xpk cluster create \ --cluster xpk-test --tpu-type=v5litepod-16 \ --num-slices=4 --spot ``` * Cluster Create (DWS flex queued capacity): ```shell - python3 xpk.py cluster create \ + xpk cluster create \ --cluster xpk-test --tpu-type=v5litepod-16 \ --num-slices=4 --flex ``` @@ -69,7 +69,7 @@ all zones. * Cluster Create for Pathways: Pathways compatible cluster can be created using `cluster create-pathways`. 
```shell - python3 xpk.py cluster create-pathways \ + xpk cluster create-pathways \ --cluster xpk-pw-test \ --num-slices=4 --on-demand \ --tpu-type=v5litepod-16 @@ -79,7 +79,7 @@ Pathways compatible cluster can be created using `cluster create-pathways`. * Cluster Create for Ray: A cluster with KubeRay enabled and a RayCluster can be created using `cluster create-ray`. ```shell - python3 xpk.py cluster create-ray \ + xpk cluster create-ray \ --cluster xpk-rc-test \ --ray-version=2.39.0 \ --num-slices=4 --on-demand \ @@ -92,7 +92,7 @@ Pathways compatible cluster can be created using `cluster create-pathways`. For example, if a user creates a cluster with 4 slices: ```shell - python3 xpk.py cluster create \ + xpk cluster create \ --cluster xpk-test --tpu-type=v5litepod-16 \ --num-slices=4 --reservation=$RESERVATION_ID ``` @@ -101,7 +101,7 @@ Pathways compatible cluster can be created using `cluster create-pathways`. new slices: ```shell - python3 xpk.py cluster create \ + xpk cluster create \ --cluster xpk-test --tpu-type=v5litepod-16 \ --num-slices=8 --reservation=$RESERVATION_ID ``` @@ -111,13 +111,13 @@ Pathways compatible cluster can be created using `cluster create-pathways`. Use `--force` to skip prompts. ```shell - python3 xpk.py cluster create \ + xpk cluster create \ --cluster xpk-test --tpu-type=v5litepod-16 \ --num-slices=6 --reservation=$RESERVATION_ID # Skip delete prompts using --force. - python3 xpk.py cluster create --force \ + xpk cluster create --force \ --cluster xpk-test --tpu-type=v5litepod-16 \ --num-slices=6 --reservation=$RESERVATION_ID ``` @@ -127,13 +127,13 @@ Pathways compatible cluster can be created using `cluster create-pathways`. user when deleting slices. Use `--force` to skip prompts. ```shell - python3 xpk.py cluster create \ + xpk cluster create \ --cluster xpk-test --tpu-type=v4-8 \ --num-slices=4 --reservation=$RESERVATION_ID # Skip delete prompts using --force. 
- python3 xpk.py cluster create --force \ + xpk cluster create --force \ --cluster xpk-test --tpu-type=v4-8 \ --num-slices=4 --reservation=$RESERVATION_ID ``` @@ -164,7 +164,7 @@ This argument allows you to specify additional IP ranges (in CIDR notation) that * To create a private cluster and allow access to Control Plane only to your current machine: ```shell - python3 xpk.py cluster create \ + xpk cluster create \ --cluster=xpk-private-cluster \ --tpu-type=v4-8 --num-slices=2 \ --private @@ -173,7 +173,7 @@ This argument allows you to specify additional IP ranges (in CIDR notation) that * To create a private cluster and allow access to Control Plane only to your current machine and the IP ranges `1.2.3.0/24` and `1.2.4.5/32`: ```shell - python3 xpk.py cluster create \ + xpk cluster create \ --cluster=xpk-private-cluster \ --tpu-type=v4-8 --num-slices=2 \ --authorized-networks 1.2.3.0/24 1.2.4.5/32 @@ -199,7 +199,7 @@ You can create a Vertex AI Tensorboard for your cluster with `Cluster Create` co * Create Vertex AI Tensorboard in default region with default Tensorboard name: ```shell -python3 xpk.py cluster create \ +xpk cluster create \ --cluster xpk-test --num-slices=1 --tpu-type=v4-8 \ --create-vertex-tensorboard ``` @@ -209,7 +209,7 @@ will create a Vertex AI Tensorboard with the name `xpk-test-tb-instance` (* test.sh -python3 xpk.py workload create --cluster xpk-test \ +xpk workload create --cluster xpk-test \ --workload xpk-test-workload-base-image --command "bash test.sh" \ --tpu-type=v5litepod-16 --num-slices=1 ``` * Recommended Flow For Normal Sized Jobs (fewer than 10k accelerators): ```shell - python3 xpk.py workload create --cluster xpk-test \ + xpk workload create --cluster xpk-test \ --workload xpk-test-workload-base-image --command "bash custom_script.sh" \ --base-docker-image=gcr.io/your_dependencies_docker_image \ --tpu-type=v5litepod-16 --num-slices=1 @@ -56,17 +56,17 @@ workload. * Running with `--docker-image`: ```shell - python3 xpk.py workload create --cluster xpk-test \ + xpk workload create --cluster xpk-test \ --workload xpk-test-workload-base-image --command "bash test.sh" \ --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image ``` * Recommended Flow For Large Sized Jobs (more than 10k accelerators): ```shell - python3 xpk.py cluster cacheimage \ + xpk cluster cacheimage \ --cluster xpk-test --docker-image gcr.io/your_docker_image # Run workload create with the same image. -python3 xpk.py workload create --cluster xpk-test \ +xpk workload create --cluster xpk-test \ --workload xpk-test-workload-base-image --command "bash test.sh" \ --tpu-type=v5litepod-16 --num-slices=1 --docker-image=gcr.io/your_docker_image ``` diff --git a/docs/usage/gpu.md b/docs/usage/gpu.md index 20f9be41c..c9852fb9d 100644 --- a/docs/usage/gpu.md +++ b/docs/usage/gpu.md @@ -25,7 +25,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag. gcloud compute reservations list --project=$PROJECT_ID # Run cluster create with reservation. - python3 xpk.py cluster create \ + xpk cluster create \ --cluster xpk-test --device-type=h100-80gb-8 \ --num-nodes=2 \ --reservation=$RESERVATION_ID @@ -34,20 +34,20 @@ In order to use XPK for GPU, you can do so by using `device-type` flag. 
* Cluster Delete (deprovision capacity): ```shell - python3 xpk.py cluster delete \ + xpk cluster delete \ --cluster xpk-test ``` * Cluster List (see provisioned capacity): ```shell - python3 xpk.py cluster list + xpk cluster list ``` * Cluster Describe (see capacity): ```shell - python3 xpk.py cluster describe \ + xpk cluster describe \ --cluster xpk-test ``` @@ -55,7 +55,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag. * Cluster Cacheimage (enables faster start times): ```shell - python3 xpk.py cluster cacheimage \ + xpk cluster cacheimage \ --cluster xpk-test --docker-image gcr.io/your_docker_image \ --device-type=h100-80gb-8 ``` @@ -76,7 +76,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag. ```shell # Submit a workload - python3 xpk.py workload create \ + xpk workload create \ --cluster xpk-test --device-type h100-80gb-8 \ --workload xpk-test-workload \ --command="echo hello world" @@ -85,7 +85,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag. * Workload Delete (delete training job): ```shell - python3 xpk.py workload delete \ + xpk workload delete \ --workload xpk-test-workload --cluster xpk-test ``` @@ -94,7 +94,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag. * Workload Delete (delete all training jobs in the cluster): ```shell - python3 xpk.py workload delete \ + xpk workload delete \ --cluster xpk-test ``` @@ -104,7 +104,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag. * Filter by Job: `filter-by-job` ```shell - python3 xpk.py workload delete \ + xpk workload delete \ --cluster xpk-test --filter-by-job=$USER ``` @@ -113,7 +113,7 @@ In order to use XPK for GPU, you can do so by using `device-type` flag. * Filter by Status: `filter-by-status` ```shell - python3 xpk.py workload delete \ + xpk workload delete \ --cluster xpk-test --filter-by-status=QUEUED ``` diff --git a/docs/usage/inspector.md b/docs/usage/inspector.md index dd9b1f3dc..824e504b6 100644 --- a/docs/usage/inspector.md +++ b/docs/usage/inspector.md @@ -19,7 +19,7 @@ Inspector output is saved to a file. 
```shell - python3 xpk.py inspector \ + xpk inspector \ --cluster $CLUSTER_NAME \ --project $PROJECT_ID \ --zone $ZONE diff --git a/docs/usage/job.md b/docs/usage/job.md index 4f724ee9d..8f353274a 100644 --- a/docs/usage/job.md +++ b/docs/usage/job.md @@ -19,7 +19,7 @@ * Job List (see jobs submitted via batch command): ```shell - python3 xpk.py job ls --cluster xpk-test + xpk job ls --cluster xpk-test ``` * Example Job List Output: @@ -37,5 +37,5 @@ * Job Cancel (delete job submitted via batch command): ```shell - python3 xpk.py job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test + xpk job cancel xpk-def-app-profile-slurm-74kbv --cluster xpk-test ``` diff --git a/docs/usage/storage.md b/docs/usage/storage.md index 6f34f8036..4422c0718 100644 --- a/docs/usage/storage.md +++ b/docs/usage/storage.md @@ -30,7 +30,7 @@ To use the GCS FUSE with XPK you need to create a [Storage Bucket](https://conso Once it's ready you can use `xpk storage attach` with `--type=gcsfuse` command to attach a FUSE storage instance to your cluster: ```shell -python3 xpk.py storage attach test-fuse-storage --type=gcsfuse \ +xpk storage attach test-fuse-storage --type=gcsfuse \ --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE --mount-point='/test-mount-point' --readonly=false \ --bucket=test-bucket --size=1 --auto-mount=false @@ -55,7 +55,7 @@ A Filestore adapter lets you mount and access [Filestore instances](https://clou To create and attach a GCP Filestore instance to your cluster use `xpk storage create` command with `--type=gcpfilestore`: ```shell -python3 xpk.py storage create test-fs-storage --type=gcpfilestore \ +xpk storage create test-fs-storage --type=gcpfilestore \ --auto-mount=false --mount-point=/data-fs --readonly=false \ --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \ --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE @@ -64,7 +64,7 @@ python3 xpk.py storage create test-fs-storage --type=gcpfilestore \ You can also attach an existing Filestore instance to your cluster using `xpk storage attach` command: ```shell -python3 xpk.py storage attach test-fs-storage --type=gcpfilestore \ +xpk storage attach test-fs-storage --type=gcpfilestore \ --auto-mount=false --mount-point=/data-fs --readonly=false \ --size=1024 --tier=BASIC_HDD --access_mode=ReadWriteMany --vol=default \ --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE @@ -93,7 +93,7 @@ To use the GCS Parallelstore with XPK you need to create a [Parallelstore Instan Once it's ready you can use `xpk storage attach` with `--type=parallelstore` command to attach a Parallelstore instance to your cluster. Currently, attaching a Parallelstore is supported only by providing a manifest file. ```shell -python3 xpk.py storage attach test-parallelstore-storage --type=parallelstore \ +xpk storage attach test-parallelstore-storage --type=parallelstore \ --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \ --mount-point='/test-mount-point' --readonly=false \ --auto-mount=true \ @@ -117,7 +117,7 @@ To use the GCE PersistentDisk with XPK you need to create a [disk in GCE](https: Once it's ready you can use `xpk storage attach` with `--type=pd` command to attach a PersistentDisk instance to your cluster. Currently, attaching a PersistentDisk is supported only by providing a manifest file. 
```shell -python3 xpk.py storage attach test-pd-storage --type=pd \ +xpk storage attach test-pd-storage --type=pd \ --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \ --mount-point='/test-mount-point' --readonly=false \ --auto-mount=true \ @@ -141,7 +141,7 @@ To use the GCP Managed Lustre with XPK you need to create [an instance](https:// Once it's ready you can use `xpk storage attach` with `--type=lustre` command to attach a Managed Lustre instance to your cluster. Currently, attaching a Managed Lustre instance is supported only by providing a manifest file. ```shell -python3 xpk.py storage attach test-lustre-storage --type=lustre \ +xpk storage attach test-lustre-storage --type=lustre \ --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \ --mount-point='/test-mount-point' --readonly=false \ --auto-mount=true \ @@ -159,7 +159,7 @@ Parameters: ### List attached storages ```shell -python3 xpk.py storage list \ +xpk storage list \ --project=$PROJECT --cluster $CLUSTER --zone=$ZONE ``` @@ -168,7 +168,7 @@ python3 xpk.py storage list \ If you specified `--auto-mount=true` when creating or attaching a storage, then all workloads deployed on the cluster will have the volume attached by default. Otherwise, in order to have the storage attached, you have to add `--storage` parameter to `workload create` command: ```shell -python3 xpk.py workload create \ +xpk workload create \ --workload xpk-test-workload --command "echo goodbye" \ --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE \ --tpu-type=v5litepod-16 --storage=test-storage @@ -177,7 +177,7 @@ python3 xpk.py workload create \ ### Detaching storage ```shell -python3 xpk.py storage detach $STORAGE_NAME \ +xpk storage detach $STORAGE_NAME \ --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE ``` @@ -186,6 +186,6 @@ python3 xpk.py storage detach $STORAGE_NAME \ XPK allows you to remove Filestore instances easily with `xpk storage delete` command. **Warning:** this deletes all data contained in the Filestore! ```shell -python3 xpk.py storage delete test-fs-instance \ +xpk storage delete test-fs-instance \ --project=$PROJECT --cluster=$CLUSTER --zone=$ZONE ``` diff --git a/docs/usage/workloads.md b/docs/usage/workloads.md index 84cf36b61..c4723e84a 100644 --- a/docs/usage/workloads.md +++ b/docs/usage/workloads.md @@ -18,14 +18,14 @@ * Workload Create (submit training job): ```shell - python3 xpk.py workload create \ + xpk workload create \ --workload xpk-test-workload --command "echo goodbye" \ --cluster xpk-test \ --tpu-type=v5litepod-16 --project=$PROJECT ``` * Workload create(DWS flex with queued provisioning): ```shell - python3 xpk.py workload create \ + xpk workload create \ --workload xpk-test-workload --command "echo goodbye" \ --cluster xpk-test --flex \ --tpu-type=v5litepod-16 --project=$PROJECT @@ -35,7 +35,7 @@ Pathways workload example: ```shell - python3 xpk.py workload create-pathways \ + xpk workload create-pathways \ --workload xpk-pw-test \ --num-slices=1 \ --tpu-type=v5litepod-16 \ @@ -49,7 +49,7 @@ Pathways workload example: ```shell - python3 xpk.py workload create-pathways \ + xpk workload create-pathways \ --workload xpk-regular-test \ --num-slices=1 \ --tpu-type=v5litepod-16 \ @@ -62,7 +62,7 @@ Pathways in headless mode - Pathways now offers the capability to run JAX workloads in Vertex AI notebooks or in GCE VMs! Specify `--headless` with `workload create-pathways` when the user workload is not provided in a docker container. 
```shell - python3 xpk.py workload create-pathways --headless \ + xpk workload create-pathways --headless \ --workload xpk-pw-headless \ --num-slices=1 \ --tpu-type=v5litepod-16 \ @@ -96,7 +96,7 @@ A3 Ultra | `h200-141gb-8` A4 | `b200-8` ```shell -python3 xpk.py workload create \ +xpk workload create \ --workload=$WORKLOAD_NAME --command="echo goodbye" \ --cluster=$CLUSTER_NAME --device-type DEVICE_TYPE \ --zone=$COMPUTE_ZONE --project=$PROJECT_ID \ @@ -127,7 +127,7 @@ In order to run NCCL test on A3 machines check out [this guide](../../examples/n #### General Example: ```shell - python3 xpk.py workload create \ + xpk workload create \ --workload xpk-test-medium-workload --command "echo goodbye" --cluster \ xpk-test --tpu-type=v5litepod-16 --priority=medium ``` @@ -144,7 +144,7 @@ XPK will create a Vertex AI Experiment in `workload create` command and attach t * Create Vertex AI Experiment with default Experiment name: ```shell -python3 xpk.py workload create \ +xpk workload create \ --cluster xpk-test --workload xpk-workload \ --use-vertex-tensorboard ``` @@ -154,7 +154,7 @@ will create a Vertex AI Experiment with the name `xpk-test-xpk-workload` (* ### 3. xpk job info | sacct To see the details of the job you submitted you can use xpk job info command. ```shell -python3 xpk.py job info JOB NAME \ +xpk job info JOB NAME \ --project $PROJECT \ --zone $ZONE \ --cluster $CLUSTER From 83c6c2a66cf7c14964ba8623f8e5f8ea09b2c2db Mon Sep 17 00:00:00 2001 From: Konrad Kaim Date: Tue, 14 Oct 2025 10:52:00 +0000 Subject: [PATCH 8/9] docs: fix formatting of code snippet --- docs/usage/workloads.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/usage/workloads.md b/docs/usage/workloads.md index c4723e84a..592bbca90 100644 --- a/docs/usage/workloads.md +++ b/docs/usage/workloads.md @@ -24,11 +24,12 @@ --tpu-type=v5litepod-16 --project=$PROJECT ``` * Workload create(DWS flex with queued provisioning): - ```shell + ```shell xpk workload create \ --workload xpk-test-workload --command "echo goodbye" \ --cluster xpk-test --flex \ --tpu-type=v5litepod-16 --project=$PROJECT + ``` * Workload Create for Pathways: Pathways workload can be submitted using `workload create-pathways` on a Pathways enabled cluster (created with `cluster create-pathways`) From 0b509db0f5bdb48894a1d286c0d4e89fc4614370 Mon Sep 17 00:00:00 2001 From: Konrad Kaim Date: Tue, 14 Oct 2025 10:52:33 +0000 Subject: [PATCH 9/9] docs: add missing space --- docs/usage/workloads.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/usage/workloads.md b/docs/usage/workloads.md index 592bbca90..e541f7b54 100644 --- a/docs/usage/workloads.md +++ b/docs/usage/workloads.md @@ -23,7 +23,7 @@ --cluster xpk-test \ --tpu-type=v5litepod-16 --project=$PROJECT ``` -* Workload create(DWS flex with queued provisioning): +* Workload create (DWS flex with queued provisioning): ```shell xpk workload create \ --workload xpk-test-workload --command "echo goodbye" \