A script for setting the pool #177

Merged on May 1, 2025 (36 commits)
Commits
68cd948
the smallest thing that creates a pool
natemcintosh Jan 29, 2025
c833b25
added image and autoscale
natemcintosh Jan 29, 2025
97ffe8a
forgot pool id
natemcintosh Jan 29, 2025
1666c7f
more progress
natemcintosh Jan 30, 2025
a1c7dc9
Innovate team says to leave commented for now
natemcintosh Jan 30, 2025
33fb5e6
mostly working, but not how I envisioned it
natemcintosh Mar 7, 2025
6040972
Merge branch 'main' into nam-create-pool-script
natemcintosh Mar 7, 2025
5dc3fba
replace dependency with constant string
natemcintosh Mar 10, 2025
8549fc0
attempt to refactor create pool workflow
natemcintosh Mar 10, 2025
f0f6d31
needed to run az login?
natemcintosh Mar 10, 2025
62b26d2
maybe this login is not necessary
natemcintosh Mar 11, 2025
c268b04
make cred names match GH creds
natemcintosh Mar 11, 2025
1cc381f
make secrets available as env vars
natemcintosh Mar 11, 2025
0e9b797
try different resource group
natemcintosh Mar 11, 2025
55daa4f
Merge branch 'main' into nam-create-pool-script
natemcintosh Mar 11, 2025
5bf7b92
using wrong env
natemcintosh Mar 11, 2025
e273538
Merge branch 'main' into nam-create-pool-script
natemcintosh Mar 12, 2025
bc3f47f
try printing out the response
natemcintosh Apr 15, 2025
e9de774
Merge branch 'main' into nam-create-pool-script
natemcintosh Apr 15, 2025
b332dc1
try just calling the az command
natemcintosh Apr 15, 2025
4fa6f4f
Merge branch 'main' into nam-create-pool-script
natemcintosh Apr 15, 2025
09f2139
use env pool id
natemcintosh Apr 15, 2025
a94c1c5
This works, all the way through post
natemcintosh Apr 15, 2025
bb27868
try getting rid of the config toml stuff
natemcintosh Apr 15, 2025
6d656b1
once again using the wrong secret
natemcintosh Apr 16, 2025
9188dae
updating subnet_id var
giomrella Apr 16, 2025
03d1246
updating container_image_name var
giomrella Apr 16, 2025
511ee77
use url not server
natemcintosh Apr 17, 2025
586ae59
use server not url
natemcintosh Apr 17, 2025
190f8c0
moving from self hosted runner to runner-action (#253)
giomrella Apr 18, 2025
cf47711
keep env vars up to date
natemcintosh Apr 18, 2025
161331f
Removed invalid condition referencing steps.check_pool_id.outputs.poo…
giomrella Apr 18, 2025
ddf8495
Merge branch 'main' into nam-create-pool-script
micahwiesner67 Apr 21, 2025
9825eec
update docs on script
natemcintosh Apr 21, 2025
0335295
remove unused files
natemcintosh Apr 21, 2025
203ecc9
Merge branch 'main' into nam-create-pool-script
natemcintosh May 1, 2025
137 changes: 137 additions & 0 deletions .github/scripts/create_pool.py
@@ -0,0 +1,137 @@
# /// script
# requires-python = ">=3.13"
# dependencies = [
# "azure-batch",
# "azure-identity",
# "azure-mgmt-batch",
# "msrest",
# ]
# ///
"""
If running locally, use:
uv run --env-file .env .github/scripts/create_pool.py
Requires a `.env` file with at least the following:
BATCH_ACCOUNT="<batch account name>"
SUBSCRIPTION_ID="<azure subscription id>"
BATCH_USER_ASSIGNED_IDENTITY="<user assigned identity>"
AZURE_BATCH_ACCOUNT_CLIENT_ID="<azure client id>"
PRINCIPAL_ID="<principal id>"
CONTAINER_REGISTRY_SERVER="<container registry server>"
CONTAINER_IMAGE_NAME="https://full-cr-server/<container image name>:tag"
POOL_ID="<pool id>"
SUBNET_ID="<subnet id>"
RESOURCE_GROUP="<resource group name>"

If running in CI, all of the above environment variables should be set in the repo
secrets.
"""

import os

from azure.identity import DefaultAzureCredential
from azure.mgmt.batch import BatchManagementClient

AUTO_SCALE_FORMULA = """
// In this example, the pool size
// is adjusted based on the number of tasks in the queue.
// Note that both comments and line breaks are acceptable in formula strings.

// Get pending tasks for the past 5 minutes.
$samples = $ActiveTasks.GetSamplePercent(TimeInterval_Minute * 5);
// If we have fewer than 70 percent data points, we use the last sample point, otherwise we use the maximum of last sample point and the history average.
$tasks = $samples < 70 ? max(0, $ActiveTasks.GetSample(1)) :
max( $ActiveTasks.GetSample(1), avg($ActiveTasks.GetSample(TimeInterval_Minute * 5)));
// If number of pending tasks is not 0, set targetVM to pending tasks, otherwise half of current dedicated.
$targetVMs = $tasks > 0 ? $tasks : max(0, $TargetDedicatedNodes / 2);
// The pool size is capped at 100, if target VM value is more than that, set it to 100.
cappedPoolSize = 100;
$TargetDedicatedNodes = max(0, min($targetVMs, cappedPoolSize));
// Set node deallocation mode - keep nodes active only until tasks finish
$NodeDeallocationOption = taskcompletion;
"""


def main() -> None:
    # Create the BatchManagementClient
    batch_mgmt_client = BatchManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id=os.environ["SUBSCRIPTION_ID"],
    )

    # Assemble the pool parameters
    pool_parameters = {
        "identity": {
            "type": "UserAssigned",
            "userAssignedIdentities": {
                os.environ["BATCH_USER_ASSIGNED_IDENTITY"]: {
                    "clientId": os.environ["AZURE_BATCH_ACCOUNT_CLIENT_ID"],
                    "principalId": os.environ["PRINCIPAL_ID"],
                }
            },
        },
        "properties": {
            "vmSize": "STANDARD_d4d_v5",
            "interNodeCommunication": "Disabled",
            "taskSlotsPerNode": 1,
            "taskSchedulingPolicy": {"nodeFillType": "Spread"},
            "deploymentConfiguration": {
                "virtualMachineConfiguration": {
                    "imageReference": {
                        "publisher": "microsoft-dsvm",
                        "offer": "ubuntu-hpc",
                        "sku": "2204",
                        "version": "latest",
                    },
                    "nodeAgentSkuId": "batch.node.ubuntu 22.04",
                    "containerConfiguration": {
                        "type": "dockercompatible",
                        "containerImageNames": [os.environ["CONTAINER_IMAGE_NAME"]],
                        "containerRegistries": [
                            {
                                "identityReference": {
                                    "resourceId": os.environ[
                                        "BATCH_USER_ASSIGNED_IDENTITY"
                                    ]
                                },
                                "registryServer": os.environ[
                                    "CONTAINER_REGISTRY_SERVER"
                                ],
                            }
                        ],
                    },
                }
            },
            "networkConfiguration": {
                "subnetId": os.environ["SUBNET_ID"],
                "publicIPAddressConfiguration": {"provision": "NoPublicIPAddresses"},
                "dynamicVnetAssignmentScope": "None",
            },
            "scaleSettings": {
                "autoScale": {
                    "evaluationInterval": "PT5M",
                    "formula": AUTO_SCALE_FORMULA,
                }
            },
            "resizeOperationStatus": {
                "targetDedicatedNodes": 1,
                "nodeDeallocationOption": "Requeue",
                "resizeTimeout": "PT15M",
                "startTime": "2023-07-05T13:18:25.7572321Z",
            },
            "currentDedicatedNodes": 0,
            "currentLowPriorityNodes": 0,
            "targetNodeCommunicationMode": "Simplified",
            "currentNodeCommunicationMode": "Simplified",
        },
    }

    batch_mgmt_client.pool.create(
        resource_group_name=os.environ["RESOURCE_GROUP"],
        account_name=os.environ["BATCH_ACCOUNT"],
        pool_name=os.environ["POOL_ID"],
        parameters=pool_parameters,
    )


if __name__ == "__main__":
    main()
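
Reviewer note: the arithmetic in `AUTO_SCALE_FORMULA` can be traced by hand with a rough Python rendering. This is only a sketch under stated assumptions: the function name `target_dedicated_nodes` and its parameters are hypothetical stand-ins for the values the Batch service would supply (`$ActiveTasks.GetSamplePercent(...)`, `$ActiveTasks.GetSample(...)`, and the current `$TargetDedicatedNodes`), and it does not reproduce how the service itself samples, evaluates, or rounds the formula.

```python
def target_dedicated_nodes(
    sample_percent: float,
    last_sample: float,
    avg_sample: float,
    current_target: float,
    cap: float = 100,
) -> float:
    """Mirror the arithmetic of AUTO_SCALE_FORMULA for hand-checking.

    All parameters are hypothetical inputs standing in for Batch service values.
    """
    # With fewer than 70 percent of samples available, trust only the latest
    # sample; otherwise take the larger of the latest sample and the average.
    if sample_percent < 70:
        tasks = max(0, last_sample)
    else:
        tasks = max(last_sample, avg_sample)
    # With no pending tasks, shrink toward half of the current target.
    target_vms = tasks if tasks > 0 else max(0, current_target / 2)
    # Clamp the result to the range [0, cap].
    return max(0, min(target_vms, cap))
```

For instance, with full sample coverage, 250 pending tasks are clamped to the cap of 100, while an idle queue with a current target of 8 yields half of that.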
194 changes: 118 additions & 76 deletions .github/workflows/containers-and-az-pool.yaml
@@ -38,6 +38,10 @@ env:
jobs:

build-pipeline-image:
permissions:
id-token: write # This is required for requesting the JWT
contents: read # This is required for actions/checkout
packages: write # This is required for ACR import
runs-on: ubuntu-latest
name: Build image

@@ -84,40 +88,66 @@ jobs:

acr-import:
needs: build-pipeline-image
runs-on: cfa-cdcgov-aca
runs-on: ubuntu-latest
environment: production
permissions:
id-token: write # This is required for requesting the JWT
contents: read # This is required for actions/checkout
packages: write # This is required for ACR import

name: Copy image from GHCR to ACR
outputs:
tag: ${{ needs.build-pipeline-image.outputs.tag }}
steps:

- name: Azure login with OIDC
uses: azure/login@v2
# From: https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-cloud-providers#requesting-the-jwt-using-the-actions-core-toolkit
- name: Install OIDC Client from Core Package
run: npm install @actions/[email protected] @actions/http-client
- name: Get Id Token
uses: actions/github-script@v7
id: idtoken
with:
creds: ${{ secrets.EDAV_CFA_PREDICT_NNHT_SP }}
script: |
const coredemo = require('@actions/core')
const id_token = await coredemo.getIDToken('api://AzureADTokenExchange')
coredemo.setOutput('id_token', id_token)

- name: Copy Image
run: |
IMAGE_TAG=${{ env.IMAGE_NAME }}:${{ needs.build-pipeline-image.outputs.tag }}
az acr import --name ${{ env.REGISTRY }} \
--source "ghcr.io/cdcgov/$IMAGE_TAG" \
--username ${{ github.actor }} \
--password ${{ secrets.GITHUB_TOKEN }} \
--image "$IMAGE_TAG" \
--force && echo 'Copied image!'
if [ $? -ne 0 ]; then
echo "Failed to copy image"
fi
- name: ACR Import
uses: CDCgov/cfa-actions/[email protected]
with:
github_app_id: ${{ secrets.CDCENT_ACTOR_APP_ID }}
github_app_pem: ${{ secrets.CDCENT_ACTOR_APP_PEM }}
wait_for_completion: true
print_logs: true
script: |
echo "Logging into Azure CLI"
az login --service-principal \
--username ${{ secrets.AZURE_NNHT_SP_CLIENT_ID }} \
--tenant ${{ secrets.TENANT_ID }} \
--federated-token ${{ steps.idtoken.outputs.id_token }} \
--output none

IMAGE_TAG=${{ env.IMAGE_NAME }}:${{ needs.build-pipeline-image.outputs.tag }}
az acr import --name ${{ env.REGISTRY }} \
--source "ghcr.io/cdcgov/$IMAGE_TAG" \
--username ${{ github.actor }} \
--password ${{ secrets.GITHUB_TOKEN }} \
--image "$IMAGE_TAG" \
--force && echo 'Copied image!'

if [ $? -ne 0 ]; then
echo "Failed to copy image"
fi

batch-pool:

name: Create Batch Pool and Submit Jobs
runs-on: cfa-cdcgov-aca
runs-on: ubuntu-latest
needs: acr-import

environment: production
permissions:
contents: read
packages: write
id-token: write

env:
TAG: ${{ needs.acr-import.outputs.tag }}
@@ -136,65 +166,77 @@ jobs:
id: checkout_repo
uses: actions/checkout@v4

# This step is only needed during the action to write the
# config file. Users can have a config file stored in their VAP
# sessions. In the future, we will have the config.toml file
# distributed with the repo (encrypted).
- name: Writing out config file
run: |
cat <<EOF > pool-config-${{ github.sha }}.toml
${{ secrets.POOL_CONFIG_TOML }}
EOF

# Replacing placeholders in the config file
sed -i 's|{{ IMAGE_NAME }}|${{ env.REGISTRY }}${{ env.IMAGE_NAME }}:${{ env.TAG }}|g' pool-config-${{ github.sha }}.toml
sed -i 's|{{ VM_SIZE }}|${{ env.VM_SIZE }}|g' pool-config-${{ github.sha }}.toml
sed -i 's|{{ BATCH_SUBNET_ID }}|${{ env.BATCH_SUBNET_ID }}|g' pool-config-${{ github.sha }}.toml
sed -i 's|{{ POOL_ID }}|${{ env.POOL_ID }}|g' pool-config-${{ github.sha }}.toml


- name: Login to Azure with NNH Service Principal
id: azure_login_2
uses: azure/login@v2
# From: https://stackoverflow.com/a/58035262/2097171
- name: Extract branch name
shell: bash
run: echo "branch=${GITHUB_HEAD_REF:-${GITHUB_REF#refs/heads/}}" >> $GITHUB_OUTPUT
id: get-branch

# From: https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-cloud-providers#requesting-the-jwt-using-the-actions-core-toolkit
- name: Install OIDC Client from Core Package
run: npm install @actions/[email protected] @actions/http-client
- name: Get Id Token
uses: actions/github-script@v7
id: idtoken
with:
# managed by EDAV. Contact Amit Mantri or Jon Kislin if you have issues.
creds: ${{ secrets.EDAV_CFA_PREDICT_NNHT_SP }}

#########################################################################
# Checking if the pool exists
# This is done via az batch pool list. If there is no pool matching the
# pool id (which is a function of the tag, i.e., branch name), then we
# pool-exists will be ''.
#########################################################################
- name: Check if pool exists
id: check_pool_id
run: |

az batch account login \
--resource-group ${{ secrets.PRD_RESOURCE_GROUP }} \
--name "${{ env.BATCH_ACCOUNT }}"
script: |
const coredemo = require('@actions/core')
const id_token = await coredemo.getIDToken('api://AzureADTokenExchange')
coredemo.setOutput('id_token', id_token)

az batch pool list \
--output tsv \
--filter "(id eq '${{ env.POOL_ID }}')" \
--query "[].[id, allocationState, creationTime]" > \
pool-list-${{ github.sha }}

echo "pool-exists=$(cat pool-list-${{ github.sha }})" >> \
$GITHUB_OUTPUT

- name: Create cfa-epinow2-pipeline Pool
id: create_batch_pool

# This is a conditional step that will only run if the pool does not
# exist
if: ${{ steps.check_pool_id.outputs.pool-exists == '' }}

# The call to the az cli that actually generates the pool
run: |
# Running the python script azure/pool.py passing the config file
# as an argument
pip install -r azure/requirements.txt
python3 azure/pool.py \
pool-config-${{ github.sha }}.toml \
batch-autoscale-formula.txt
# Removed invalid condition referencing steps.check_pool_id.outputs.pool-exists
uses: CDCgov/cfa-actions/[email protected]
with:
github_app_id: ${{ secrets.CDCENT_ACTOR_APP_ID }}
github_app_pem: ${{ secrets.CDCENT_ACTOR_APP_PEM }}
wait_for_completion: true
print_logs: true
script: |
echo "Setting env vars"
export BATCH_ACCOUNT=${{ secrets.BATCH_ACCOUNT }}
export SUBSCRIPTION_ID=${{ secrets.SUBSCRIPTION_ID }}
export BATCH_USER_ASSIGNED_IDENTITY=${{ secrets.BATCH_USER_ASSIGNED_IDENTITY }}
export AZURE_BATCH_ACCOUNT_CLIENT_ID=${{ secrets.AZURE_BATCH_ACCOUNT_CLIENT_ID }}
export PRINCIPAL_ID=${{ secrets.PRINCIPAL_ID }}
export CONTAINER_REGISTRY_SERVER=${{ secrets.CONTAINER_REGISTRY_SERVER }}
export CONTAINER_REGISTRY_USERNAME=${{ secrets.CONTAINER_REGISTRY_USERNAME }}
export CONTAINER_REGISTRY_PASSWORD=${{ secrets.CONTAINER_REGISTRY_PASSWORD }}
export CONTAINER_REGISTRY_URL=${{ secrets.CONTAINER_REGISTRY_URL }}
export CONTAINER_IMAGE_NAME=${{ env.REGISTRY }}${{ env.IMAGE_NAME }}:${{ env.TAG }}
export POOL_ID=${{ env.POOL_ID }}
export SUBNET_ID=${{ secrets.BATCH_SUBNET_ID }}
export RESOURCE_GROUP=${{ secrets.RESOURCE_GROUP }}


echo "Logging into Azure CLI"
az login --service-principal \
--username ${{ secrets.AZURE_NNHT_SP_CLIENT_ID }} \
--tenant ${{ secrets.TENANT_ID }} \
--federated-token ${{ steps.idtoken.outputs.id_token }} \
--output none

echo "Logging into batch"
az batch account login \
--resource-group ${{ secrets.PRD_RESOURCE_GROUP }} \
--name "${{ env.BATCH_ACCOUNT }}"

echo "Listing batch pools"
az batch pool list \
--output tsv \
--filter "(id eq '${{ env.POOL_ID }}')" \
--query "[].[id, allocationState, creationTime]" > pool-list-${{ github.sha }}

if [ -s pool-list-${{ github.sha }} ]; then
echo "Pool already exists!"
else
CURRENT_BRANCH="${{ steps.get-branch.outputs.branch }}"
echo "Cloning repo at branch '$CURRENT_BRANCH'"
git clone -b "$CURRENT_BRANCH" https://github.com/${{ github.repository }}.git
cd cfa-epinow2-pipeline

echo "Running create pool script"
uv run .github/scripts/create_pool.py
fi
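
Reviewer note: the step above exports the variables by hand before calling `uv run`, so a missing secret only surfaces as a `KeyError` partway through `create_pool.py`. A minimal sketch of failing fast instead, assuming the ten names listed in the script's docstring are the full required set; `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names that do not appear in the PR.

```python
import os
from collections.abc import Mapping

# Names copied from the docstring of .github/scripts/create_pool.py.
REQUIRED_ENV_VARS = (
    "BATCH_ACCOUNT",
    "SUBSCRIPTION_ID",
    "BATCH_USER_ASSIGNED_IDENTITY",
    "AZURE_BATCH_ACCOUNT_CLIENT_ID",
    "PRINCIPAL_ID",
    "CONTAINER_REGISTRY_SERVER",
    "CONTAINER_IMAGE_NAME",
    "POOL_ID",
    "SUBNET_ID",
    "RESOURCE_GROUP",
)


def missing_env_vars(environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]
```

A caller could run `missing_env_vars()` at the top of `main()` and exit with a message naming every missing variable at once, rather than discovering them one `KeyError` at a time.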
3 changes: 3 additions & 0 deletions .gitignore
@@ -389,3 +389,6 @@ azure/*.toml
# Careful with Secrets!
*.env
*.env.gpg

# vscode settings
.vscode/
1 change: 1 addition & 0 deletions NEWS.md
@@ -44,3 +44,4 @@ This initial release establishes minimal feature parity with the internal EpiNow
* Comprehensive documentation of pipeline code and validation of input data, parameters, and model run configs
* Set up comprehensive logging of model runs and handle pipeline failures to preserve logs where possible
* Automatically download and upload inputs and outputs from Azure Blob Storage
* A new script for building the pool, runnable from the CLI or GitHub Actions. It requires `uv` to be installed; `uv` then handles Python and dependency management based on the inline script metadata.
2 changes: 1 addition & 1 deletion README.md
@@ -112,7 +112,7 @@ The project has multiple GitHub Actions workflows to automate the CI/CD process.

- **Create Batch Pool and Submit Jobs** (`batch-pool`): This final job creates a new Azure batch pool with id `cfa-epinow2-pool-[branch name]` if it doesn't already exist. Additionally, if the commit message contains the string "`[delete pool]`", the pool is deleted.

Both container tags and pool ids are based on the branch name, making it compatible with having multiple pipelines running simultaneously. The pool creation depends on Azure's Python SDK (see the file [azure/pool.py](azure/pool.py)), with the necessary configuration in a toml file stored as a secret in the repository (`POOL_CONFIG_TOML`). A template of the configuration file can be found at [azure/pool-config-template.toml](azure/pool-config-template.toml). The current configuration file is stored in the project's Azure datalake under the name `cfa-epinow2-pipeline-config.toml.toml`.
Both container tags and pool ids are based on the branch name, making it compatible with having multiple pipelines running simultaneously. The pool creation depends on Azure's Python SDK (see the file [.github/scripts/create_pool.py](.github/scripts/create_pool.py)), with the necessary credentials listed in a string at the top of the script.

> [!IMPORTANT]
> The CI will fail with branch names that are not valid tag names for containers. For more information, see the official Azure documentation [here](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-name-rules#microsoftcontainerregistry).
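
Reviewer note: a branch name can be checked locally before pushing. This is a minimal sketch assuming the container tag is the branch name verbatim (the README only says tags are "based on" the branch name, so the workflow may transform it) and that Docker's tag grammar applies: at most 128 characters drawn from ASCII letters, digits, underscores, periods, and hyphens, not starting with a period or hyphen. `is_valid_image_tag` is a hypothetical helper, not part of the repository.

```python
import re

# Docker tag grammar: one letter, digit, or underscore, followed by up to
# 127 letters, digits, underscores, periods, or hyphens.
TAG_PATTERN = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def is_valid_image_tag(tag: str) -> bool:
    """Return True if `tag` matches the tag grammar in full (hypothetical helper)."""
    return TAG_PATTERN.fullmatch(tag) is not None
```

Under these assumptions a branch such as `nam-create-pool-script` passes, while one containing a slash, such as `feature/new-model`, does not.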