Lifecycle scripts should support API-Driven Configuration without provisioning_parameters.json #1028

@KeitaW

Summary

The HyperPod documentation now recommends API-Driven Configuration over the legacy provisioning_parameters.json approach for Slurm cluster setup. However, the base lifecycle scripts in 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ still require provisioning_parameters.json to function, creating a contradiction between the recommended API-driven path and the scripts that users are directed to use.

Current Behavior

  • lifecycle_script.py reads provisioning_parameters.json and passes its contents to downstream scripts (mount_fsx.sh, start_slurm.sh, setup_mariadb_accounting.sh, etc.)
  • on_create.sh expects provisioning_parameters.json to exist alongside the lifecycle scripts in S3
  • Users following the API-Driven Configuration path (using SlurmConfig and InstanceStorageConfigs in the CreateCluster API) still need to provide a provisioning_parameters.json file if they use the base lifecycle scripts

Expected Behavior

When using API-Driven Configuration, the lifecycle scripts should be able to derive all necessary configuration from resource_config.json (auto-generated by HyperPod) and the API-injected instance metadata, without requiring a separate provisioning_parameters.json.

Relevant Documentation

Suggested Changes

  1. Update lifecycle_script.py to detect whether API-Driven Configuration is in use (e.g., check for SlurmConfig in resource_config.json) and fall back to provisioning_parameters.json only when API-driven config is absent
  2. Update mount_fsx.sh to read FSx mount information from instance metadata or resource_config.json when the file system is configured through InstanceStorageConfigs
  3. Update start_slurm.sh to derive Slurm node assignments from resource_config.json when SlurmConfig is provided via the API
  4. Maintain backward compatibility: provisioning_parameters.json should continue to work for existing users
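
A rough sketch of how the detection and fallback in items 1 and 4 might look in lifecycle_script.py. This is illustrative only. The helper names (`load_json`, `has_api_driven_config`, `select_config_source`) are hypothetical, and the assumption that `SlurmConfig` shows up either at the top level of resource_config.json or inside an `InstanceGroups` entry has not been verified against what HyperPod actually writes.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def load_json(path: Path) -> Optional[dict]:
    """Return the parsed JSON object at `path`, or None if the file is missing."""
    if not path.is_file():
        return None
    with path.open() as f:
        return json.load(f)


def has_api_driven_config(resource_config: dict) -> bool:
    """Heuristic check for API-Driven Configuration.

    ASSUMPTION: a "SlurmConfig" key appears at the top level or on at least
    one entry under "InstanceGroups". Adjust to the real file layout.
    """
    if "SlurmConfig" in resource_config:
        return True
    groups = resource_config.get("InstanceGroups", [])
    return any("SlurmConfig" in group for group in groups)


def select_config_source(
    resource_config_path: Path,
    provisioning_parameters_path: Path,
) -> Tuple[str, dict]:
    """Prefer API-driven config; fall back to provisioning_parameters.json.

    Returns ("api", resource_config) or ("legacy", provisioning_parameters).
    """
    resource_config = load_json(resource_config_path)
    if resource_config is not None and has_api_driven_config(resource_config):
        return "api", resource_config

    # Backward compatibility: existing clusters keep using the legacy file.
    params = load_json(provisioning_parameters_path)
    if params is not None:
        return "legacy", params

    raise FileNotFoundError(
        "No SlurmConfig found in resource_config.json and "
        "provisioning_parameters.json is missing"
    )
```

Downstream scripts such as mount_fsx.sh and start_slurm.sh could then branch on the returned source tag, so the legacy code path stays untouched when provisioning_parameters.json is the only configuration present.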

Files Affected

  • 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/lifecycle_script.py
  • 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/on_create.sh
  • 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/mount_fsx.sh
  • 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/start_slurm.sh
