Contributor

@himani2411 himani2411 commented Aug 8, 2025

Description of changes

  • pcluster_topology_generator.py creates the topology.conf file on the HeadNode.
  • We create this file only for GB200 instances, relying on the GB200 CLI change in which the p6egb200_block_sizes node attribute is added only for GB200 instances.
    We also create slurm_parallelcluster_topology.conf, which is included in slurm.conf and is created on any HeadNode instance; however, we enable TopologyPlugin=topology/block only for GB200 instances.
# slurm_parallelcluster_topology.conf is managed by the pcluster processes.
# Do not modify.
# Please use CustomSlurmSettings in the ParallelCluster configuration file to add user-specific slurm configuration
# options
# TOPOLOGY Plugin
TopologyPlugin=topology/block

Tests

AL2023

  1. Tested with the settings below, with the checks for CAPACITY_BLOCK and the GB200 instance type commented out:
DevSettings:
  Cookbook:
    ExtraChefAttributes: |
      {"cluster": {"p6egb200_block_sizes": "2" }}
#    ChefCookbook: s3://parallelcluster-d3b311f04d04f70d-v1-do-not-delete/parallelcluster/3.14.0/cookbooks/aws-parallelcluster-cookbook-3.14.0.tgz
    ChefCookbook: https://github.com/himani2411/aws-parallelcluster-cookbook/tarball/slurm-topo

Cluster creation is successful:

> cat slurm_parallelcluster_topology.conf
# slurm_parallelcluster_topology.conf is managed by the pcluster processes.
# Do not modify.
# Please use CustomSlurmSettings in the ParallelCluster configuration file to add user-specific slurm configuration
# options
# TOPOLOGY Plugin
TopologyPlugin=topology/block
> cat topology.conf
# This file is automatically generated by pcluster

BlockName=Block1  Nodes=queue1-st-cr1-[1-2]
BlockSizes=2
> scontrol show topology
BlockName=Block1 BlockIndex=0 Nodes=queue1-st-cr1-[1-2] BlockSize=2
  2. Force-updated the cluster to use p6egb200_block_sizes: 1 and updated the scheduling section to use 1 static node:
>scontrol show topology
BlockName=Block1 BlockIndex=0 Nodes=queue1-st-cr1-[1-1] BlockSize=1

and

cat topology.conf
# This file is automatically generated by pcluster

BlockName=Block1  Nodes=queue1-st-cr1-[1-1]
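  3. Force-updated the cluster to use no ExtraChefAttributes and updated the scheduling section to use 0 static nodes. The update removes topology.conf and the TopologyPlugin line from slurm_parallelcluster_topology.conf: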
  ls -al
total 72
drwxr-xr-x  4 root root 16384 Aug  8 19:31 .
drwxr-xr-x. 9 root root    92 Aug  8 19:18 ..
-rw-r--r--  1 root root   249 Aug  8 19:18 cgroup.conf
-rw-r--r--  1 root root   174 Aug  8 19:18 gres.conf
drwxr-xr-x  3 root root 16384 Aug  8 19:18 pcluster
drwxr-xr-x  4 root root    38 Aug  8 19:18 scripts
-rw-r--r--  1 root root  2304 Aug  8 19:18 slurm.conf
-rwxr-xr-x  1 root root   233 Aug  8 19:18 slurm.csh
-rwxr-xr-x  1 root root   140 Aug  8 19:18 slurm.sh
-rw-r--r--  1 root root   394 Aug  8 19:31 slurm_parallelcluster.conf
-rw-r--r--  1 root root   155 Aug  8 19:31 slurm_parallelcluster_cgroup.conf
-rw-r--r--  1 root root   161 Aug  8 19:31 slurm_parallelcluster_gres.conf
-rw-r--r--  1 root root   168 Aug  8 19:31 slurm_parallelcluster_slurmdbd.conf
-rw-r--r--  1 root root   237 Aug  8 19:31 slurm_parallelcluster_topology.conf
$ cat slurm_parallelcluster_topology.conf
# slurm_parallelcluster_topology.conf is managed by the pcluster processes.
# Do not modify.
# Please use CustomSlurmSettings in the ParallelCluster configuration file to add user-specific slurm configuration
# options
# TOPOLOGY Plugin
[ec2-user@ip-192-168-9-4 etc]$
[ec2-user@ip-192-168-9-4 etc]$ scontrol show topology
[ec2-user@ip-192-168-9-4 etc]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
queue1*      up   infinite     10  idle~ queue1-dy-cr1-[1-10]
[ec2-user@ip-192-168-9-4 etc]$
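
The three runs above (write the file, rewrite it with a new block size, remove it when no blocks remain) can be sketched roughly as follows. This is a minimal illustration, not the code from pcluster_topology_generator.py: the function names `render_topology_conf` and `sync_topology_conf` are hypothetical, and the output layout is copied from the `cat topology.conf` output of the first run.

```python
from pathlib import Path
from typing import List

# Header line as it appears in the topology.conf output shown in the tests.
HEADER = "# This file is automatically generated by pcluster"


def render_topology_conf(blocks: List[str], block_sizes: str) -> str:
    """Render topology.conf text (hypothetical helper, not the PR's API).

    `blocks` holds one Slurm node-list expression per block,
    e.g. "queue1-st-cr1-[1-2]". `block_sizes` is the BlockSizes value.
    """
    lines = [HEADER, ""]
    for index, nodes in enumerate(blocks, start=1):
        # Two spaces between the fields, matching the output shown above.
        lines.append(f"BlockName=Block{index}  Nodes={nodes}")
    lines.append(f"BlockSizes={block_sizes}")
    return "\n".join(lines) + "\n"


def sync_topology_conf(path: Path, blocks: List[str], block_sizes: str) -> bool:
    """Write the file when blocks exist, otherwise remove it.

    Returns True if the file exists after the call.
    """
    if blocks and block_sizes:
        path.write_text(render_topology_conf(blocks, block_sizes))
        return True
    # Assumption: cleanup simply deletes a stale file (requires Python 3.8+).
    path.unlink(missing_ok=True)
    return False
```

Calling `sync_topology_conf(Path("topology.conf"), ["queue1-st-cr1-[1-2]"], "2")` would write a file in the same layout as the first test run, and calling it with an empty block list would delete that file again.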

AL2023

  • Normal cluster without ExtraChefAttributes: scontrol show topology returns nothing except an empty line.

AL2

  • Tested an AL2 cluster without ExtraChefAttributes: none of the topology configuration files are created.
    * Force-updated the same cluster to use ExtraChefAttributes: | {"cluster": {"p6egb200_block_sizes": "2" }} and updated the scheduling section to use 2 nodes. As a result, none of the conf files were created.

STATUS: DRAFT as I have commits added for testing

References

aws/aws-parallelcluster#6928
#3001
#2996
aws/aws-parallelcluster#6930

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@himani2411 himani2411 requested review from a team as code owners August 8, 2025 00:11
@himani2411 himani2411 marked this pull request as draft August 8, 2025 00:11

codecov bot commented Aug 8, 2025

Codecov Report

❌ Patch coverage is 68.35443% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.24%. Comparing base (6127e18) to head (33287b8).
⚠️ Report is 18 commits behind head on develop.

Files with missing lines Patch % Lines
...ad_node_slurm/slurm/pcluster_topology_generator.py 68.35% 25 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3002      +/-   ##
===========================================
- Coverage    75.50%   75.24%   -0.27%     
===========================================
  Files           23       24       +1     
  Lines         2356     2436      +80     
===========================================
+ Hits          1779     1833      +54     
- Misses         577      603      +26     
Flag Coverage Δ
unittests 75.24% <68.35%> (-0.27%) ⬇️

@himani2411 himani2411 force-pushed the slurm-topo branch 2 times, most recently from 3a43f29 to 84276a2 Compare August 9, 2025 02:58
owner 'root'
group 'root'
mode '0644'
variables(is_amazon_linux_2: platform?('amazon') && node['platform_version'] == "2")
Contributor

We are injecting this variable to skip the inclusion of the topology configuration in slurm.conf.
What about a more generic is_topology_plugin_supported rather than an OS-specific variable?

The advantage of the generic approach is that it is more open to future extensions. For example, if tomorrow we must skip it for another OS, we will not need another variable dedicated to that OS; we can simply include the OS in the condition.

Contributor Author

I will update it and name it is_block_topology_plugin_supported.

If we add more plugins in the future, AL2 will not necessarily be unsupported for them.
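
The two positions in this thread (one generic check versus one check per plugin) can be combined, as in the rough sketch below. The real change lives in the Chef (Ruby) recipe; this Python version only illustrates the logic. The table `UNSUPPORTED_OS_BY_PLUGIN` and both function signatures are hypothetical, and the `("amazon", "2")` pair mirrors the `platform?('amazon') && node['platform_version'] == "2"` condition quoted in the diff.

```python
from typing import Dict, Set, Tuple

# Hypothetical table: plugin name -> (platform, platform_version) pairs
# on which that plugin must be skipped. Only the AL2 entry comes from the PR.
UNSUPPORTED_OS_BY_PLUGIN: Dict[str, Set[Tuple[str, str]]] = {
    "topology/block": {("amazon", "2")},
}


def is_topology_plugin_supported(plugin: str, platform: str, platform_version: str) -> bool:
    """Generic check: a plugin is supported unless the OS is listed for it."""
    unsupported = UNSUPPORTED_OS_BY_PLUGIN.get(plugin, set())
    return (platform, platform_version) not in unsupported


def is_block_topology_plugin_supported(platform: str, platform_version: str) -> bool:
    """Plugin-specific wrapper, matching the name chosen in the reply above."""
    return is_topology_plugin_supported("topology/block", platform, platform_version)
```

With this shape, skipping the block plugin on another OS means adding one entry to the table, and a future plugin gets its own entry without inheriting the AL2 restriction.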

@himani2411 himani2411 marked this pull request as ready for review August 11, 2025 23:06
@himani2411 himani2411 enabled auto-merge (rebase) August 12, 2025 13:03
@himani2411 himani2411 disabled auto-merge August 12, 2025 13:03
@himani2411 himani2411 enabled auto-merge (squash) August 12, 2025 13:03
gmarciani previously approved these changes Aug 12, 2025
@himani2411 himani2411 merged commit 7ffb589 into aws:develop Aug 12, 2025
28 of 30 checks passed
continue

# Check for if reservation is for NVLink and size matches min_block_size_list
if compute_resource_config.get("InstanceType") == "p6e-gb200.36xlarge":

Why just this instance type? What about .72xlarge? Is there any way to determine these more programmatically?
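
One possible direction for the question above is to match on the instance family instead of a single size. This is only a sketch: the constant `BLOCK_TOPOLOGY_FAMILIES` and the helper names are hypothetical, and it assumes that every size in the `p6e-gb200` family should be treated the same way, which the PR does not confirm.

```python
# Assumption: block topology applies to whole instance families.
# Only "p6e-gb200" is taken from the diff; nothing else is known to qualify.
BLOCK_TOPOLOGY_FAMILIES = frozenset({"p6e-gb200"})


def instance_family(instance_type: str) -> str:
    """Return the part of an EC2 instance type before the first dot."""
    return instance_type.split(".", 1)[0]


def supports_block_topology(compute_resource_config: dict) -> bool:
    """Check the InstanceType of a compute resource against the family set."""
    instance_type = compute_resource_config.get("InstanceType") or ""
    return instance_family(instance_type) in BLOCK_TOPOLOGY_FAMILIES
```

A family-level set keeps the condition in one place, so supporting another size or family would not require touching the generator logic itself.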
