[SlurmTopo] Add support for slurm Block Topology #3002
Conversation
...aws-parallelcluster-slurm/files/default/head_node_slurm/slurm/pcluster_topology_generator.py
Codecov Report ❌ Patch coverage is (truncated)

@@ Coverage Diff @@
##           develop    #3002      +/-   ##
===========================================
- Coverage    75.50%   75.24%   -0.27%
===========================================
  Files           23       24       +1
  Lines         2356     2436      +80
===========================================
+ Hits          1779     1833      +54
- Misses         577      603      +26
Force-pushed from 17b882a to 544b10b
Force-pushed from 3a43f29 to 84276a2
cookbooks/aws-parallelcluster-slurm/resources/block_topology/partial/_block_topology_common.rb
owner 'root'
group 'root'
mode '0644'
variables(is_amazon_linux_2: platform?('amazon') && node['platform_version'] == "2")
We are injecting this variable to skip the inclusion of the topology configuration in slurm.conf. What about having a more generic is_topology_plugin_supported rather than an OS-specific variable? The advantage of the generic approach is that it is more open to future extensions. For example, if tomorrow we must skip it for another OS, we will not need another variable dedicated to that OS; we will simply include the OS in the condition.
I will update it and keep it as is_block_topology_plugin_supported. If we add more plugins in the future, it is not necessarily the case that AL2 will be unsupported for them.
...cluster-slurm/templates/default/slurm/block_topology/slurm_parallelcluster_topology.conf.erb
cookbooks/aws-parallelcluster-slurm/templates/default/slurm/slurm.conf.erb
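For context, the wiring described in this PR can be sketched as rendered config fragments. This is an illustrative assumption based on the PR description, not the actual template contents:

```
# slurm.conf (rendered on every HeadNode): always include the generated topology file
include slurm_parallelcluster_topology.conf

# slurm_parallelcluster_topology.conf (rendered non-empty only for GB200 instances):
TopologyPlugin=topology/block
```

On non-GB200 instances the included file carries no topology directives, which is consistent with the empty `scontrol show topology` output reported in the tests below.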
… update: Updating to use correct parameter name of block_sizes as per CLI changes
continue

# Check if the reservation is for NVLink and the size matches min_block_size_list
if compute_resource_config.get("InstanceType") == "p6e-gb200.36xlarge":
Why just this instance type? What about .72xlarge? Is there any way to determine these more programmatically?
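One way to address this review comment is to match on the instance family rather than hard-coding a single instance type, so that other sizes in the same family (e.g. a .72xlarge variant) are picked up automatically. A minimal sketch; the helper name `is_nvlink_instance` and the family set are assumptions for illustration, not code from this PR:

```python
def is_nvlink_instance(instance_type: str) -> bool:
    """Return True for instance families that expose NVLink block topology.

    Matching on the family prefix (the part before the size suffix) avoids
    hard-coding every size of the family.
    """
    # Families assumed to use GB200 NVLink domains; extend as new ones appear.
    nvlink_families = {"p6e-gb200"}
    family = instance_type.split(".")[0]
    return family in nvlink_families
```

A set lookup keeps the condition open to future families without touching the call sites.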
Description of changes

pcluster_topology_generator.py is used for creating the topology.conf file on the HeadNode. The p6egb200_block_sizes node attribute is added only for GB200 instances. We create slurm_parallelcluster_topology.conf, which is included as part of slurm.conf and is created on any HeadNode instance; however, we enable TopologyPlugin=topology/block only for GB200 instances.

Tests

* AL2023: creation is successful with p6egb200_block_sizes: 1, and the scheduling section was also updated to use 1 static node.
* AL2023: scontrol show topology doesn't return anything (it returns an empty line).
* AL2: Force-updated the same cluster to use ExtraChefAttributes: | {"cluster": {"p6egb200_block_sizes": "2" }} and updated the scheduling section to use 2 nodes. Result: none of the conf files were created.

STATUS: DRAFT, as I have commits added for testing.
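The topology file described above uses Slurm's topology/block plugin keys (BlockName, Nodes, BlockSizes). The generation step can be sketched as follows; the node-naming scheme and the function name are illustrative assumptions, not the PR's actual implementation in pcluster_topology_generator.py:

```python
def render_block_topology(queue: str, compute_resource: str, node_count: int, block_sizes: str) -> str:
    """Render topology.conf lines for Slurm's topology/block plugin.

    block_sizes is the comma-separated BlockSizes list, e.g. "18,36"
    (derived from the p6egb200_block_sizes cluster attribute).
    """
    # ParallelCluster-style static node names: <queue>-st-<compute-resource>-<index>
    nodes = f"{queue}-st-{compute_resource}-[1-{node_count}]"
    lines = [
        f"BlockName=Block1 Nodes={nodes}",
        f"BlockSizes={block_sizes}",
    ]
    return "\n".join(lines) + "\n"
```

Keeping BlockSizes on its own line matches the topology.conf format expected by topology/block.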
References
aws/aws-parallelcluster#6928
#3001
#2996
aws/aws-parallelcluster#6930
Checklist

* If you are creating a patch for a branch other than develop, add the branch name as prefix in the PR title (e.g. [release-3.6]).

Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.