
Conversation


@hehe7318 hehe7318 commented Aug 1, 2025

Description of changes

  • Add ultraserver instance support with capacity block validation

    • Add CapacityBlockHealthStatusValidator for UltraServer instance check.
      • The validator also collects the capacity blocks that are in one of the CAPACITY_BLOCK_INACTIVE_STATES (for example, state scheduled). In this scenario we emit a warning reporting the ids and the states.
    • Add describe_capacity_block_status api in ec2.py
    • Add logic to collect all ultraserver instance capacity block ids in a dict, key -> ULTRASERVER_INSTANCE_PREFIX, value -> list of capacity block ids
    • Add ULTRASERVER_INSTANCE_PREFIX_LIST and ULTRASERVER_CAPACITY_BLOCK_ALLOWED_SIZE_DICT constants
    • Implement capacity block size validation for p6e-gb200 instances (9, 18 nodes)
    • Add ultraserver-specific template generation and validation logic
    • Move get_instance_type_and_reservation_type_from_capacity_reservation to ec2.py
  • Enhance template and configuration handling

    • Add capacity block support in cluster and queue stack templates
      • Pass ultraserver capacity block sizes to the HeadNode dna.json. For p6e-gb200, the value is "9", "18", or "9, 18".
      • Add has_ultraserver_instance to detect whether the queue stack compute resources include an ultraserver instance
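The "dict of capacity block ids keyed by ultraserver instance prefix" described above can be sketched roughly as follows; the function name and input shape are illustrative assumptions, not the PR's actual code:

```python
def group_capacity_blocks_by_prefix(reservations, prefixes):
    """Group capacity block ids by the ultraserver instance prefix they target.

    reservations: iterable of (reservation_id, instance_type) pairs.
    prefixes: e.g. ULTRASERVER_INSTANCE_PREFIX_LIST = ["p6e-gb200"].
    """
    result = {}
    for reservation_id, instance_type in reservations:
        for prefix in prefixes:
            if instance_type.startswith(prefix):
                # key -> ultraserver instance prefix, value -> list of block ids
                result.setdefault(prefix, []).append(reservation_id)
    return result

blocks = group_capacity_blocks_by_prefix(
    [("cr-1111", "p6e-gb200.36xlarge"), ("cr-2222", "c5.large")],
    ["p6e-gb200"],
)
```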

Tests

  • Add ultraserver capacity block test suites
  • Add ultraserver cluster and queue stack tests
  • Fix test mocks to return proper CapacityReservationInfo objects
  • Add describe_capacity_reservations method to dummy EC2 client
  • Update existing tests for refactored capacity reservation function

hehe7318 added 2 commits July 29, 2025 17:24
…servation handling

- Add ultraserver instance support with capacity block validation
  - Add ULTRASERVER_INSTANCE_PREFIX_LIST and ULTRASERVER_CAPACITY_BLOCK_ALLOWED_SIZE_DICT constants
  - Implement capacity block size validation for p6e-gb200 instances (9, 18 nodes)
  - Add ultraserver-specific template generation and validation logic

- Refactor capacity reservation function organization
  - Move get_instance_type_and_reservation_type_from_capacity_reservation from utils.py to ec2.py
  - Convert to Ec2Client method to avoid circular imports and improve code organization
  - Update all imports and function calls across codebase to use new location
  - Leverage existing capacity_reservations_cache for better performance

- Enhance template and configuration handling
  - Add capacity block support in cluster and queue stack templates
  - Update CDK builder utils with ultraserver-specific resource naming
  - Extend cluster config validation for ultraserver instances

- tests
  - Add ultraserver capacity block test suites
  - Add ultraserver cluster and queue stack tests
  - Fix test mocks to return proper CapacityReservationInfo objects
  - Add describe_capacity_reservations method to dummy EC2 client
  - Update existing tests for refactored capacity reservation function
@hehe7318 hehe7318 requested a review from a team as a code owner August 1, 2025 19:05
@hehe7318 hehe7318 added the 3.x label Aug 1, 2025
for page in page_iterator:
    statuses.extend(page.get("CapacityBlockStatuses", []))

return {"CapacityBlockStatuses": statuses}
Contributor:

[major] The function accepts multiple capacity block ids (great!), but it returns a dictionary that does not let the caller understand which status maps to which reservation.
The caller expects the function to return something about statuses. It is redundant to return a dictionary where the results are mapped to a key CapacityBlockStatuses. It's like having a function sum(a, b) that, instead of returning a+b, returns {"sumResult": a+b}.

Contributor (Author):

The boto3 describe_capacity_block_status api call returns a dict: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2/client/describe_capacity_block_status.html#

Do we want to change the default behavior of this api call?

Contributor (Author):

I got your point. The default behavior returns a dict because it has the key NextToken, but we have already handled that with the paginator. Okay, I agree; I will let it directly return a list.

Contributor (Author):

Done
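The agreed-upon change can be sketched like this; the function name is illustrative, and only the flattening behavior (return the list directly, no wrapping dict) is taken from the discussion:

```python
def collect_capacity_block_statuses(page_iterator):
    """Flatten CapacityBlockStatuses entries from a boto3 paginator into a list."""
    statuses = []
    for page in page_iterator:
        statuses.extend(page.get("CapacityBlockStatuses", []))
    # Return the list directly; pagination (NextToken) is already handled
    # by the paginator, so no dict wrapper is needed.
    return statuses

# Fake pages standing in for the paginator's output:
fake_pages = [
    {"CapacityBlockStatuses": [{"CapacityBlockId": "cb-1"}]},
    {"CapacityBlockStatuses": [{"CapacityBlockId": "cb-2"}]},
]
statuses = collect_capacity_block_statuses(fake_pages)
```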

for queue in self.scheduling.queues:
    for compute_resource in queue.compute_resources:
        cr_target = compute_resource.capacity_reservation_target or queue.capacity_reservation_target
        if cr_target and cr_target.capacity_reservation_id:
Contributor:

what if the config specifies a resource group ARN rather than a reservation id?

Contributor (Author):

We only accept capacity blocks for ultraserver instances. ResourceGroupArn is not considered here.

Contributor:

Is this a limitation captured in our code? I.e., if we specify a capacity block together with CapacityReservationResourceGroupArn, does the validation fail? At least in the documentation it does not seem so: https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-CapacityReservationTarget

Contributor (Author):

You mean specifying CapacityBlock and CapacityReservationResourceGroupArn at the same time?
I think they cannot be specified at the same time because CapacityReservationResourceGroupArn requires the Instances section. https://docs.aws.amazon.com/parallelcluster/latest/ug/launch-instances-odcr-v3.html

      ComputeResources:
        - ...
          Instances:
            - InstanceType: instance

But if it's CapacityBlock, you have to use the InstanceType section or leave it empty (optional). https://docs.aws.amazon.com/parallelcluster/latest/ug/launch-instances-capacity-blocks.html

   ComputeResources:
   - ...
     InstanceType: String (EC2 Instance type of the CB)

But even if it doesn't fail, why does it matter in our case? This function is to collect ultraserver_capacity_block_dict, and an ultraserver instance cannot be in a CapacityReservationResourceGroupArn, right?

Contributor:

The use case we are talking about is: a compute resource with CAPACITY_BLOCK, an instance type specified, and a resource group ARN rather than a reservation id.

I think we should support this scenario. If we do not support it, let's verify that there exists a validator to prevent it. If it does not exist, the next steps are:
1. If the team agrees on supporting the use case, let's address it in a follow-up PR.
2. If the team agrees on not supporting it, let's add a validator to prevent it in a follow-up PR.

Contributor (Author):

I had a sync with Giacomo. We are not going to support it because: "Capacity Reservation Groups are designed to work with standard Capacity Reservations, not with Capacity Blocks. They allow you to group and manage multiple Capacity Reservations together, but this functionality does not extend to Capacity Blocks."
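A minimal sketch of the resulting behavior, under the team decision that resource group ARNs are not supported with capacity blocks: targets carrying only a resource group ARN are skipped. The types here are simplified stand-ins, not the real config classes:

```python
from collections import namedtuple

# Simplified stand-in for the config's CapacityReservationTarget.
Target = namedtuple(
    "Target", ["capacity_reservation_id", "capacity_reservation_resource_group_arn"]
)

def collect_reservation_ids(targets):
    """Keep only targets that name a reservation id; targets that carry a
    resource group ARN (or nothing) are skipped, since capacity blocks
    cannot be grouped into Capacity Reservation Groups."""
    return [t.capacity_reservation_id for t in targets if t and t.capacity_reservation_id]

ids = collect_reservation_ids([
    Target("cr-1234", None),
    Target(None, "arn:aws:resource-groups:..."),  # skipped: resource group ARN only
    None,                                         # skipped: no target at all
])
```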

),
"launch_template_id": launch_template_id,
**(
{"p6e_gb200_capacity_block_sizes": cluster_ultraserver_capacity_block_sizes_dict["p6e-gb200"]}
Contributor (gmarciani, Aug 4, 2025):

We are injecting into dna.json a mapping between instance type prefix and block size? Shouldn't the cookbook know exactly what block size a specific reservation has?
Also, given that we are getting closer to the template size limit, what about making the key shorter? E.g., p6egb200_block_sizes?

Contributor (Author):

We are injecting into dna.json a mapping between instance type prefix and block size? Shouldn't the cookbook know exactly what block size a specific reservation has?

This is a valid point. Yes, we are injecting.
Regarding p6e-gb200 instances specifically, the block size values are constrained to [9, 18]. According to Himani, these values can be combined. So, when performing the injection, we have three possible approaches:

  • Single value "9"
  • Single value "18"
  • Combined value "9, 18"

Contributor (Author):

what about making the key shorter: Eg: p6egb200_block_sizes

Okay, done.
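A hedged sketch of the injection after the rename. The builder function is illustrative; only the key name (p6egb200_block_sizes) and the possible values ("9", "18", "9, 18") come from the discussion:

```python
def build_dna_block_sizes(sizes_dict):
    """Build the dna.json fragment for p6e-gb200 capacity block sizes.

    Returns {} when no p6e-gb200 block sizes were collected, so the key
    is only injected when needed (the template is close to its size limit).
    """
    prefix = "p6e-gb200"
    return (
        {"p6egb200_block_sizes": ", ".join(str(s) for s in sorted(sizes_dict[prefix]))}
        if sizes_dict.get(prefix)
        else {}
    )

single = build_dna_block_sizes({"p6e-gb200": [9]})
combined = build_dna_block_sizes({"p6e-gb200": [9, 18]})
empty = build_dna_block_sizes({})
```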

return capacity_reservations_per_az


class CapacityBlockHealthStatusValidator(Validator):
Contributor:

[major] When the cluster is created, the capacity block may be inactive and the status information may be unavailable. In this case, the validator should not fail because it is legit for the user to create the cluster when the capacity block is not active yet.

Contributor (Author):

I had a talk with Himani and we reached an agreement on this. According to the design doc, this validator is needed. I think for ultraserver instance, it's not like general instance types. Can we keep it for now and have a talk with Himani to reach a final decision?

Contributor:

I'm ok with returning a warning because it does not block the creation. However, if we return the warning, the message should be more user-friendly. Instead of "DescribeCapacityBlockStatus returned no entries for the provided Capacity Blocks; unable to verify health.", I'd say "Could not verify health status for capacity block {reservation_id}. Please check the capacity reservation.".

Contributor (Author):

Done:

  • Update CapacityBlockHealthStatusValidator to support inactive capacity block states
    • Add CAPACITY_BLOCK_INACTIVE_STATES constant for scheduled, payment-pending, assessing, delayed states
    • Change validation logic to check for good interconnect status ('ok') instead of bad states
    • Replace TotalUnavailableCapacity check with TotalInstanceCount vs AvailableInstanceCount comparison
    • Add fallback to describe_capacity_reservations for inactive capacity blocks
    • Issue warnings for inactive capacity blocks instead of errors
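The warning-vs-error split above can be sketched as follows. The state names come from the PR text; the classifier function itself is an illustrative assumption, not the validator's actual code:

```python
# States in which a capacity block legitimately has no health data yet.
CAPACITY_BLOCK_INACTIVE_STATES = {"scheduled", "payment-pending", "assessing", "delayed"}

def classify_capacity_block(state, interconnect_status, total_count, available_count):
    """Return 'warning' for legitimately inactive blocks, 'failure' for
    unhealthy active blocks, and 'ok' otherwise."""
    if state in CAPACITY_BLOCK_INACTIVE_STATES:
        # Legit at creation time: health will be checked once active.
        return "warning"
    if interconnect_status != "ok" or total_count != available_count:
        return "failure"
    return "ok"

w = classify_capacity_block("scheduled", None, 9, 0)
f = classify_capacity_block("active", "impaired", 9, 9)
ok = classify_capacity_block("active", "ok", 9, 9)
```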

Contributor:

Since this does not impact the update path, I'm ok with the error.

for size in unique_sizes:
    if size not in allowed_sizes_list:
        raise BadRequestException(
            f"The capacity block sizes for ultraserver instance {ultraserver_instance_prefix} are "
Contributor:

[major] What is the user experience in this case? Will the error surfaced to the users allow them to immediately know what capacity reservation has the unsupported block size?

Contributor (Author):

First, to clarify: when making a capacity block reservation for an ultraserver instance, you can only choose the block size from a fixed list of values; for example, the allowed p6e-gb200 block sizes are 9 and 18.

So usually this exception will not be raised. I added it just in case something changes in the future.

But if it is raised, it will tell the customer that their current capacity block sizes for {ultraserver_instance_prefix} (for example, p6e-gb200) are something like [9, 15, 18], while the allowed list is [9, 18]. So they can check their capacity blocks and find out why there is a 15.

Contributor:

What if the user has multiple CBs for the same instance type family and only one of them is misconfigured? In that case the user must go through every reservation to spot the problem. The best user experience would be to return an error message like "The capacity block {reservation_id} has invalid block size {detected_size}. Allowed values are {allowed_values}."

Is there any blocker to implementing this in this PR?

Contributor (Author):

Okay, got it, implementing.

Contributor (Author):

Done
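A sketch of the per-reservation error message the reviewer asked for, with all invalid reservations collected before raising. The constant, function name, and exception type are illustrative, not the PR's actual identifiers:

```python
# Assumed allowed-size table, mirroring ULTRASERVER_CAPACITY_BLOCK_ALLOWED_SIZE_DICT.
ALLOWED = {"p6e-gb200": [9, 18]}

def validate_block_sizes(prefix, size_by_reservation):
    """Collect every invalid reservation before raising, so the user sees
    exactly which capacity block has which unsupported size."""
    allowed = ALLOWED[prefix]
    errors = [
        f"The capacity block {rid} has invalid block size {size}. "
        f"Allowed values are {allowed}."
        for rid, size in size_by_reservation.items()
        if size not in allowed
    ]
    if errors:
        raise ValueError(" ".join(errors))

message = ""
try:
    validate_block_sizes("p6e-gb200", {"cr-good": 9, "cr-bad": 15})
except ValueError as e:
    message = str(e)
```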

compute_lt_nw_interfaces.append(
    ec2.CfnLaunchTemplate.NetworkInterfaceProperty(
        # Set device_index=0 for ultraserver instances or single-NIC network cards
        device_index=0 if network_card.maximum_network_interfaces() == 1 else 1,
Contributor:

device_index should be 0 only for odd-numbered device indexes. This is already covered by the existing logic because the odd device indexes have maximum_network_interfaces() == 1.

Contributor (Author):

Thank you! Got it. It's a bit different from Himani's design; did you two reach an agreement?

Contributor:

device_index should be 0 for only Odd numbered device indexes.

This is my understanding as well

Contributor (Author):

Removed.
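As I understand the rule settled on above, it reduces to the one-liner already in the diff: single-interface cards (the odd-numbered ones on these instance types) get device index 0, all others get 1. An illustrative sketch, not the actual CDK builder code:

```python
def device_index_for(max_network_interfaces):
    """Pick the launch-template device index for a network card.

    Cards whose maximum_network_interfaces() is 1 (odd-numbered cards on
    these instance types) get device_index 0; multi-interface cards get 1.
    """
    return 0 if max_network_interfaces == 1 else 1
```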

"""
Check if the compute resource uses ultraserver instances with capacity blocks.
Ultraserver instances (e.g., p6e-gb200) require special network interface configuration
Contributor:

This is not quite right. Only the odd indexed cards need the device_index = 0.

Contributor (Author):

Removed

(
    instance_type_from_capacity_reservation,
    _,
) = AWSApi.instance().ec2.get_instance_type_and_reservation_type_from_capacity_reservation(
Contributor:

What are the benefits of creating an entirely new function rather than just using/building upon the existing _instance_type_from_capacity_reservation() function? It seems like we are repeating pre-existing logic.

Contributor (Author):

  1. To support ultraserver instances, I need to also get the reservation_type to check whether it's "capacity-block".
  2. I need this function in other places to handle the fact that the capacity_reservation_target section can appear at both the queue level and the compute resource level.
  3. I also need this function when building the stack; if it stays where it was, there is a circular dependency issue.
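An illustrative sketch of what the relocated helper could return, based on point 1 above. The dict keys follow the shape of an EC2 DescribeCapacityReservations entry, but the function body here is an assumption, not the actual Ec2Client method:

```python
def get_instance_type_and_reservation_type(capacity_reservation):
    """Return both fields so callers can check for "capacity-block"
    in addition to resolving the instance type."""
    return (
        capacity_reservation.get("InstanceType"),
        capacity_reservation.get("ReservationType"),
    )

itype, rtype = get_instance_type_and_reservation_type(
    {"InstanceType": "p6e-gb200.36xlarge", "ReservationType": "capacity-block"}
)
```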

hehe7318 referenced this pull request in hgreebe/aws-parallelcluster Aug 4, 2025
…a list rather than a dict with key CapacityBlockStatuses.
…ve device index validation

- Update CapacityBlockHealthStatusValidator to support inactive capacity block states
  - Add CAPACITY_BLOCK_INACTIVE_STATES constant for scheduled, payment-pending, assessing, delayed states
  - Change validation logic to check for good interconnect status ('ok') instead of bad states
  - Replace TotalUnavailableCapacity check with TotalInstanceCount vs AvailableInstanceCount comparison
  - Add fallback to describe_capacity_reservations for inactive capacity blocks
  - Issue warnings for inactive capacity blocks instead of errors
- Remove device index validation from queue stack processing
- Update related unit tests to reflect new validation behavior
size = status.get("TotalCapacity")
if size is not None:
    if size not in allowed_sizes_list:
        raise BadRequestException(
Contributor:

[major] if you raise the exception here and you have multiple invalid statuses, you will only surface the first one to the user, whereas it could be useful to provide the full list of reservations with invalid status.

Contributor (Author):

You are right, done.

# This section collects capacity block sizes for ultraserver instances (e.g., p6e-gb200)
# and validates that they conform to allowed size configurations for Slurm topology
cluster_ultraserver_capacity_block_sizes_dict = {}
if self.config.scheduling.scheduler == "slurm":
Contributor:

[minor] Use the constants whenever it is possible.

Contributor (Author):

Done

PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_REVISION = 1
PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_BOOTSTRAP_TAG_KEY = "parallelcluster:build-image-cleanup-role-bootstrapped"

ULTRASERVER_INSTANCE_PREFIX_LIST = ["p6e-gb200"]
Contributor:

[codestyle] This instance type prefix is reused many times in this PR.
It may make sense to have a constant P6E_GB200=p6e-gb200 defined here and reused everywhere in the code.

Contributor (Author):

Done

# Not in inactive state, so it's a real error
self._add_failure(
    f"Unable to retrieve status for Capacity Block {cb_id}. "
    f"Please verify the Capacity Block ID is valid and accessible.",
Contributor:

[major] This case may occur when the user specifies a non-existing capacity block. I think we should surface a warning rather than an error because this case occurs if the user specifies a non-existing reservation and:

  1. in this case we already have a validator checking the existence of the reservation, so there is no reason to have two failures
  2. this failure can block the cluster update. Example: the user tries to remove the reservation from the config via cluster update because the reservation does not exist anymore.

Contributor (Author):

That doesn't convince me:

in this case we already have a validator checking the existance of the reservation, so no reason to have two failures

That's not true. We need it. See the link under State -> (string). Think about when a customer tries to create a cluster with a CB that is in an unsupported state: describe_capacity_block_status fails, so we enter this except block; describe_capacity_reservations succeeds, and the output is State -> unsupported. In this case we want to prevent the customer from creating the cluster because: The Capacity Reservation will not be delivered.

this failure can block the cluster update. Example: the user tries to remove the reservation from the config via clusert update because the reservation does not exist anymore.

Correct me if I am wrong. It cannot: on update, the CLI checks the updated config, which means that when the customer removes the CB from the config, the update will not check the previous one.

Contributor:

Same as above. If it does not impact the update, I'm ok with the error

# Report inactive capacity blocks as warnings
if inactive_blocks:
    self._add_failure(
        f"Cannot verify health status for inactive Capacity Blocks: {', '.join(inactive_blocks)}. "
Contributor:

[Minor] Capacity Block -> capacity block

Contributor (Author):

Done

if inactive_blocks:
    self._add_failure(
        f"Cannot verify health status for inactive Capacity Blocks: {', '.join(inactive_blocks)}. "
        f"Health status validation will be performed when the capacity blocks become active.",
Contributor:

Not necessary to specify that health status would be checked when the reservation is active.

Contributor (Author):

Done! Weird ^_^, I remember removing it.

(
    "One or more Capacity Blocks are not healthy or have insufficient capacity: "
    + "; ".join(unhealthy_details)
    + ". Please ensure each Capacity Block reports InterconnectStatus='ok' and "
Contributor:

I'd rephrase to "capacity block is healthy and all reserved capacity is available". The fact that we use a specific field to determine healthiness is an internal detail. If you disagree and want to keep it, let's at least inject the _GOOD_INTERCONNECT constant into the message.

Contributor (Author):

Makes sense, you convinced me. Done!

**(
    {"p6egb200_block_sizes": cluster_ultraserver_capacity_block_sizes_dict["p6e-gb200"]}
    if "p6e-gb200" in cluster_ultraserver_capacity_block_sizes_dict
    and cluster_ultraserver_capacity_block_sizes_dict["p6e-gb200"]
Contributor:

Can you add unit tests for checking this is generated as expected?
It would also be easier to understand the structure using unit tests

Contributor (himani2411, Aug 6, 2025):

Is there a reason why the name is p6egb200_block_sizes? Why not a generic name like topology_block_size or just block_size?

Contributor (Author):

Can you add unit tests for checking this is generated as expected?
It would also be easier to understand the structure using unit tests

I believe I already added it in test_ultraserver_cluster_stack.py.

Is there a reason why the name is p6egb200_block_sizes? Why not a generic name like topology_block_size or just block_size?

Because if other ultraserver instance types are introduced in the future, don't we want another section in dna.json?

…city_block_sizes. Add constant for code-style. Fix unit tests for new changes.
return capacity_reservations_per_az


class CapacityBlockHealthStatusValidator(Validator):
Contributor:

Since this does not impact the update path, I'm ok with the error.

# Not in inactive state, so it's a real error
self._add_failure(
    f"Unable to retrieve status for Capacity Block {cb_id}. "
    f"Please verify the Capacity Block ID is valid and accessible.",
Contributor:

Same as above. If it does not impact the update, I'm ok with the error

@hehe7318 hehe7318 changed the title [CLI][GB200] Add ultraserver instance(GB200) capacity block support [CLI][GB200] Add ultraserver instance(p6e-gb200) capacity block support Aug 6, 2025
@hehe7318 hehe7318 merged commit 8972fbf into aws:develop Aug 6, 2025
24 checks passed