
Conversation


@hehe7318 hehe7318 commented Aug 1, 2025

Description of changes

  • Add ultraserver instance support with capacity block validation

    • Add CapacityBlockHealthStatusValidator for UltraServer instance check.
      • The validator also collects the capacity blocks that are in one of the CAPACITY_BLOCK_INACTIVE_STATES (for example, state scheduled). In this scenario we emit a warning reporting the ids and the states.
    • Add describe_capacity_block_status api in ec2.py
    • Add logic to collect all ultraserver instance capacity block ids in a dict, key -> ULTRASERVER_INSTANCE_PREFIX, value -> list of capacity block ids
    • Add ULTRASERVER_INSTANCE_PREFIX_LIST and ULTRASERVER_CAPACITY_BLOCK_ALLOWED_SIZE_DICT constants
    • Implement capacity block size validation for p6e-gb200 instances (9, 18 nodes)
    • Add ultraserver-specific template generation and validation logic
    • Move get_instance_type_and_reservation_type_from_capacity_reservation to ec2.py
  • Enhance template and configuration handling

    • Add capacity block support in cluster and queue stack templates
      • Pass ultraserver capacity block sizes to the HeadNode dna.json. For p6e-gb200, the value is "9", "18", or "9, 18".
      • Add has_ultraserver_instance to detect whether the queue stack compute resources include an ultraserver instance
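The "dict of capacity block ids keyed by ultraserver instance prefix" described above can be sketched roughly as follows; the function name and input shape are illustrative assumptions, not the PR's actual code:

```python
def group_capacity_blocks_by_prefix(reservations, prefixes):
    """Group capacity block ids by the ultraserver instance prefix they target.

    reservations: iterable of (reservation_id, instance_type) pairs.
    prefixes: e.g. ULTRASERVER_INSTANCE_PREFIX_LIST = ["p6e-gb200"].
    """
    result = {}
    for reservation_id, instance_type in reservations:
        for prefix in prefixes:
            if instance_type.startswith(prefix):
                # key -> ultraserver instance prefix, value -> list of block ids
                result.setdefault(prefix, []).append(reservation_id)
    return result

blocks = group_capacity_blocks_by_prefix(
    [("cr-1111", "p6e-gb200.36xlarge"), ("cr-2222", "c5.large")],
    ["p6e-gb200"],
)
```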

Tests

  • Add ultraserver capacity block test suites
  • Add ultraserver cluster and queue stack tests
  • Fix test mocks to return proper CapacityReservationInfo objects
  • Add describe_capacity_reservations method to dummy EC2 client
  • Update existing tests for refactored capacity reservation function

hehe7318 added 2 commits July 29, 2025 17:24
…servation handling

- Add ultraserver instance support with capacity block validation
  - Add ULTRASERVER_INSTANCE_PREFIX_LIST and ULTRASERVER_CAPACITY_BLOCK_ALLOWED_SIZE_DICT constants
  - Implement capacity block size validation for p6e-gb200 instances (9, 18 nodes)
  - Add ultraserver-specific template generation and validation logic

- Refactor capacity reservation function organization
  - Move get_instance_type_and_reservation_type_from_capacity_reservation from utils.py to ec2.py
  - Convert to Ec2Client method to avoid circular imports and improve code organization
  - Update all imports and function calls across codebase to use new location
  - Leverage existing capacity_reservations_cache for better performance

- Enhance template and configuration handling
  - Add capacity block support in cluster and queue stack templates
  - Update CDK builder utils with ultraserver-specific resource naming
  - Extend cluster config validation for ultraserver instances

- tests
  - Add ultraserver capacity block test suites
  - Add ultraserver cluster and queue stack tests
  - Fix test mocks to return proper CapacityReservationInfo objects
  - Add describe_capacity_reservations method to dummy EC2 client
  - Update existing tests for refactored capacity reservation function
@hehe7318 hehe7318 requested a review from a team as a code owner August 1, 2025 19:05
@hehe7318 hehe7318 added the 3.x label Aug 1, 2025
for page in page_iterator:
    statuses.extend(page.get("CapacityBlockStatuses", []))

return {"CapacityBlockStatuses": statuses}
Contributor:

[major] The function accepts multiple capacity block ids (great!), but it returns a dictionary that does not let the caller understand which status maps to which reservation.
The caller expects the function to return something about statuses. It is redundant to return a dictionary where the results are mapped to a key CapacityBlockStatuses. It's like having a function sum(a, b) that, instead of returning a+b, returns {"sumResult": a+b}.

Contributor (Author):

The boto3 describe_capacity_block_status api call returns a dict: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2/client/describe_capacity_block_status.html#

Do we want to change the default behavior of this api call?

Contributor (Author):

I got your point. The default behavior returns a dict because it has the key NextToken, but we have already handled that with the paginator. Okay, I agree; I will let it directly return a list.

Contributor (Author):

Done
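The agreed-upon change can be sketched like this; the function name is illustrative, and only the flattening behavior (return the list directly, no wrapping dict) is taken from the discussion:

```python
def collect_capacity_block_statuses(page_iterator):
    """Flatten CapacityBlockStatuses entries from a boto3 paginator into a list."""
    statuses = []
    for page in page_iterator:
        statuses.extend(page.get("CapacityBlockStatuses", []))
    # Return the list directly; pagination (NextToken) is already handled
    # by the paginator, so no dict wrapper is needed.
    return statuses

# Fake pages standing in for the paginator's output:
fake_pages = [
    {"CapacityBlockStatuses": [{"CapacityBlockId": "cb-1"}]},
    {"CapacityBlockStatuses": [{"CapacityBlockId": "cb-2"}]},
]
statuses = collect_capacity_block_statuses(fake_pages)
```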

for queue in self.scheduling.queues:
    for compute_resource in queue.compute_resources:
        cr_target = compute_resource.capacity_reservation_target or queue.capacity_reservation_target
        if cr_target and cr_target.capacity_reservation_id:
Contributor:

what if the config specifies a resource group ARN rather than a reservation id?

Contributor (Author):

We only accept capacity blocks for ultraserver instances. ResourceGroupArn is not considered here.

Contributor:

Is this a limitation captured in our code? I.e., if we specify a capacity block together with CapacityReservationResourceGroupArn, does the validation fail? At least in the documentation it does not seem so: https://docs.aws.amazon.com/parallelcluster/latest/ug/Scheduling-v3.html#yaml-Scheduling-SlurmQueues-CapacityReservationTarget

Contributor (Author):

You mean specifying CapacityBlock and CapacityReservationResourceGroupArn at the same time?
I think they cannot be specified at the same time because CapacityReservationResourceGroupArn requires the Instances section. https://docs.aws.amazon.com/parallelcluster/latest/ug/launch-instances-odcr-v3.html

      ComputeResources:
        - ...
          Instances:
            - InstanceType: instance

But if it's CapacityBlock, you have to use the InstanceType section or leave it empty (optional). https://docs.aws.amazon.com/parallelcluster/latest/ug/launch-instances-capacity-blocks.html

   ComputeResources:
   - ...
     InstanceType: String (EC2 Instance type of the CB)

But even if it doesn't fail, why does it matter in our case? This function is to collect ultraserver_capacity_block_dict, and an ultraserver instance cannot be in a CapacityReservationResourceGroupArn, right?

Contributor:

The use case we are talking about is: a compute resource with CAPACITY_BLOCK, an instance type specified, and a resource group ARN rather than a reservation id.

I think we should support this scenario. If we do not support it, let's verify that there exists a validator to prevent it. If it does not exist, the next steps are:
1. If the team agrees on supporting the use case, let's address it in a follow-up PR.
2. If the team agrees on not supporting it, let's add a validator to prevent it in a follow-up PR.

Contributor (Author):

I had a sync with Giacomo. We are not going to support it because: "Capacity Reservation Groups are designed to work with standard Capacity Reservations, not with Capacity Blocks. They allow you to group and manage multiple Capacity Reservations together, but this functionality does not extend to Capacity Blocks."
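A minimal sketch of the resulting behavior, under the team decision that resource group ARNs are not supported with capacity blocks: targets carrying only a resource group ARN are skipped. The types here are simplified stand-ins, not the real config classes:

```python
from collections import namedtuple

# Simplified stand-in for the config's CapacityReservationTarget.
Target = namedtuple(
    "Target", ["capacity_reservation_id", "capacity_reservation_resource_group_arn"]
)

def collect_reservation_ids(targets):
    """Keep only targets that name a reservation id; targets that carry a
    resource group ARN (or nothing) are skipped, since capacity blocks
    cannot be grouped into Capacity Reservation Groups."""
    return [t.capacity_reservation_id for t in targets if t and t.capacity_reservation_id]

ids = collect_reservation_ids([
    Target("cr-1234", None),
    Target(None, "arn:aws:resource-groups:..."),  # skipped: resource group ARN only
    None,                                         # skipped: no target at all
])
```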

),
"launch_template_id": launch_template_id,
**(
{"p6e_gb200_capacity_block_sizes": cluster_ultraserver_capacity_block_sizes_dict["p6e-gb200"]}
Contributor (gmarciani, Aug 4, 2025):

We are injecting into dna.json a mapping between instance type prefix and block size? Shouldn't the cookbook know exactly what block size a specific reservation has?
Also, given that we are getting closer to the template size limit, what about making the key shorter? E.g., p6egb200_block_sizes?

Contributor (Author):

We are injecting into dna.json a mapping between instance type prefix and block size? Shouldn't the cookbook know exactly what block size a specific reservation has?

This is a valid point. Yes, we are injecting.
Regarding p6e-gb200 instances specifically, the block size values are constrained to [9, 18]. According to Himani, these values can be combined. So, when performing the injection, we have three possible approaches:

  • Single value "9"
  • Single value "18"
  • Combined value "9, 18"

Contributor (Author):

what about making the key shorter: Eg: p6egb200_block_sizes

Okay, done.
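A hedged sketch of the injection after the rename. The builder function is illustrative; only the key name (p6egb200_block_sizes) and the possible values ("9", "18", "9, 18") come from the discussion:

```python
def build_dna_block_sizes(sizes_dict):
    """Build the dna.json fragment for p6e-gb200 capacity block sizes.

    Returns {} when no p6e-gb200 block sizes were collected, so the key
    is only injected when needed (the template is close to its size limit).
    """
    prefix = "p6e-gb200"
    return (
        {"p6egb200_block_sizes": ", ".join(str(s) for s in sorted(sizes_dict[prefix]))}
        if sizes_dict.get(prefix)
        else {}
    )

single = build_dna_block_sizes({"p6e-gb200": [9]})
combined = build_dna_block_sizes({"p6e-gb200": [9, 18]})
empty = build_dna_block_sizes({})
```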

return capacity_reservations_per_az


class CapacityBlockHealthStatusValidator(Validator):
Contributor:

[major] When the cluster is created, the capacity block may be inactive and the status information may be unavailable. In this case, the validator should not fail because it is legit for the user to create the cluster when the capacity block is not active yet.

Contributor (Author):

I had a talk with Himani and we reached an agreement on this. According to the design doc, this validator is needed. I think for ultraserver instance, it's not like general instance types. Can we keep it for now and have a talk with Himani to reach a final decision?

Contributor:

I'm ok with returning a warning because it does not block the creation. However, if we return the warning, the message should be more user-friendly. Instead of "DescribeCapacityBlockStatus returned no entries for the provided Capacity Blocks; unable to verify health.", I'd say "Could not verify health status for capacity block {reservation_id}. Please check the capacity reservation.".

Contributor (Author):

Done:

  • Update CapacityBlockHealthStatusValidator to support inactive capacity block states
    • Add CAPACITY_BLOCK_INACTIVE_STATES constant for scheduled, payment-pending, assessing, delayed states
    • Change validation logic to check for good interconnect status ('ok') instead of bad states
    • Replace TotalUnavailableCapacity check with TotalInstanceCount vs AvailableInstanceCount comparison
    • Add fallback to describe_capacity_reservations for inactive capacity blocks
    • Issue warnings for inactive capacity blocks instead of errors
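The warning-vs-error split above can be sketched as follows. The state names come from the PR text; the classifier function itself is an illustrative assumption, not the validator's actual code:

```python
# States in which a capacity block legitimately has no health data yet.
CAPACITY_BLOCK_INACTIVE_STATES = {"scheduled", "payment-pending", "assessing", "delayed"}

def classify_capacity_block(state, interconnect_status, total_count, available_count):
    """Return 'warning' for legitimately inactive blocks, 'failure' for
    unhealthy active blocks, and 'ok' otherwise."""
    if state in CAPACITY_BLOCK_INACTIVE_STATES:
        # Legit at creation time: health will be checked once active.
        return "warning"
    if interconnect_status != "ok" or total_count != available_count:
        return "failure"
    return "ok"

w = classify_capacity_block("scheduled", None, 9, 0)
f = classify_capacity_block("active", "impaired", 9, 9)
ok = classify_capacity_block("active", "ok", 9, 9)
```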

Contributor:

Since this does not impact the update path, I'm ok with the error.

for size in unique_sizes:
    if size not in allowed_sizes_list:
        raise BadRequestException(
            f"The capacity block sizes for ultraserver instance {ultraserver_instance_prefix} are "
Contributor:

[major] What is the user experience in this case? Will the error surfaced to the users allow them to immediately know what capacity reservation has the unsupported block size?

Contributor (Author):

First, to clarify: when making a capacity block reservation for an ultraserver instance, you can only choose the block size from a fixed list of values; for example, the allowed p6e-gb200 block sizes are 9 and 18.

So usually this exception will not be raised. I added it just in case something changes in the future.

But if it is raised, it will tell the customer that their current capacity block sizes for {ultraserver_instance_prefix} (for example, p6e-gb200) are something like [9, 15, 18], while the allowed list is [9, 18]. So they can check their capacity blocks and find out why there is a 15.

Contributor:

What if the user has multiple CBs for the same instance type family and only one of them is misconfigured? In that case the user must go through every reservation to spot the problem. The best user experience would be to return an error message like "The capacity block {reservation_id} has invalid block size {detected_size}. Allowed values are {allowed_values}."

Is there any blocker to implementing this in this PR?

Contributor (Author):

Okay, got it, implementing.

Contributor (Author):

Done
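A sketch of the per-reservation error message the reviewer asked for, with all invalid reservations collected before raising. The constant, function name, and exception type are illustrative, not the PR's actual identifiers:

```python
# Assumed allowed-size table, mirroring ULTRASERVER_CAPACITY_BLOCK_ALLOWED_SIZE_DICT.
ALLOWED = {"p6e-gb200": [9, 18]}

def validate_block_sizes(prefix, size_by_reservation):
    """Collect every invalid reservation before raising, so the user sees
    exactly which capacity block has which unsupported size."""
    allowed = ALLOWED[prefix]
    errors = [
        f"The capacity block {rid} has invalid block size {size}. "
        f"Allowed values are {allowed}."
        for rid, size in size_by_reservation.items()
        if size not in allowed
    ]
    if errors:
        raise ValueError(" ".join(errors))

message = ""
try:
    validate_block_sizes("p6e-gb200", {"cr-good": 9, "cr-bad": 15})
except ValueError as e:
    message = str(e)
```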

compute_lt_nw_interfaces.append(
    ec2.CfnLaunchTemplate.NetworkInterfaceProperty(
        # Set device_index=0 for ultraserver instances or single-NIC network cards
        device_index=0 if network_card.maximum_network_interfaces() == 1 else 1,
Contributor:

device_index should be 0 only for odd-numbered device indexes. This is already covered by the existing logic because the odd device indexes have maximum_network_interfaces() == 1.

Contributor (Author):

Thank you! Got it. It's a bit different from Himani's design; did you two reach an agreement?

Contributor:

device_index should be 0 for only Odd numbered device indexes.

This is my understanding as well

Contributor (Author):

Removed.
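As I understand the rule settled on above, it reduces to the one-liner already in the diff: single-interface cards (the odd-numbered ones on these instance types) get device index 0, all others get 1. An illustrative sketch, not the actual CDK builder code:

```python
def device_index_for(max_network_interfaces):
    """Pick the launch-template device index for a network card.

    Cards whose maximum_network_interfaces() is 1 (odd-numbered cards on
    these instance types) get device_index 0; multi-interface cards get 1.
    """
    return 0 if max_network_interfaces == 1 else 1
```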

"""
Check if the compute resource uses ultraserver instances with capacity blocks.
Ultraserver instances (e.g., p6e-gb200) require special network interface configuration
Contributor:

This is not quite right. Only the odd indexed cards need the device_index = 0.

Contributor (Author):

Removed

(
    instance_type_from_capacity_reservation,
    _,
) = AWSApi.instance().ec2.get_instance_type_and_reservation_type_from_capacity_reservation(
Contributor:

What are the benefits of creating an entirely new function rather than just using/building upon the existing _instance_type_from_capacity_reservation() function? It seems like we are repeating pre-existing logic.

Contributor (Author):

  1. To support ultraserver instances, I need to also get the reservation_type to check whether it's "capacity-block".
  2. I need this function in other places to handle the fact that the capacity_reservation_target section can appear at both the queue level and the compute resource level.
  3. I also need this function when building the stack; if it stays where it was, there is a circular dependency issue.
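An illustrative sketch of what the relocated helper could return, based on point 1 above. The dict keys follow the shape of an EC2 DescribeCapacityReservations entry, but the function body here is an assumption, not the actual Ec2Client method:

```python
def get_instance_type_and_reservation_type(capacity_reservation):
    """Return both fields so callers can check for "capacity-block"
    in addition to resolving the instance type."""
    return (
        capacity_reservation.get("InstanceType"),
        capacity_reservation.get("ReservationType"),
    )

itype, rtype = get_instance_type_and_reservation_type(
    {"InstanceType": "p6e-gb200.36xlarge", "ReservationType": "capacity-block"}
)
```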

hehe7318 referenced this pull request in hgreebe/aws-parallelcluster Aug 4, 2025
…a list rather than a dict with key CapacityBlockStatuses.
…ve device index validation

- Update CapacityBlockHealthStatusValidator to support inactive capacity block states
  - Add CAPACITY_BLOCK_INACTIVE_STATES constant for scheduled, payment-pending, assessing, delayed states
  - Change validation logic to check for good interconnect status ('ok') instead of bad states
  - Replace TotalUnavailableCapacity check with TotalInstanceCount vs AvailableInstanceCount comparison
  - Add fallback to describe_capacity_reservations for inactive capacity blocks
  - Issue warnings for inactive capacity blocks instead of errors
- Remove device index validation from queue stack processing
- Update related unit tests to reflect new validation behavior
size = status.get("TotalCapacity")
if size is not None:
    if size not in allowed_sizes_list:
        raise BadRequestException(
Contributor:

[major] if you raise the exception here and you have multiple invalid statuses, you will only surface the first one to the user, whereas it could be useful to provide the full list of reservations with invalid status.

Contributor (Author):

You are right, done.

# This section collects capacity block sizes for ultraserver instances (e.g., p6e-gb200)
# and validates that they conform to allowed size configurations for Slurm topology
cluster_ultraserver_capacity_block_sizes_dict = {}
if self.config.scheduling.scheduler == "slurm":
Contributor:

[minor] Use the constants whenever it is possible.

Contributor (Author):

Done

PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_REVISION = 1
PCLUSTER_BUILD_IMAGE_CLEANUP_ROLE_BOOTSTRAP_TAG_KEY = "parallelcluster:build-image-cleanup-role-bootstrapped"

ULTRASERVER_INSTANCE_PREFIX_LIST = ["p6e-gb200"]
Contributor:

[codestyle] This instance type prefix is reused many times in this PR.
It may make sense to have a constant P6E_GB200=p6e-gb200 defined here and reused everywhere in the code.

Contributor (Author):

Done

# Not in inactive state, so it's a real error
self._add_failure(
    f"Unable to retrieve status for Capacity Block {cb_id}. "
    f"Please verify the Capacity Block ID is valid and accessible.",
Contributor:

[major] This case may occur when the user specifies a non-existing capacity block. I think we should surface a warning rather than an error because this case occurs if the user specifies a non-existing reservation and:

  1. in this case we already have a validator checking the existence of the reservation, so there is no reason to have two failures
  2. this failure can block the cluster update. Example: the user tries to remove the reservation from the config via cluster update because the reservation does not exist anymore.

Contributor (Author):

That doesn't convince me:

in this case we already have a validator checking the existance of the reservation, so no reason to have two failures

That's not true. We need it. See the link under State -> (string). Think about when a customer tries to create a cluster with a CB that is in an unsupported state: describe_capacity_block_status fails, so we enter this except block; describe_capacity_reservations succeeds, and the output is State -> unsupported. In this case we want to prevent the customer from creating the cluster because: The Capacity Reservation will not be delivered.

this failure can block the cluster update. Example: the user tries to remove the reservation from the config via clusert update because the reservation does not exist anymore.

Correct me if I am wrong. It cannot: on update, the CLI checks the updated config, which means that when the customer removes the CB from the config, the update will not check the previous one.

Contributor:

Same as above. If it does not impact the update, I'm ok with the error

# Report inactive capacity blocks as warnings
if inactive_blocks:
    self._add_failure(
        f"Cannot verify health status for inactive Capacity Blocks: {', '.join(inactive_blocks)}. "
Contributor:

[Minor] Capacity Block -> capacity block

Contributor (Author):

Done

if inactive_blocks:
    self._add_failure(
        f"Cannot verify health status for inactive Capacity Blocks: {', '.join(inactive_blocks)}. "
        f"Health status validation will be performed when the capacity blocks become active.",
Contributor:

Not necessary to specify that health status would be checked when the reservation is active.

Contributor (Author):

Done! Weird ^_^, I remember removing it.

(
    "One or more Capacity Blocks are not healthy or have insufficient capacity: "
    + "; ".join(unhealthy_details)
    + ". Please ensure each Capacity Block reports InterconnectStatus='ok' and "
Contributor:

I'd rephrase to "capacity block is healthy and all reserved capacity is available". The fact that we use a specific field to determine healthiness is an internal detail. If you disagree and want to keep it, let's at least inject the _GOOD_INTERCONNECT constant into the message.

Contributor (Author):

Makes sense, you convinced me. Done!

**(
    {"p6egb200_block_sizes": cluster_ultraserver_capacity_block_sizes_dict["p6e-gb200"]}
    if "p6e-gb200" in cluster_ultraserver_capacity_block_sizes_dict
    and cluster_ultraserver_capacity_block_sizes_dict["p6e-gb200"]
Contributor:

Can you add unit tests for checking this is generated as expected?
It would also be easier to understand the structure using unit tests

Contributor (himani2411, Aug 6, 2025):

Is there a reason why the name is p6egb200_block_sizes? Why not a generic name like topology_block_size or just block_size?

Contributor (Author):

Can you add unit tests for checking this is generated as expected?
It would also be easier to understand the structure using unit tests

I believe I already added it in test_ultraserver_cluster_stack.py.

Is there a reason why the name is p6egb200_block_sizes? Why not a generic name like topology_block_size or just block_size?

Because if other ultraserver instance types are introduced in the future, don't we want another section in dna.json?

…city_block_sizes. Add constant for code-style. Fix unit tests for new changes.
return capacity_reservations_per_az


class CapacityBlockHealthStatusValidator(Validator):
Contributor:

Since this does not impact the update path, I'm ok with the error.

# Not in inactive state, so it's a real error
self._add_failure(
    f"Unable to retrieve status for Capacity Block {cb_id}. "
    f"Please verify the Capacity Block ID is valid and accessible.",
Contributor:

Same as above. If it does not impact the update, I'm ok with the error

@hehe7318 hehe7318 changed the title [CLI][GB200] Add ultraserver instance(GB200) capacity block support [CLI][GB200] Add ultraserver instance(p6e-gb200) capacity block support Aug 6, 2025
@hehe7318 hehe7318 merged commit 8972fbf into aws:develop Aug 6, 2025
24 checks passed