Performance degradation on NCCL test with nccl-tests.Dockerfile based on NCCL 2.27.7 #892

@cyberchip-wang

Description

The latest version of nccl-tests.Dockerfile, based on NCCL 2.27.7, shows a severe performance degradation compared with the previous version based on NCCL 2.27.5. Please update the Dockerfile to a newer NCCL release so that it no longer uses 2.27.7.

The 10% performance degradation was observed when running on the same set of two P6-B200 nodes (16 GPUs in total) with AWS Batch/ECS. The tests used the images from https://gallery.ecr.aws/hpc-cloud/nccl-tests with the following tags:

cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4
cuda12.8.1-efa1.43.2-ofiv1.16.3-ncclv2.27.7-1-testsv2.16.9
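For anyone who wants to reproduce the comparison, here is a minimal sketch of how the relative drop between two runs could be computed. It assumes each run's log ends with the `# Avg bus bandwidth` summary line that nccl-tests prints; the function names are hypothetical, and the sample numbers are placeholders, not the measurements from this issue.

```python
import re

# Matches the summary line printed at the end of an nccl-tests run,
# e.g. "# Avg bus bandwidth    : 123.45" (line format is an assumption here).
_AVG_BUSBW = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9]*\.?[0-9]+)")


def avg_bus_bandwidth(log_text: str) -> float:
    """Return the average bus bandwidth (GB/s) reported in an nccl-tests log."""
    match = _AVG_BUSBW.search(log_text)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' line found in log")
    return float(match.group(1))


def degradation_percent(baseline: float, candidate: float) -> float:
    """Relative drop of candidate versus baseline, in percent (positive = slower)."""
    if baseline <= 0:
        raise ValueError("baseline bandwidth must be positive")
    return (baseline - candidate) / baseline * 100.0


if __name__ == "__main__":
    # Placeholder logs with made-up values, NOT the numbers from this issue.
    log_2_27_5 = "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 200.0\n"
    log_2_27_7 = "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 180.0\n"
    drop = degradation_percent(
        avg_bus_bandwidth(log_2_27_5), avg_bus_bandwidth(log_2_27_7)
    )
    print(f"bus bandwidth drop: {drop:.1f}%")
```

Comparing the average bus bandwidth of the same test binary, message-size range, and node pair keeps the NCCL version as the only variable between the two runs.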
