@tengomucho tengomucho commented Sep 23, 2025

Issue #5273

  • transformers: 4.51.3
  • torch: 2.7.1
  • diffusers: 0.35.1
  • peft: 0.17.0

Note:

  • If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the right.

  • All PRs are checked weekly for staleness. This PR will be closed if not updated in 30 days.

Description

Tests Run

By default, docker image builds and tests are disabled. Two ways to run builds and tests:

  1. Using dlc_developer_config.toml
  2. Using this PR description (currently only supported for PyTorch, TensorFlow, vllm, and base images)
How to use the helper utility for updating dlc_developer_config.toml

Assuming your remote is called origin (you can find out more with git remote -v)...

  • Run default builds and tests for a particular buildspec - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -cp origin

  • Enable specific tests for a buildspec or set of buildspecs - also commits and pushes changes to remote; Example:

python src/prepare_dlc_dev_environment.py -b </path/to/buildspec.yml> -t sanity_tests -cp origin

  • Restore TOML file when ready to merge

python src/prepare_dlc_dev_environment.py -rcp origin

NOTE: If you are creating a PR for a new framework version, please ensure success of the local, standard, rc, and efa sagemaker tests by updating the dlc_developer_config.toml file:

  • sagemaker_remote_tests = true
  • sagemaker_efa_tests = true
  • sagemaker_rc_tests = true
  • sagemaker_local_tests = true
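
As a rough illustration of the note above, the following sketch checks that the four SageMaker flags are set to true in a TOML text. It is a plain line scan, not a real TOML parser, and the function name `missing_flags` is hypothetical; it is not part of the repository tooling.

```python
import re

# The four flags listed in the note above.
REQUIRED_FLAGS = (
    "sagemaker_remote_tests",
    "sagemaker_efa_tests",
    "sagemaker_rc_tests",
    "sagemaker_local_tests",
)


def missing_flags(toml_text: str) -> list:
    """Return the required flags that are not set to `true` (hypothetical helper)."""
    enabled = set()
    for line in toml_text.splitlines():
        # Matches lines such as "sagemaker_efa_tests = true"; section headers are ignored.
        match = re.match(r"^\s*(\w+)\s*=\s*true\b", line)
        if match:
            enabled.add(match.group(1))
    return [flag for flag in REQUIRED_FLAGS if flag not in enabled]
```

An empty list from `missing_flags(text)` would mean all four flags are enabled in the given text.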
How to use PR description

Use the code block below to uncomment commands and run the PR CodeBuild jobs. There are two commands available:
  • # /buildspec <buildspec_path>
    • e.g.: # /buildspec pytorch/training/buildspec.yml
    • If this line is commented out, dlc_developer_config.toml will be used.
  • # /tests <test_list>
    • e.g.: # /tests sanity security ec2
    • If this line is commented out, it will run the default set of tests (same as the defaults in dlc_developer_config.toml): sanity, security, ec2, ecs, eks, sagemaker, sagemaker-local.
# /buildspec <buildspec_path>
# /tests <test_list>
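
The behaviour described above (commented-out commands fall back to defaults) can be sketched as follows. This is an assumption about how such a description might be interpreted, not the actual CI implementation; `parse_pr_commands` and `DEFAULT_TESTS` are hypothetical names, and the default list is copied from the template text.

```python
import re

# Default test list quoted in the PR template above.
DEFAULT_TESTS = ["sanity", "security", "ec2", "ecs", "eks", "sagemaker", "sagemaker-local"]

# Matches an uncommented command such as "/tests sanity ec2".
_COMMAND = re.compile(r"^/(buildspec|tests)\s+(.+?)\s*$")


def parse_pr_commands(description: str) -> dict:
    """Hypothetical parser: return the buildspec path (or None) and the test list."""
    result = {"buildspec": None, "tests": list(DEFAULT_TESTS)}
    for line in description.splitlines():
        match = _COMMAND.match(line.strip())
        if match is None:
            continue  # commented-out ("# /tests ...") and unrelated lines are ignored
        name, value = match.groups()
        if name == "buildspec":
            result["buildspec"] = value
        else:
            result["tests"] = value.split()
    return result
```

In this sketch, a `buildspec` value of `None` stands for "use dlc_developer_config.toml", and an untouched `tests` list stands for the default set.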

Formatting

PR Checklist

  • I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron/graviton] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
  • If the PR changes affect SM tests, I've modified dlc_developer_config.toml in my PR branch by setting sagemaker_tests = true and efa_tests = true
  • If this PR changes existing code, the change is fully backward compatible with pre-existing code. (Non backward-compatible changes need special approval.)
  • (If applicable) I've documented below the DLC image/dockerfile this relates to
  • (If applicable) I've documented below the tests I've run on the DLC image
  • (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

  • (If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
  • (If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
  • (If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
  • (If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@ahsan-z-khan ahsan-z-khan requested a review from a team as a code owner September 29, 2025 19:21
"tensorboard>=2.11.0" \
"numpy>=1.24.3,<=1.25.2" \
"numba==0.58.1" \
"Pillow==10.3.0" \
Collaborator

Pillow is already being installed earlier in the Dockerfile too.




# ------------------------------------------------------------
Collaborator

Why do we need to separate this? We can move all the ARG instructions to the top and combine the pip install commands.

Author

The reason is that the SDK image had not been published when I submitted the PR; see the comments here: https://github.com/tengomucho/deep-learning-containers/blob/f25c8ecb5ade64b78985628e262b1d012be6283d/huggingface/pytorch/training/docker/2.7/py3/sdk2.24.1/Dockerfile.neuronx#L1.
Let me use that image now that it is available; it will make the Dockerfile leaner.

# Pin numpy to version required by neuronx-cc
# Update Pillow, urllib, wandb versions to fix high and critical vulnerabilities
RUN pip install -U \
"tensorboard>=2.11.0" \
Collaborator

tensorboard is also duplicated. Please combine all of them.
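
One way to spot the duplicates mentioned in the review comments is a small script that collects package names across all `pip install` commands in a Dockerfile. The sketch below is only an illustration: the `SAMPLE` text is a made-up excerpt that reuses the requirement strings visible in the diff, the helper names are hypothetical, and the parsing deliberately ignores cases such as `-r requirements.txt`.

```python
import re
from collections import Counter

# Leading package name of a requirement such as "numpy>=1.24.3,<=1.25.2".
_NAME = re.compile(r"^[A-Za-z0-9._-]+")

# Made-up Dockerfile excerpt for illustration only.
SAMPLE = '''RUN pip install -U \\
    "tensorboard>=2.11.0" \\
    "Pillow==10.3.0"
RUN pip install -U \\
    "tensorboard>=2.11.0" \\
    "numpy>=1.24.3,<=1.25.2" \\
    "pillow==10.3.0"
'''


def pip_packages(dockerfile_text: str) -> list:
    """Collect normalized package names from `pip install` commands."""
    # Join backslash-continued lines so each RUN instruction is one logical line.
    joined = re.sub(r"\\[ \t]*\n", " ", dockerfile_text)
    names = []
    for line in joined.splitlines():
        if line.lstrip().startswith("#") or "pip install" not in line:
            continue
        for arg in line.split("pip install", 1)[1].split():
            if arg in ("&&", ";"):
                break  # stop at the next chained shell command
            arg = arg.strip("\"'")
            if arg.startswith("-"):
                continue  # skip flags such as -U
            match = _NAME.match(arg)
            if match:
                # Normalize case and separators so "Pillow" and "pillow" compare equal.
                names.append(re.sub(r"[-_.]+", "-", match.group(0)).lower())
    return names


def duplicated(dockerfile_text: str) -> list:
    """Return package names that appear in more than one place, sorted."""
    counts = Counter(pip_packages(dockerfile_text))
    return sorted(name for name, n in counts.items() if n > 1)


print(duplicated(SAMPLE))  # ['pillow', 'tensorboard']
```

Any name reported here is a candidate for merging into a single `pip install` command, as requested in the review.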

# To fix that, we are downgrading networkx to 2.6.3
RUN pip install -U "networkx==2.6.3"

RUN apt-get update \
Collaborator

Do all APT installs at the beginning.

@ahsan-z-khan ahsan-z-khan changed the title from "[HuggingFace][Neuronx] Training - DLC for Optimum-neuron 0.3.0 - Neuron SDK 2.24.0 PyTorch 2.7.1 - Transformers 4.51.3" to "[HuggingFace][Neuronx] Training - DLC for Optimum-neuron 0.3.0 - Neuron SDK 2.24.1 PyTorch 2.7.1 - Transformers 4.51.3" Oct 2, 2025
@tengomucho tengomucho force-pushed the update-hf-pt2.7-sdk2.24-trn branch from 038d1d1 to 64854ad Compare October 3, 2025 14:47
Collaborator

Can you explain what this is doing?

@aws-deep-learning-containers-ci aws-deep-learning-containers-ci bot added the authorized, build, huggingface, sagemaker_tests, Size:L, and test labels Oct 8, 2025
Remove previous change
fix test_stray_files failure and install pytorch-lightning
Change numpy and torchvision versions
@arjraman arjraman changed the title from "[HuggingFace][Neuronx] Training - DLC for Optimum-neuron 0.3.0 - Neuron SDK 2.24.1 PyTorch 2.7.1 - Transformers 4.51.3" to "[HuggingFace][Neuronx] Training - DLC for Optimum-neuron 0.3.0 - Neuron SDK 2.24.1 PyTorch 2.7.0 - Transformers 4.51.3" Oct 9, 2025
@arjraman arjraman changed the title from "[HuggingFace][Neuronx] Training - DLC for Optimum-neuron 0.3.0 - Neuron SDK 2.24.1 PyTorch 2.7.0 - Transformers 4.51.3" to "[HuggingFace][Neuronx] Training - DLC for Optimum-neuron 0.3.0 - Neuron SDK 2.24.1 PyTorch 2.7.1 - Transformers 4.51.3" Oct 9, 2025
@aws-deep-learning-containers-ci aws-deep-learning-containers-ci bot added the Size:XL label Oct 9, 2025
@junpuf junpuf removed sanity labels Oct 9, 2025
Make test use ml.trn1.32xlarge instance type
Labels
authorized, build, huggingface, sagemaker_tests, Size:XL, test, unauthorized