NVIDIA Resiliency Extension

The NVIDIA Resiliency Extension (NVRx) integrates multiple resiliency-focused solutions for PyTorch-based workloads. Its capabilities are modular, so users can integrate only the ones they need into their own infrastructure to maximize AI training productivity at scale. NVRx maximizes goodput by enabling system-wide health checks, quickly detecting faults at runtime, and resuming training automatically. It minimizes loss of work by enabling fast and frequent checkpointing.

For detailed documentation and usage information about each component, please refer to https://nvidia.github.io/nvidia-resiliency-ext/.

⚠️ NOTE: This project is still experimental and under active development. The code, features, and documentation are evolving rapidly. Please expect frequent updates and breaking changes. Contributions are welcome and we encourage you to watch for updates.

Core Components and Capabilities

- Fault Tolerance
  - Detecting hung ranks.
  - Restarting training in-job, without the need to reallocate SLURM nodes.
- In-Process Restarting
  - Detecting failures and enabling quick recovery.
- Async Checkpointing
  - Providing an efficient framework for asynchronous checkpointing.
- Local Checkpointing
  - Providing an efficient framework for local checkpointing.
- Straggler Detection
  - Monitoring GPU and CPU performance of ranks.
  - Identifying slower ranks that may impede overall training efficiency.
- Framework Integration
  - Facilitating fault tolerance and straggler detection integration with PyTorch Lightning-based workloads.
  - Providing integration with the NVIDIA NeMo framework, a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech).
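
To make the straggler detection idea above concrete, here is a minimal conceptual sketch in plain Python. It is not NVRx's API or its actual algorithm: the function name `find_stragglers`, the median-based reference, and the `threshold` value are all assumptions chosen for illustration. It scores each rank relative to the median step time and flags ranks that fall well below it.

```python
from statistics import median


def find_stragglers(step_times, threshold=0.75):
    """Return ranks whose relative performance score is below `threshold`.

    `step_times` maps rank -> mean step time in seconds (hypothetical input).
    A rank's score is median_time / rank_time, so a score near 1.0 means the
    rank keeps pace with its peers and a low score means it is slower.
    This is an illustrative heuristic, not the NVRx implementation.
    """
    if not step_times:
        return []
    reference = median(step_times.values())
    stragglers = []
    for rank, seconds in step_times.items():
        if seconds <= 0:
            continue  # ignore invalid measurements
        score = reference / seconds
        if score < threshold:
            stragglers.append(rank)
    return sorted(stragglers)


if __name__ == "__main__":
    # Rank 3 takes noticeably longer per step than the others.
    timings = {0: 1.00, 1: 1.02, 2: 0.98, 3: 1.60}
    print(find_stragglers(timings))
```

In a real job the timings would come from per-rank GPU and CPU measurements gathered across the training processes; refer to the NVRx documentation for the supported interfaces.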

Installation

From sources

git clone https://github.com/NVIDIA/nvidia-resiliency-ext
cd nvidia-resiliency-ext
pip install .

From PyPI wheel

pip install nvidia-resiliency-ext
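
After either installation route, one quick way to confirm that the package is visible to the current Python environment is to query its distribution metadata. This is a small sketch that uses only the standard library; the helper name `installed_version` is hypothetical, and the distribution name is taken from the `pip install` command above.

```python
from importlib import metadata


def installed_version(dist_name="nvidia-resiliency-ext"):
    """Return the installed version string of `dist_name`, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("nvidia-resiliency-ext is not installed in this environment")
    else:
        print(f"nvidia-resiliency-ext {version} is installed")
```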

Platform Support

Category	Supported Versions / Requirements
Architecture	x86_64, arm64
Operating System	Ubuntu 22.04, 24.04
Python Version	>= 3.10, < 3.13
PyTorch Version	>= 2.3.1 (in-job restart and checkpointing), >= 2.5.1 (in-process restart)
CUDA & CUDA Toolkit	>= 12.5 (12.8 required for GPU health check)
NVML Driver	>= 535 (570 required for GPU health check)
NCCL Version	>= 2.21.5 (in-job restart and checkpointing), >= 2.26.2 (in-process restart)
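
The Python and PyTorch rows of the table can be checked programmatically before launching a job. The following is a minimal sketch under stated assumptions: the helper names (`parse_version`, `python_supported`, `torch_supported`) are hypothetical, the version bounds are copied from the table above, and the parser handles only simple dotted version strings with an optional local suffix such as `+cu124`.

```python
import sys


def parse_version(text):
    """Convert a string such as '2.5.1+cu124' into a tuple like (2, 5, 1).

    Parsing stops at the first component that does not start with a digit,
    so pre-release or dev suffixes are ignored (a simplification).
    """
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def python_supported(version_info=sys.version_info):
    """Check the table's Python requirement: >= 3.10 and < 3.13."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


def torch_supported(torch_version, inprocess=False):
    """Check the table's PyTorch requirement for the chosen feature set."""
    minimum = (2, 5, 1) if inprocess else (2, 3, 1)
    return parse_version(torch_version) >= minimum


if __name__ == "__main__":
    print("Python supported:", python_supported())
```

With PyTorch installed, `torch_supported(torch.__version__, inprocess=True)` would apply the stricter in-process bound. The CUDA, driver, and NCCL rows are not covered by this sketch.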
