diff --git a/README.md b/README.md index fe816e3..dc227e1 100644 --- a/README.md +++ b/README.md @@ -26,197 +26,237 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. --> +# PyTorch (LibTorch) Backend + [![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause) -# PyTorch (LibTorch) Backend +The Triton backend for +[PyTorch](https://github.com/pytorch/pytorch) +is designed to run +[TorchScript](https://pytorch.org/docs/stable/jit.html) +models using the PyTorch C++ API. +All models created in PyTorch using the python API must be traced/scripted to produce a TorchScript model. + +You can learn more about Triton backends in the +[Triton Backend](https://github.com/triton-inference-server/backend) +repository. + +Ask questions or report problems using +[Triton Server issues](https://github.com/triton-inference-server/server/issues). -The Triton backend for [PyTorch](https://github.com/pytorch/pytorch). -You can learn more about Triton backends in the [backend -repo](https://github.com/triton-inference-server/backend). Ask -questions or report problems on the [issues -page](https://github.com/triton-inference-server/server/issues). -This backend is designed to run [TorchScript](https://pytorch.org/docs/stable/jit.html) -models using the PyTorch C++ API. All models created in PyTorch -using the python API must be traced/scripted to produce a TorchScript -model. - -Where can I ask general questions about Triton and Triton backends? -Be sure to read all the information below as well as the [general -Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server) -available in the main [server](https://github.com/triton-inference-server/server) -repo. If you don't find your answer there you can ask questions on the -main Triton [issues page](https://github.com/triton-inference-server/server/issues). +Be sure to read all the information below as well as the +[general Triton documentation](https://github.com/triton-inference-server/server#triton-inference-server) +available in the [Triton Server](https://github.com/triton-inference-server/server) repository. ## Build the PyTorch Backend -Use a recent cmake to build. First install the required dependencies. +Use a recent cmake to build. +First install the required dependencies. -``` -$ apt-get install rapidjson-dev python3-dev python3-pip -$ pip3 install patchelf==0.17.2 +```bash +apt-get install rapidjson-dev python3-dev python3-pip +pip3 install patchelf==0.17.2 ``` -An appropriate PyTorch container from [NGC](https://ngc.nvidia.com) must be used. -For example, to build a backend that uses the 23.04 version of the PyTorch -container from NGC: +An appropriate PyTorch container from [NVIDIA NGC Catalog](https://ngc.nvidia.com) must be used. +For example, to build a backend that uses the 23.04 version of the PyTorch container from NGC: -``` -$ mkdir build -$ cd build -$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_PYTORCH_DOCKER_IMAGE="nvcr.io/nvidia/pytorch:23.04-py3" .. -$ make install +```bash +mkdir build +cd build +cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_PYTORCH_DOCKER_IMAGE="nvcr.io/nvidia/pytorch:23.04-py3" .. +make install ``` -The following required Triton repositories will be pulled and used in -the build. By default, the "main" branch/tag will be used for each repo -but the listed CMake argument can be used to override. +The following required Triton repositories will be pulled and used in the build. 
+By default, the `main` head will be used for each repository but the listed CMake argument can be used to override the value. -* triton-inference-server/backend: -DTRITON_BACKEND_REPO_TAG=[tag] -* triton-inference-server/core: -DTRITON_CORE_REPO_TAG=[tag] -* triton-inference-server/common: -DTRITON_COMMON_REPO_TAG=[tag] +* triton-inference-server/backend: `-DTRITON_BACKEND_REPO_TAG=[tag]` +* triton-inference-server/core: `-DTRITON_CORE_REPO_TAG=[tag]` +* triton-inference-server/common: `-DTRITON_COMMON_REPO_TAG=[tag]` ## Build the PyTorch Backend With Custom PyTorch -Currently, Triton requires that a specially patched version of -PyTorch be used with the PyTorch backend. The full source for -these PyTorch versions are available as Docker images from -[NGC](https://ngc.nvidia.com). For example, the PyTorch version -compatible with the 25.09 release of Triton is available as -nvcr.io/nvidia/pytorch:25.09-py3. +Currently, Triton requires that a specially patched version of PyTorch be used with the PyTorch backend. +The full source for these PyTorch versions are available as Docker images from +[NGC](https://ngc.nvidia.com). -Copy over the LibTorch and Torchvision headers and libraries from the +For example, the PyTorch version compatible with the 25.09 release of Triton is available as `nvcr.io/nvidia/pytorch:25.09-py3` which supports PyTorch version `2.9.0a0`. + +> [!NOTE] +> Additional details and version information can be found in the container's +> [release notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-25-09.html#rel-25-09). + +Copy over the LibTorch and TorchVision headers and libraries from the [PyTorch NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch) -into local directories. You can see which headers and libraries -are needed/copied from the docker. +into local directories. +You can see which headers and libraries are needed/copied from the docker. -``` -$ mkdir build -$ cd build -$ cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_PYTORCH_INCLUDE_PATHS="/torch;/torch/torch/csrc/api/include;/torchvision" -DTRITON_PYTORCH_LIB_PATHS="" .. -$ make install +```bash +mkdir build +cd build +cmake -DCMAKE_INSTALL_PREFIX:PATH=`pwd`/install -DTRITON_PYTORCH_INCLUDE_PATHS="/torch;/torch/torch/csrc/api/include;/torchvision" -DTRITON_PYTORCH_LIB_PATHS="" .. +make install ``` ## Using the PyTorch Backend -### Parameters +### PyTorch 2.0 Models -Triton exposes some flags to control the execution mode of the TorchScript models through -the Parameters section of the model's `config.pbtxt` file. +The model repository should look like: -* `DISABLE_OPTIMIZED_EXECUTION`: Boolean flag to disable the optimized execution -of TorchScript models. By default, the optimized execution is always enabled. +```bash +model_repository/ +`-- model_directory + |-- 1 + | |-- model.py + | `-- [model.pt] + `-- config.pbtxt +``` -The initial calls to a loaded TorchScript model take extremely long. Due to this longer -model warmup [issue](https://github.com/pytorch/pytorch/issues/57894), Triton also allows -execution of models without these optimizations. In some models, optimized execution -does not benefit performance as seen [here](https://github.com/pytorch/pytorch/issues/19978) -and in other cases impacts performance negatively, as seen [here](https://github.com/pytorch/pytorch/issues/53824). +The `model.py` contains the class definition of the PyTorch model. 
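+A minimal sketch of such a `model.py` is shown below; the class name and its logic are purely illustrative, and the requirements the class must satisfy follow.
+
+```python
+import torch
+
+
+class AddSubModel(torch.nn.Module):
+    """Toy model returning the element-wise sum and difference of two input tensors."""
+
+    def forward(self, input0: torch.Tensor, input1: torch.Tensor):
+        # The two returned tensors become the two model outputs.
+        return input0 + input1, input0 - input1
+```
+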
+The class should extend the +[`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module). +The `model.pt` may be optionally provided which contains the saved +[`state_dict`](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference) +of the model. -The section of model config file specifying this parameter will look like: +### TorchScript Models -``` -parameters: { -key: "DISABLE_OPTIMIZED_EXECUTION" - value: { - string_value: "true" - } -} +The model repository should look like: + +```bash +model_repository/ +`-- model_directory + |-- 1 + | `-- model.pt + `-- config.pbtxt ``` -* `INFERENCE_MODE`: Boolean flag to enable the Inference Mode execution -of TorchScript models. By default, the inference mode is enabled. +The `model.pt` is the TorchScript model file. -[InferenceMode](https://pytorch.org/cppdocs/notes/inference_mode.html) is a new -RAII guard analogous to NoGradMode to be used when you are certain your operations -will have no interactions with autograd. Compared to NoGradMode, code run under -this mode gets better performance by disabling autograd. +## Configuration -Please note that in some models, InferenceMode might not benefit performance -and in fewer cases might impact performance negatively. +Triton exposes some flags to control the execution mode of the TorchScript models through the `Parameters` section of the model's `config.pbtxt` file. -The section of model config file specifying this parameter will look like: +### Parameters -``` -parameters: { -key: "INFERENCE_MODE" - value: { - string_value: "true" - } -} -``` +* `DISABLE_OPTIMIZED_EXECUTION`: + Boolean flag to disable the optimized execution of TorchScript models. + By default, the optimized execution is always enabled. -* `DISABLE_CUDNN`: Boolean flag to disable the cuDNN library. By default, cuDNN is enabled. + The initial calls to a loaded TorchScript model take a significant amount of time. + Due to this longer model warmup + ([pytorch #57894](https://github.com/pytorch/pytorch/issues/57894)), + Triton also allows execution of models without these optimizations. + In some models, optimized execution does not benefit performance + ([pytorch #19978](https://github.com/pytorch/pytorch/issues/19978)) + and in other cases impacts performance negatively + ([pytorch #53824](https://github.com/pytorch/pytorch/issues/53824)). -[cuDNN](https://developer.nvidia.com/cudnn) is a GPU-accelerated library of primitives for -deep neural networks. cuDNN provides highly tuned implementations for standard routines. + The section of model config file specifying this parameter will look like: -Typically, models run with cuDNN enabled are faster. However there are some exceptions -where using cuDNN can be slower, cause higher memory usage or result in errors. + ```proto + parameters: { + key: "DISABLE_OPTIMIZED_EXECUTION" + value: { string_value: "true" } + } + ``` +* `INFERENCE_MODE`: -The section of model config file specifying this parameter will look like: + Boolean flag to enable the Inference Mode execution of TorchScript models. + By default, the inference mode is enabled. -``` -parameters: { -key: "DISABLE_CUDNN" - value: { - string_value: "true" - } -} -``` + [InferenceMode](https://pytorch.org/cppdocs/notes/inference_mode.html) is a new RAII guard analogous to `NoGradMode` to be used when you are certain your operations will have no interactions with autograd. 
+ Compared to `NoGradMode`, code run under this mode gets better performance by disabling autograd. -* `ENABLE_WEIGHT_SHARING`: Boolean flag to enable model instances on the same device to -share weights. This optimization should not be used with stateful models. If not specified, -weight sharing is disabled. + Please note that in some models, InferenceMode might not benefit performance and in fewer cases might impact performance negatively. -The section of model config file specifying this parameter will look like: + To enable inference mode, use the configuration example below: -``` -parameters: { -key: "ENABLE_WEIGHT_SHARING" - value: { - string_value: "true" - } -} -``` + ```proto + parameters: { + key: "INFERENCE_MODE" + value: { string_value: "true" } + } + ``` -* `ENABLE_CACHE_CLEANING`: Boolean flag to enable CUDA cache cleaning after each model execution. -If not specified, cache cleaning is disabled. This flag has no effect if model is on CPU. -Setting this flag to true will negatively impact the performance due to additional CUDA cache -cleaning operation after each model execution. Therefore, you should only use this flag if you -serve multiple models with Triton and encounter CUDA out of memory issue during model executions. +* `DISABLE_CUDNN`: -The section of model config file specifying this parameter will look like: + Boolean flag to disable the cuDNN library. + By default, cuDNN is enabled. -``` -parameters: { -key: "ENABLE_CACHE_CLEANING" - value: { - string_value:"true" - } -} -``` + [cuDNN](https://developer.nvidia.com/cudnn) is a GPU-accelerated library of primitives for deep neural networks. + It provides highly tuned implementations for standard routines. + + Typically, models run with cuDNN enabled execute faster. + However there are some exceptions where using cuDNN can be slower, cause higher memory usage, or result in errors. + + To disable cuDNN, use the configuration example below: + + ```proto + parameters: { + key: "DISABLE_CUDNN" + value: { string_value: "true" } + } + ``` + +* `ENABLE_WEIGHT_SHARING`: + + Boolean flag to enable model instances on the same device to share weights. + This optimization should not be used with stateful models. + If not specified, weight sharing is disabled. + + To enable weight sharing, use the configuration example below: + + ```proto + parameters: { + key: "ENABLE_WEIGHT_SHARING" + value: { string_value: "true" } + } + ``` + +* `ENABLE_CACHE_CLEANING`: + + Boolean flag to enable CUDA cache cleaning after each model execution. + If not specified, cache cleaning is disabled. + This flag has no effect if model is on CPU. + + Setting this flag to true will likely negatively impact the performance due to additional CUDA cache cleaning operation after each model execution. + Therefore, you should only use this flag if you serve multiple models with Triton and encounter CUDA out-of-memory issues during model executions. + + To enable cleaning of the CUDA cache after every execution, use the configuration example below: + + ```proto + parameters: { + key: "ENABLE_CACHE_CLEANING" + value: { string_value: "true" } + } + ``` * `INTER_OP_THREAD_COUNT`: -PyTorch allows using multiple CPU threads during TorchScript model inference. -One or more inference threads execute a model's forward pass on the given -inputs. Each inference thread invokes a JIT interpreter that executes the ops -of a model inline, one by one. This parameter sets the size of this thread -pool. The default value of this setting is the number of cpu cores. 
Please refer -to [this](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html) -document on how to set this parameter properly. + PyTorch allows using multiple CPU threads during TorchScript model inference. + One or more inference threads execute a model’s forward pass on the given inputs. + Each inference thread invokes a JIT interpreter that executes the ops of a model inline, one by one. -The section of model config file specifying this parameter will look like: + This parameter sets the size of this thread pool. + The default value of this setting is the number of cpu cores. -``` -parameters: { -key: "INTER_OP_THREAD_COUNT" - value: { - string_value:"1" - } -} -``` + > [!TIP] + > Refer to + > [CPU Threading TorchScript](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html) + > on how to set this parameter properly. + + To set the inter-op thread count, use the configuration example below: + + ```proto + parameters: { + key: "INTER_OP_THREAD_COUNT" + value: { string_value: "1" } + } + ``` > [!NOTE] > This parameter is set globally for the PyTorch backend. @@ -225,70 +265,68 @@ key: "INTER_OP_THREAD_COUNT" * `INTRA_OP_THREAD_COUNT`: -In addition to the inter-op parallelism, PyTorch can also utilize multiple threads -within the ops (intra-op parallelism). This can be useful in many cases, including -element-wise ops on large tensors, convolutions, GEMMs, embedding lookups and -others. The default value for this setting is the number of CPU cores. Please refer -to [this](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html) -document on how to set this parameter properly. + In addition to the inter-op parallelism, PyTorch can also utilize multiple threads within the ops (intra-op parallelism). + This can be useful in many cases, including element-wise ops on large tensors, convolutions, GEMMs, embedding lookups and others. -The section of model config file specifying this parameter will look like: + The default value for this setting is the number of CPU cores. -``` -parameters: { -key: "INTRA_OP_THREAD_COUNT" - value: { - string_value:"1" - } -} -``` + > [!TIP] + > Refer to + > [CPU Threading TorchScript](https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html) + > on how to set this parameter properly. -> [!NOTE] -> This parameter is set globally for the PyTorch backend. -> The value from the first model config file that specifies this parameter will be used. -> Subsequent values from other model config files, if different, will be ignored. + To set the intra-op thread count, use the configuration example below: + + ```proto + parameters: { + key: "INTRA_OP_THREAD_COUNT" + value: { string_value: "1" } + } + ``` -* Additional Optimizations: Three additional boolean parameters are available to disable -certain Torch optimizations that can sometimes cause latency regressions in models with -complex execution modes and dynamic shapes. If not specified, all are enabled by default. +* **Additional Optimizations**: + + Three additional boolean parameters are available to disable certain Torch optimizations that can sometimes cause latency regressions in models with complex execution modes and dynamic shapes. + If not specified, all are enabled by default. 
`ENABLE_JIT_EXECUTOR` `ENABLE_JIT_PROFILING` -### PyTorch 2.0 Models +### Model Instance Group Kind -The model repository should look like: +The PyTorch backend supports the following kinds of +[Model Instance Groups](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups) +where the input tensors are placed as follows: -```bash -model_repository/ -`-- model_directory - |-- 1 - | |-- model.py - | `-- [model.pt] - `-- config.pbtxt -``` +* `KIND_GPU`: -The `model.py` contains the class definition of the PyTorch model. -The class should extend the -[`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module). -The `model.pt` may be optionally provided which contains the saved -[`state_dict`](https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference) -of the model. + Inputs are prepared on the GPU device associated with the model instance. -### TorchScript Models +* `KIND_CPU`: -The model repository should look like: + Inputs are prepared on the CPU. -```bash -model_repository/ -`-- model_directory - |-- 1 - | `-- model.pt - `-- config.pbtxt -``` +* `KIND_MODEL`: -The `model.pt` is the TorchScript model file. + Inputs are prepared on the CPU. + When loading the model, the backend does not choose the GPU device for the model; + instead, it respects the device(s) specified in the model and uses them as they are during inference. + + This is useful when the model internally utilizes multiple GPUs, as demonstrated in + [this example model](https://github.com/triton-inference-server/server/blob/main/qa/L0_libtorch_instance_group_kind_model/gen_models.py). + + > [!IMPORTANT] + > If a device is not specified in the model, the backend uses the first available GPU device. + +To set the model instance group, use the configuration example below: + +```proto +instance_group { + count: 2 + kind: KIND_GPU +} +``` ### Customization @@ -329,69 +367,46 @@ parameters: { } ``` -### Support +## Important Notes -#### Model Instance Group Kind +* The execution of PyTorch model on GPU is asynchronous in nature. + See + [CUDA Asynchronous Execution](https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution) + for additional details. + Consequently, an error in PyTorch model execution may be raised during the next few inference requests to the server. + Setting environment variable `CUDA_LAUNCH_BLOCKING=1` when launching server will help in correctly debugging failing cases by forcing synchronous execution. -The PyTorch backend supports the following kinds of -[Model Instance Groups](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups) -where the input tensors are placed as follows: + * The PyTorch model in such cases may or may not recover from the failed state and a restart of the server may be required to continue serving successfully. -* `KIND_GPU`: Inputs are prepared on the GPU device associated with the model -instance. - -* `KIND_CPU`: Inputs are prepared on the CPU. - -* `KIND_MODEL`: Inputs are prepared on the CPU. When loading the model, the -backend does not choose the GPU device for the model; instead, it respects the -device(s) specified in the model and uses them as they are during inference. 
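+
+  For example, a debugging run of the server might look like this (the model repository path is illustrative):
+
+  ```bash
+  # Force synchronous CUDA execution so errors are reported at the failing call
+  CUDA_LAUNCH_BLOCKING=1 tritonserver --model-repository=/models
+  ```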
-This is useful when the model internally utilizes multiple GPUs, as demonstrated
-in this
-[example model](https://github.com/triton-inference-server/server/blob/main/qa/L0_libtorch_instance_group_kind_model/gen_models.py).
-If no device is specified in the model, the backend uses the first available
-GPU device. This feature is available starting in the 23.06 release.
-
-### Important Notes
-
-* The execution of PyTorch model on GPU is asynchronous in nature. See
-  [here](https://pytorch.org/docs/stable/notes/cuda.html#asynchronous-execution)
-  for more details. Consequently, an error in PyTorch model execution may
-  be raised during the next few inference requests to the server. Setting
-  environment variable `CUDA_LAUNCH_BLOCKING=1` when launching server will
-  help in correctly debugging failing cases by forcing synchronous execution.
-  * The PyTorch model in such cases may or may not recover from the failed
-    state and a restart of the server may be required to continue serving
-    successfully.
-
-* PyTorch does not support Tensor of Strings but it does support models that
-accept a List of Strings as input(s) / produces a List of String as output(s).
-For these models Triton allows users to pass String input(s)/receive String
-output(s) using the String datatype. As a limitation of using List instead of
-Tensor for String I/O, only for 1-dimensional input(s)/output(s) are supported
-for I/O of String type.
+* PyTorch does not support Tensor of Strings, but it does support models that accept a List of Strings as input(s) / produce a List of Strings as output(s).
+  For these models, Triton allows users to pass String input(s) and receive String output(s) using the String datatype.
+  As a limitation of using a List instead of a Tensor for String I/O, only 1-dimensional input(s)/output(s) of String type are supported.

 * In a multi-GPU environment, a potential runtime issue can occur when using
-[Tracing](https://pytorch.org/docs/stable/generated/torch.jit.trace.html)
-to generate a
-[TorchScript](https://pytorch.org/docs/stable/jit.html) model. This issue
-arises due to a device mismatch between the model instance and the tensor. By
-default, Triton creates a single execution instance of the model for each
-available GPU. The runtime error occurs when a request is sent to a model
-instance with a different GPU device from the one used during the TorchScript
-generation process. To address this problem, it is highly recommended to use
-[Scripting](https://pytorch.org/docs/stable/generated/torch.jit.script.html#torch.jit.script)
-instead of Tracing for model generation in a multi-GPU environment. Scripting
-avoids the device mismatch issue and ensures compatibility with different GPUs
-when used with Triton. However, if using Tracing is unavoidable, there is a
-workaround available. You can explicitly specify the GPU device for the model
-instance in the
-[model configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups)
-to ensure that the model instance and the tensors used for inference are
-assigned to the same GPU device as on which the model was traced.
-
-* Python functions optimizable by `torch.compile` may not be served directly in the `model.py` file, they need to be enclosed by a class extending the
-  [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module).
+  [Tracing](https://pytorch.org/docs/stable/generated/torch.jit.trace.html)
+  to generate a
+  [TorchScript](https://pytorch.org/docs/stable/jit.html)
+  model.
+  This issue arises due to a device mismatch between the model instance and the tensor.

-* Model weights cannot be shared across multiple instances on the same GPU device.
+  By default, Triton creates a single execution instance of the model for each available GPU.
+  The runtime error occurs when a request is sent to a model instance with a different GPU device from the one used during the TorchScript generation process.
+
+  To address this problem, it is highly recommended to use
+  [Scripting](https://pytorch.org/docs/stable/generated/torch.jit.script.html#torch.jit.script)
+  instead of Tracing for model generation in a multi-GPU environment.
+  Scripting avoids the device mismatch issue and ensures compatibility with different GPUs when used with Triton.
+
+  However, if using Tracing is unavoidable, there is a workaround available.
+  You can explicitly specify the GPU device for the model instance in the
+  [model configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#instance-groups)
+  to ensure that the model instance and the tensors used for inference are assigned to the same GPU device on which the model was traced.

 * When using `KIND_MODEL` as model instance kind, the default device of the first parameter on the model is used.
+
+> [!WARNING]
+>
+> * Python functions optimizable by `torch.compile` may not be served directly in the `model.py` file; they need to be enclosed by a class extending the
+>   [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module).
+>
+> * Model weights cannot be shared across multiple instances on the same GPU device.
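+
+As an illustration of the first warning above, a plain Python function intended for `torch.compile` can be wrapped in a module before it is served; the function and class names below are hypothetical.
+
+```python
+import torch
+
+
+def add_scaled(x: torch.Tensor, y: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
+    # A free function like this cannot be served directly from `model.py`.
+    return x + scale * y
+
+
+class AddScaledModel(torch.nn.Module):
+    """Wraps the free function in a `torch.nn.Module` subclass so it can be served."""
+
+    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
+        return add_scaled(x, y)
+```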