diff --git a/docs/source-fabric/advanced/compile.rst b/docs/source-fabric/advanced/compile.rst
new file mode 100644
index 0000000000000..3e47991675b54
--- /dev/null
+++ b/docs/source-fabric/advanced/compile.rst
@@ -0,0 +1,299 @@
+#################################
+Speed up models by compiling them
+#################################
+
+Compiling your PyTorch model can result in significant speedups, especially on the latest generations of GPUs.
+This guide shows you how to apply ``torch.compile`` correctly in your code.
+
+.. note::
+
+    This requires PyTorch >= 2.0.
+
+
+----
+
+
+*********************************
+Apply torch.compile to your model
+*********************************
+
+Compiling a model in a script together with Fabric is as simple as adding one line of code, calling :func:`torch.compile`:
+
+.. code-block:: python
+
+    import torch
+    import lightning as L
+
+    # Set up Fabric
+    fabric = L.Fabric(devices=1)
+
+    # Define the model
+    model = ...
+
+    # Compile the model
+    model = torch.compile(model)
+
+    # `fabric.setup()` should come after `torch.compile()`
+    model = fabric.setup(model)
+
+
+.. important::
+
+    You should compile the model **before** calling ``fabric.setup()`` as shown above for optimal integration with features in Fabric.
+
+The newly added call to ``torch.compile()`` by itself doesn't do much. It just wraps the model in a "compiled model".
+The actual optimization will start when calling ``forward()`` on the model for the first time:
+
+.. code-block:: python
+
+    # 1st execution compiles the model (slow)
+    output = model(input)
+
+    # All future executions will be fast (for inputs of the same size)
+    output = model(input)
+    output = model(input)
+    ...
+
+This is important to know when you measure the speed of a compiled model and compare it to a regular model.
+You should always *exclude* the first call to ``forward()`` from your measurements, since it includes the compilation time.
+
+.. collapse:: Full example with benchmark
+
+    Below is an example that measures the speedup you get when compiling the InceptionV3 model from TorchVision.
+
+    .. code-block:: python
+
+        import statistics
+        import torch
+        import torchvision.models as models
+        import lightning as L
+
+
+        @torch.no_grad()
+        def benchmark(model, input, num_iters=10):
+            """Runs the model on the input several times and returns the median execution time."""
+            start = torch.cuda.Event(enable_timing=True)
+            end = torch.cuda.Event(enable_timing=True)
+            times = []
+            for _ in range(num_iters):
+                start.record()
+                model(input)
+                end.record()
+                torch.cuda.synchronize()
+                times.append(start.elapsed_time(end) / 1000)
+            return statistics.median(times)
+
+
+        fabric = L.Fabric(accelerator="cuda", devices=1)
+
+        model = models.inception_v3()
+        input = torch.randn(16, 3, 512, 512, device=fabric.device)
+
+        # Compile!
+        compiled_model = torch.compile(model)
+
+        # Set up the model with Fabric
+        model = fabric.setup(model)
+        compiled_model = fabric.setup(compiled_model)
+
+        # Warm up the compiled model before we benchmark
+        compiled_model(input)
+
+        # Run multiple forward passes and time them
+        eager_time = benchmark(model, input)
+        compile_time = benchmark(compiled_model, input)
+
+        # Compare the speedup for the compiled execution
+        speedup = eager_time / compile_time
+        print(f"Eager median time: {eager_time:.4f} seconds")
+        print(f"Compile median time: {compile_time:.4f} seconds")
+        print(f"Speedup: {speedup:.1f}x")
+
+    On an NVIDIA A100 SXM4 40GB with PyTorch 2.2.0, CUDA 12.1, we get the following speedup:
+
+    .. code-block:: text
+
+        Eager median time: 0.0254 seconds
+        Compile median time: 0.0185 seconds
+        Speedup: 1.4x
+
+
+----
+
+
+******************
+Avoid graph breaks
+******************
+
+When ``torch.compile`` looks at the code in your model's ``forward()`` method, it will try to compile as much of the code as possible.
+If there are regions in the code that it doesn't understand, it will introduce a so-called "graph break" that essentially splits the code into optimized and unoptimized parts.
+Graph breaks aren't a deal breaker, since the optimized parts should still run faster.
+But if you want to get the most out of ``torch.compile``, you might want to invest in rewriting the problematic sections of code that produce the breaks.
+
+You can check whether your model produces graph breaks by calling ``torch.compile`` with ``fullgraph=True``:
+
+.. code-block:: python
+
+    # Force an error if there is a graph break in the model
+    model = torch.compile(model, fullgraph=True)
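+
+One of the most common sources of graph breaks is Python control flow that depends on tensor *values*, since the compiler cannot know at trace time which branch will be taken.
+The hypothetical toy model below (not part of the benchmark code above) illustrates this; with ``fullgraph=True``, compiling it raises an error that points at the offending line:
+
+.. code-block:: python
+
+    import torch
+
+
+    class ToyModel(torch.nn.Module):
+        def forward(self, x):
+            # Branching on a tensor value forces a graph break: the condition
+            # is only known at runtime, so it can't be captured in one graph
+            if x.sum() > 0:
+                return x.relu()
+            return x.sigmoid()
+
+
+    model = torch.compile(ToyModel(), fullgraph=True)
+    model(torch.randn(4, 4))  # raises an error at the `if` statement
+
+A branch like this can often be rewritten with tensor operations the compiler understands, for example ``torch.where(x.sum() > 0, x.relu(), x.sigmoid())``.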
+
+Be aware that the error messages produced here are often quite cryptic, so you will likely have to do some `troubleshooting <https://pytorch.org/docs/stable/torch.compiler_troubleshooting.html>`_ to fully optimize your model.
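+
+If you would rather get an overview than hard errors, PyTorch also ships ``torch._dynamo.explain``, which runs the model once and reports the number of captured graphs, graph breaks, and the reason for each break. Note that this utility lives in a private namespace and its call signature and output format have shifted between PyTorch versions; the form below matches PyTorch 2.1+:
+
+.. code-block:: python
+
+    import torch
+
+    # Pass the original (uncompiled) model and an example input to collect
+    # graph-break statistics instead of compiling the model
+    explanation = torch._dynamo.explain(model)(input)
+    print(explanation)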
+
+
+----
+
+
+*******************
+Avoid recompilation
+*******************
+
+As mentioned before, the compilation of the model happens the first time you call ``forward()``.
+At this point, PyTorch will inspect the input tensor(s) and optimize the compiled code for the particular shape, data type, and other properties the input has.
+If the shape of the input remains the same across all calls to ``forward()``, PyTorch will reuse the compiled code it generated and you will get the best speedup.
+However, if these properties change across subsequent calls to ``forward()``, PyTorch will be forced to recompile the model for the new shapes, and this will significantly slow down your training if it happens on every iteration.
+
+**When your training suddenly becomes slow, it's probably because PyTorch is recompiling the model!**
+Here are some common scenarios when this can happen:
+
+- Your training code switches from training to validation/testing and the input shape changes, triggering a recompilation.
+- Your dataset size is not divisible by the batch size, and the dataloader has ``drop_last=False`` (the default).
+  The last batch in your training loop will then be smaller and trigger a recompilation (see the sketch below).
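+
+For the last scenario, a minimal sketch of the fix (assuming an existing ``train_dataset``) is to drop the last incomplete batch so that every batch has the same shape:
+
+.. code-block:: python
+
+    from torch.utils.data import DataLoader
+
+    # Dropping the last incomplete batch keeps the batch dimension constant
+    # across iterations, so the compiled model never sees a new input shape
+    train_loader = DataLoader(train_dataset, batch_size=32, drop_last=True)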
+
+Ideally, you should try to make the input shape(s) to ``forward()`` static.
+However, when this is not possible, you can ask PyTorch to compile code that takes possible changes to the input shapes into account:
+
+.. code-block:: python
+
+    # On PyTorch < 2.2
+    model = torch.compile(model, dynamic=True)
+
+A model compiled with ``dynamic=True`` will typically be slower than a model compiled with static shapes, but it avoids the extreme cost of recompilation on every iteration.
+On PyTorch 2.2 and later, ``torch.compile`` will detect dynamism automatically and you should no longer need to set this.
+
+.. collapse:: Example with dynamic shapes
+
+    The code below shows an example where the model spends several seconds recompiling because the input shape changed.
+    You can compare the timing results by toggling ``torch._dynamo.config.automatic_dynamic_shapes`` between ``True`` and ``False``:
+
+    .. code-block:: python
+
+        import time
+        import torch
+        import torchvision.models as models
+        import lightning as L
+
+        fabric = L.Fabric(accelerator="cuda", devices=1)
+
+        model = models.inception_v3()
+
+        # Disable automatic dynamic-shape handling so the shape change
+        # below triggers a full recompilation
+        torch._dynamo.config.automatic_dynamic_shapes = False
+
+        compiled_model = torch.compile(model)
+        compiled_model = fabric.setup(compiled_model)
+
+        input = torch.randn(16, 3, 512, 512, device=fabric.device)
+        t0 = time.time()
+        compiled_model(input)
+        torch.cuda.synchronize()
+        print(f"1st forward: {time.time() - t0:.2f} seconds.")
+
+        input = torch.randn(8, 3, 512, 512, device=fabric.device)  # note the change in shape
+        t0 = time.time()
+        compiled_model(input)
+        torch.cuda.synchronize()
+        print(f"2nd forward: {time.time() - t0:.2f} seconds.")
+
+    With ``automatic_dynamic_shapes=True``:
+
+    .. code-block:: text
+
+        1st forward: 41.90 seconds.
+        2nd forward: 89.27 seconds.
+
+    With ``automatic_dynamic_shapes=False``:
+
+    .. code-block:: text
+
+        1st forward: 42.12 seconds.
+        2nd forward: 47.77 seconds.
+
+    Numbers produced with NVIDIA A100 SXM4 40GB, PyTorch 2.2.0, CUDA 12.1.
+
+
+----
+
+
+***********************************
+Experiment with compilation options
+***********************************
+
+There are optional settings that, depending on your model, can give additional speedups.
+
+**CUDA Graphs:** By enabling CUDA Graphs, CUDA will record all computations in a graph and replay it every time ``forward()`` and ``backward()`` are called.
+The requirement is that your model must be static, i.e., the input shape must not change and your model must execute the same operations every time.
+Enabling CUDA Graphs often results in a significant speedup, but sometimes also increases the memory usage of your model.
+
+.. code-block:: python
+
+    # Enable CUDA Graphs
+    compiled_model = torch.compile(model, mode="reduce-overhead")
+
+    # This does the same
+    compiled_model = torch.compile(model, options={"triton.cudagraphs": True})
+
+|
+
+**Shape padding:** The specific shape/size of the tensors involved in the computation of your model (input, activations, weights, gradients, etc.) can have an impact on performance.
+With shape padding enabled, ``torch.compile`` can extend the tensors by padding them to a size that gives better memory alignment.
+Naturally, the tradeoff here is that it will consume a bit more memory.
+
+.. code-block:: python
+
+    # Default is False
+    compiled_model = torch.compile(model, options={"shape_padding": True})
+
+You can find a full list of compile options in the `PyTorch documentation <https://pytorch.org/docs/stable/generated/torch.compile.html>`_.
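+
+Another option worth experimenting with is the ``"max-autotune"`` mode, in which the compiler benchmarks several candidate implementations of the most expensive operations and picks the fastest one. Compilation takes considerably longer in this mode, so whether the extra compile time pays off depends on your model:
+
+.. code-block:: python
+
+    # Spend extra time during compilation searching for faster kernels
+    compiled_model = torch.compile(model, mode="max-autotune")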
+
+
+----
+
+
+*******************************************************
+(Experimental) Apply torch.compile over FSDP, DDP, etc.
+*******************************************************
+
+As stated earlier, we recommend that you compile the model before calling ``fabric.setup()``.
+However, if you are using DDP or FSDP with Fabric, the compilation won't incorporate the distributed calls inside these wrappers by default.
+As an experimental feature, you can let ``fabric.setup()`` reapply the ``torch.compile`` call after the model gets wrapped in DDP/FSDP internally.
+In the future, this option will become the default.
+
+.. code-block:: python
+
+    # Choose a distributed strategy like DDP or FSDP
+    fabric = L.Fabric(devices=2, strategy="ddp")
+
+    # Compile the model
+    model = torch.compile(model)
+
+    # Default: `fabric.setup()` will not reapply the compilation over DDP/FSDP
+    model = fabric.setup(model, _reapply_compile=False)
+
+    # Recompile the model over DDP/FSDP (experimental)
+    model = fabric.setup(model, _reapply_compile=True)
+
+
+----
+
+
+**************************************
+A note about torch.compile in practice
+**************************************
+
+In practice, you will find that ``torch.compile`` often doesn't work well and can even be counter-productive.
+Compilation may fail with cryptic error messages that are impossible to debug without help from the PyTorch team.
+It is also not uncommon for ``torch.compile`` to produce a significantly *slower* model or one with much higher memory usage.
+On top of that, the compilation phase itself can be incredibly slow, taking several minutes to finish.
+For these reasons, we recommend that you don't spend too much time trying to apply ``torch.compile`` during development, and instead evaluate its effectiveness toward the end, when you are about to launch long-running, expensive experiments.
+Always compare the speed and memory usage of the compiled model against the original model!
+
+|
diff --git a/docs/source-fabric/glossary/index.rst b/docs/source-fabric/glossary/index.rst
index 298c08f4e2da5..ebfa1b23a3bd3 100644
--- a/docs/source-fabric/glossary/index.rst
+++ b/docs/source-fabric/glossary/index.rst
@@ -69,6 +69,11 @@ Glossary
    :button_link: ../advanced/distributed_communication.html
    :col_css: col-md-4
 
+.. displayitem::
+   :header: Compile
+   :button_link: ../advanced/compile.html
+   :col_css: col-md-4
+
 .. displayitem::
    :header: CUDA
    :button_link: ../fundamentals/accelerators.html
diff --git a/docs/source-fabric/guide/index.rst b/docs/source-fabric/guide/index.rst
index 7b13e8eb4bbc7..795d756d33549 100644
--- a/docs/source-fabric/guide/index.rst
+++ b/docs/source-fabric/guide/index.rst
@@ -157,6 +157,14 @@ Advanced Topics
    :height: 160
    :tag: advanced
 
+.. displayitem::
+   :header: Speed up models by compiling them
+   :description: Use torch.compile to speed up models on modern hardware
+   :button_link: ../advanced/compile.html
+   :col_css: col-md-4
+   :height: 150
+   :tag: advanced
+
 .. displayitem::
    :header: Train models with billions of parameters
    :description: Train the largest models with FSDP across multiple GPUs and machines
diff --git a/docs/source-fabric/index.rst b/docs/source-fabric/index.rst
index d736ee1bb9e54..5051d9c1c02a9 100644
--- a/docs/source-fabric/index.rst
+++ b/docs/source-fabric/index.rst
@@ -113,8 +113,6 @@ Get Started
 
-.. Add callout items below this line
-
 .. displayitem::
    :header: Convert to Fabric in 5 minutes
    :description: Learn how to add Fabric to your PyTorch code
@@ -168,8 +166,6 @@ Get Started
-.. End of callout item section
-
 |
 |
diff --git a/docs/source-fabric/levels/advanced.rst b/docs/source-fabric/levels/advanced.rst
index 3760acab2e6da..965e848c7c993 100644
--- a/docs/source-fabric/levels/advanced.rst
+++ b/docs/source-fabric/levels/advanced.rst
@@ -5,6 +5,7 @@
    <../advanced/gradient_accumulation>
    <../advanced/distributed_communication>
    <../advanced/multiple_setup>
+   <../advanced/compile>
    <../advanced/model_parallel/fsdp>
    <../guide/checkpoint/distributed_checkpoint>
@@ -42,6 +43,14 @@ Advanced skills
    :height: 170
    :tag: advanced
 
+.. displayitem::
+   :header: Speed up models by compiling them
+   :description: Use torch.compile to speed up models on modern hardware
+   :button_link: ../advanced/compile.html
+   :col_css: col-md-4
+   :height: 170
+   :tag: advanced
+
 .. displayitem::
    :header: Train models with billions of parameters
    :description: Train the largest models with FSDP across multiple GPUs and machines