- Lecture: slides, video
- Seminar: folder, video
- Homework: see homework/README.md
- Blog post about reduced-precision floating-point formats
- NVIDIA blog posts about mixed precision training with Tensor Cores, Tensor Core performance tips, TF32 Tensor Cores
- Presentations about Tensor Cores: one, two, three
- Tensor Core Requirements and Mixed Precision Training sections of the NVIDIA DL performance guide
- Automatic Mixed Precision in PyTorch
- TF32 section of PyTorch CUDA docs
- AMP, FP16 and BF16 in DeepSpeed
- PyTorch Performance Tuning Guide
- Latency Numbers Every Programmer Should Know
- Pillow Performance benchmarks
- Faster Image Processing tips from fastai docs
- Rapid Data Pre-Processing with NVIDIA DALI
- General-purpose Python profilers: built-in modules (cProfile and profile), pyinstrument, memory_profiler, py-spy, Scalene
- DLProf user guide
- How to profile with DLProf
- Profiling and Optimizing Deep Neural Networks with DLProf and PyProf
- NVIDIA presentations on profiling DL networks, profiling for DL and mixed precision
- Profiling Deep Learning Workloads
- PyTorch Profiler and PyTorch Profiler with TensorBoard tutorial
- torch.utils.bottleneck quick guide
- PyTorch Autograd profiler tutorial
- Nsight Systems and Nsight Compute user guides
- Video tutorial about speeding up and profiling neural networks