AI/ML resources to master state-of-the-art (SOTA) techniques, from engineers and researchers 🧠💻
Contents:
- End-to-end free guides to follow
- Interesting papers you MUST read
- Main AI blogs to read regularly (continuous learning)
- Deep dive into all core AI concepts [Learn step-by-step]
- MAYBE: guides you may optionally go through
- Want to contribute to leading AI open-source projects?
MUST:
- CS229: 20 videos on ML basics by Andrew Ng, Stanford University (rating 10/10)
- AI Engineering: book by Chip Huyen covering all major concepts in modern AI and AI engineering; a must-have reference (rating 10/10)
- CSE223: ML systems course by Prof. Hao Zhang, UC San Diego; core LLM serving engineering concepts (rating 10/10)
- The Ultra-Scale Playbook: by HuggingFace on Training LLMs on GPU Clusters (rating 8.5/10)
- AI and Memory Wall: why memory, not compute, is the main bottleneck for LLMs
- Collective Communication for 100k+ GPUs by Meta
- The Landscape of GPU-Centric Communication
- Pre-training under infinite compute by Stanford University
- Give Me BF16 or Give Me Death? by Red Hat, and Give Me FP32 or Give Me Death?
Others:
- LLMs don't just memorize, they build a geometric map that helps them reason, by Google
- Self-Adapting Language Models by MIT
- NVIDIA Developer Blog: Deep dive into multiple AI topics.
- TensorRT LLM tech blogs: deep dives into techniques and optimizations in one of the leading LLM inference libraries (13 posts as of now)
- SGLang tech blog: SGLang is one of the leading LLM serving frameworks; most posts center on SGLang itself but are rich in technical detail
- AI System co-design at Meta
YouTube channels to follow regularly:
- vLLM office hours: Deep dive into various technical topics in vLLM
- GPU Mode: Deep dive into various LLM topics from guests from the AI community
- PyTorch channel: videos from various PyTorch events, covering keynotes on technical topics like torch.compile
Only high-quality resources are listed: one good post per topic should be enough, no need to read hundreds.
- GPU architecture
Current SOTA AI/LLM workloads are possible only because of GPUs. Understanding GPU architecture gives you an engineering edge.
- Understanding GPU architecture through MatMul; building intuition about GPUs
- GPU shared memory banks / microbenchmarks
- GPU programming concepts
- CUDA programming model and GPU memory management: Mark Harris's GTC talk on coalesced memory access; prefix sum (scan) on the GPU (see the sketch after this list)
- Programming Massively Parallel Processors series on YT
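A minimal NumPy sketch of the work-efficient (Blelloch) exclusive scan covered above; it is a sequential simulation under simplifying assumptions (power-of-two length, int64 data), where each while-loop level stands in for one parallel GPU step and the strided slice updates model the threads that would run concurrently.

```python
import numpy as np

def blelloch_exclusive_scan(x):
    """Work-efficient exclusive prefix sum (Blelloch scan).

    Each while-loop level is one parallel step on a GPU; the strided
    slice updates stand in for threads running concurrently.
    Assumes len(x) is a power of two to keep the sketch short.
    """
    a = x.astype(np.int64).copy()
    n = len(a)
    # Up-sweep (reduce): build partial sums up a binary tree.
    step = 1
    while step < n:
        a[2 * step - 1::2 * step] += a[step - 1::2 * step]
        step *= 2
    # Down-sweep: push prefixes back down the tree.
    a[n - 1] = 0
    step = n // 2
    while step >= 1:
        left = a[step - 1::2 * step].copy()
        a[step - 1::2 * step] = a[2 * step - 1::2 * step]
        a[2 * step - 1::2 * step] += left
        step //= 2
    return a

print(blelloch_exclusive_scan(np.arange(8)))  # [ 0  0  1  3  6 10 15 21]
```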
- Performance
- Performance metrics via nccl-tests; profiling guide for Nsight; understanding DL performance (see the roofline sketch below)
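A quick roofline-style sanity check to go with the performance material above: compute a GEMM's arithmetic intensity and compare it to the GPU's compute-to-bandwidth ratio. The peak numbers are assumptions (roughly A100-class); substitute your own card's specs.

```python
# Roofline-style sanity check: is a GEMM compute- or memory-bound?
PEAK_FLOPS = 312e12   # assumed bf16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12      # assumed HBM bandwidth, bytes/s

def arithmetic_intensity(m, n, k, bytes_per_el=2):
    """FLOPs per byte of HBM traffic for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2 * m * n * k                              # one multiply-add = 2 FLOPs
    traffic = bytes_per_el * (m * k + k * n + m * n)   # read A and B, write C
    return flops / traffic

ridge = PEAK_FLOPS / PEAK_BW  # intensity needed to saturate compute
for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:   # training GEMM vs. decode GEMV
    ai = arithmetic_intensity(*shape)
    verdict = "compute-bound" if ai > ridge else "memory-bound"
    print(f"{shape}: {ai:.1f} FLOP/B (ridge {ridge:.0f}) -> {verdict}")
```

The decode-shaped (1, 4096, 4096) case lands far below the ridge point, which is the "memory wall" argument made in the paper listed earlier.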
- Transformer
- CME 295: Transformers and LLMs basics course by Stanford University
- Transformer overview: encoder-only and decoder-only models
- BERT (insightful): BERT as a single text-diffusion step
- Memory requirements for LLMs, with four parts: activations, parameters, gradients, and optimizer states. How do LLMs handle memory? (see the estimator sketch below)
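A back-of-the-envelope sketch of those four components for unsharded mixed-precision Adam training. The 16-bytes-per-parameter layout (bf16 weights and gradients, fp32 master weights plus two Adam states) is one common convention, not a universal rule; activations are left out because they depend on batch size, sequence length, and checkpointing.

```python
def training_memory_gb(n_params, bytes_weight=2, bytes_grad=2, bytes_optim=12):
    """Rough per-GPU memory for mixed-precision Adam training, no sharding.

    Assumed layout (one common convention, not the only one):
      - bf16 weights (2 B) and bf16 gradients (2 B)
      - fp32 optimizer states: master weights + momentum + variance (12 B)
    Activations are excluded on purpose: they scale with batch size,
    sequence length, and checkpointing strategy, not just n_params.
    """
    return n_params * (bytes_weight + bytes_grad + bytes_optim) / 1e9

for n in (7e9, 70e9):
    print(f"{n/1e9:.0f}B params -> ~{training_memory_gb(n):,.0f} GB before activations")
# 7B -> ~112 GB, 70B -> ~1,120 GB: this is why ZeRO/FSDP-style sharding exists.
```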
- Attention
- Self-attention / Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), MLA (used in DeepSeek)
- FlashAttention (v1 paper, v2 paper, v3 paper), online softmax (see the sketch after this list), implementation by Tri Dao
- Ring Attention (related to Context Parallelism, CP): handles long sequence lengths; FlexAttention by PyTorch
- KV cache, FP8 KV cache, PagedAttention
- Data Parallel (DP) Attention
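The online-softmax trick linked above is the numerical heart of FlashAttention; here is a minimal NumPy sketch of the streaming pass. In the full kernel the same rescaling is also applied to the running weighted sum of V; only the softmax itself is shown here.

```python
import numpy as np

def online_softmax(scores):
    """Streaming softmax: the numerical trick behind FlashAttention.

    Maintains a running max `m` and running denominator `d`, rescaling
    `d` whenever a larger score appears, so the row can be processed
    block by block without ever materializing all the scores at once.
    """
    m, d = -np.inf, 0.0
    for s in scores:
        m_new = max(m, s)
        d = d * np.exp(m - m_new) + np.exp(s - m_new)
        m = m_new
    return np.exp(scores - m) / d

x = np.array([1.0, 3.0, 2.0, 5.0])
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()  # standard two-pass softmax
assert np.allclose(online_softmax(x), ref)
```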
- Core operations
- GEMM / MatMul, API of GEMM, GEMM as the core of AI, W4A8 GEMM kernel
- MoE (Mixture of experts)
- Embeddings (deep dive), RoPE (paper; see the sketch below)
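A minimal NumPy sketch of RoPE to go with the paper above, using the split-halves pair layout found in Llama-style implementations (the original paper interleaves adjacent dimensions); shapes and names are illustrative.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding on a (seq_len, head_dim) array.

    Splits head_dim into two halves and rotates each (x1_i, x2_i) pair
    by angle position * theta_i, so q.k dot products depend only on
    relative position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.outer(np.arange(seq_len), theta)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(16, 64)
print(rope(q).shape)  # (16, 64); each row's norm is preserved by the rotation
```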
- Quantization
- Quantization basics
- Different data type simulations
- INT8 quantization using QAT, LLM quantization with PTQ, FP8 datatype, AWQ
- Per-tensor and per-block scaling (see the sketch after this list)
- NVFP4 training, Optimizing FP4 Mixed-Precision Inference on AMD GPUs, Recent LLM quantization progress
- Quantization on CPU (GGUF, AWQ, GPTQ); the GGUF quantization method; GPTQ: post-training quantization for LLMs; OBQ: post-training quantization and pruning
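A small NumPy sketch contrasting per-tensor and per-block absmax INT8 scaling, as referenced above: a single outlier forces a coarse scale onto the whole tensor, while per-block scaling confines the damage to one block. The block size and injected outlier are arbitrary illustrations.

```python
import numpy as np

def quantize_per_tensor(w):
    """Symmetric INT8 quantization with one absmax scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_per_block(w, block=64):
    """Per-block variant: one scale per `block` values, so one outlier
    only degrades its own block instead of the whole tensor."""
    wb = w.reshape(-1, block)
    scales = np.abs(wb).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(wb / scales), -127, 127).astype(np.int8)
    return q, scales

w = np.random.randn(4096).astype(np.float32)
w[0] = 20.0  # inject a single outlier weight
q_t, s_t = quantize_per_tensor(w)
q_b, s_b = quantize_per_block(w)
err_t = np.abs(q_t.astype(np.float32) * s_t - w).mean()
err_b = np.abs((q_b.astype(np.float32) * s_b).ravel() - w).mean()
print(f"mean abs error: per-tensor {err_t:.4f} vs per-block {err_b:.4f}")
```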
- Mixed precision training (see the autocast sketch after this list)
- NVFP4, GEMV kernel for NVFP4
- Details on FP8 training
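A minimal mixed-precision training step using PyTorch's torch.autocast, assuming a CUDA GPU with bf16 support; the model size, learning rate, and toy loss are placeholders. Because bf16 shares fp32's exponent range, no loss scaler is needed (fp16 would also want torch.cuda.amp.GradScaler).

```python
import torch

# Minimal sketch: one mixed-precision training step with autocast.
model = torch.nn.Linear(1024, 1024).cuda()            # weights stay fp32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # optimizer states in fp32

x = torch.randn(32, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)                    # matmul runs in bf16
    loss = out.float().pow(2).mean()  # reduce in fp32 for stability
loss.backward()                       # gradients land in the params' dtype (fp32)
opt.step()
opt.zero_grad(set_to_none=True)
```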
- Post-training
- Post-training concepts: SFT, RLHF, RLVR (see the SFT loss sketch after this list)
- Smol Training Playbook: The Secrets to Building World-Class LLMs
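A tiny sketch of the SFT objective mentioned above: next-token cross-entropy computed only on completion tokens, with prompt positions masked via the ignore index. The tensors are random stand-ins for a real model and tokenizer.

```python
import torch
import torch.nn.functional as F

# Minimal SFT loss: predict token t+1 from position t, but only score
# completion tokens; prompt positions are masked with the ignore index.
vocab, seq_len = 1000, 8
logits = torch.randn(1, seq_len, vocab)          # stand-in for model output
tokens = torch.randint(0, vocab, (1, seq_len))   # prompt + completion ids
prompt_len = 3

labels = tokens.clone()
labels[:, :prompt_len] = -100                    # ignore prompt positions
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),           # shift: position t ...
    labels[:, 1:].reshape(-1),                   # ... predicts token t+1
    ignore_index=-100,
)
print(loss.item())
```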
- Optimizations
- LLM inference optimizations; optimizations part 2
- 5D parallelism (DP, TP, PP, CP, EP, plus SP): parallelism concepts for scaling LLMs; parallelism in PyTorch
- Chunked prefill (SARATHI paper); dynamic and continuous batching
- KV cache offloading, KV cache early reuse
- Speculative decoding (basic introduction); look-ahead reasoning; papers from Google and DeepMind (see the sketch after this list)
- MoE using wide Expert Parallelism (EP)
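A toy sketch of the speculative-decoding loop referenced above, with plain callables standing in for the draft and target models. It uses the simplest greedy accept rule; the Google and DeepMind papers use a rejection-sampling rule that provably preserves the target model's output distribution.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    """Toy greedy speculative decoding loop.

    `draft` and `target` are callables mapping a token sequence to the
    argmax next token (stand-ins for a small and a large model). The
    draft proposes k tokens; the target checks all k+1 positions in
    what would be a single batched forward pass; the longest agreeing
    prefix is accepted, plus one token from the target for free.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. Target "scores" every prefix at once (one parallel pass).
        verified = [target(seq + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the agreeing prefix, then the target's next token.
        n = 0
        while n < k and proposal[n] == verified[n]:
            n += 1
        seq += proposal[:n] + [verified[n]]
    return seq

# Toy "models": next token = last + 1; the draft is wrong at every 4th step.
target = lambda s: s[-1] + 1
draft = lambda s: s[-1] + (2 if len(s) % 4 == 0 else 1)
print(speculative_decode(target, draft, [0]))  # [0, 1, 2, ..., 8]
```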
- Scheduling / Routing
- P/D disaggregation; the DistServe P/D disaggregation paper (see the sketch after this list)
- KVCache-centric disaggregated architecture by MooncakeAI
- OverFill: Two-Stage Models for Efficient Language Model Decoding by Cornell University
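A conceptual sketch of P/D disaggregation, heavily simplified: a compute-bound prefill worker builds the KV cache and hands it through a queue to a memory-bandwidth-bound decode worker. All names and the fake "model" arithmetic are illustrative; real systems such as DistServe and Mooncake move KV tensors between separate GPU pools over NVLink/RDMA.

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: list
    kv_cache: list = field(default_factory=list)  # stand-in for K/V tensors
    output: list = field(default_factory=list)

def prefill_worker(req):
    """Compute-bound stage: one pass over the full prompt builds the cache."""
    req.kv_cache = [tok * 2 for tok in req.prompt]  # fake per-token K/V entry
    return req

def decode_worker(req, max_new=4):
    """Memory-bandwidth-bound stage: token-by-token generation from the cache."""
    for _ in range(max_new):
        nxt = (sum(req.kv_cache) + len(req.output)) % 100  # fake next token
        req.output.append(nxt)
        req.kv_cache.append(nxt * 2)  # decode extends the cache as it goes
    return req

# The queue is the handoff point; each side scales independently,
# e.g. one prefill GPU feeding several decode GPUs.
handoff = Queue()
handoff.put(prefill_worker(Request(prompt=[1, 2, 3])))
print(decode_worker(handoff.get()).output)
```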
- AI software tools
- vLLM arch: architecture of the leading LLM serving engine.
Insights:
- MiniMax M2 uses full attention: why is full attention better than masked attention?
Practical:
- CUDA compiler & PTX, with examples
- CUTLASS: a template library that makes writing CUDA kernels easier
- Matrix transpose using CUTLASS
- SGLang inference engine architecture
- FlexAttention using CuTE DSL
- MatMul using WGMMA, GEMM with pipelining in CUTLASS
- Scaling a model, by JAX (Google) (rating 7/10)
- Smol Training Playbook by HuggingFace on training LLMs
- GPU Gems 3: if you want to dive deep into GPU programming
- (blog) OpenVINO optimizations and engineering by Intel
- (blog) Engineering posts by Colfax Research
- (blog) GPU MODE lecture notes
- (blog) Connectionism, the Thinking Machines blog: AI startup founded by Mira Murati, former CTO of OpenAI; known for tackling the nondeterminism problem in LLM inference
- Llama visualization: step-by-step view of each tensor as it is processed in Llama
Get started with these:
- SGLang: LLM serving engine originally from UC Berkeley.
- vLLM: LLM inference engine originally from UC Berkeley.
- PyTorch: Leading AI framework by Meta
- TensorFlow: AI framework by Google
- TensorRT: High performance inference library by NVIDIA
- TensorRT-LLM: LLM inference library by NVIDIA
- NCCL: High performance GPU communication library by NVIDIA
- See other NVIDIA libraries.