AI/ML resources to master state-of-the-art (SOTA) techniques, from engineers and researchers 🧠💻
Contents:
- End-to-end free guides to follow
- Interesting papers you MUST read
- Main AI blogs to read regularly (continuous learning)
- Deep dive into all core AI concepts [Learn step-by-step]
- MAYBE: guides you may optionally go through
- Want to contribute to leading AI open-source projects?
MUST:
- CS229: 20 videos on ML basics by Andrew Ng, Stanford University (rating 10/10)
- AI Engineering: book by Chip Huyen covering all major concepts in modern AI and AI engineering; a must-have reference (rating 10/10)
- CSE223: ML systems course by Prof. Hao Zhang, UC San Diego; core LLM serving engineering concepts (rating 10/10)
- The Ultra-Scale Playbook: by HuggingFace on Training LLMs on GPU Clusters (rating 8.5/10)
- AI and Memory Wall: why memory, not compute, is the main bottleneck for LLMs
- Collective Communication for 100k+ GPUs by Meta
- The Landscape of GPU-Centric Communication
- Pre-training under infinite compute by Stanford University
- Give Me BF16 or Give Me Death? by Red Hat, and Give Me FP32 or Give Me Death?
Others:
- LLMs don't just memorize, they build a geometric map that helps them reason, by Google
- Self-Adapting Language Models by MIT
- NVIDIA Developer Blog: Deep dive into multiple AI topics.
- TensorRT LLM tech blogs: deep dives into techniques and optimizations in one of the leading LLM inference libraries (13 posts as of now)
- SGLang tech blog: SGLang is one of the leading LLM serving frameworks; most posts center on SGLang itself but are rich in technical detail
- AI System co-design at Meta
YouTube channels to follow regularly:
- vLLM office hours: Deep dive into various technical topics in vLLM
- GPU Mode: Deep dive into various LLM topics from guests from the AI community
- PyTorch channel: videos from various PyTorch events, covering keynotes on technical topics like torch.compile
Only high-quality resources are listed: one good post per topic should be enough, no need to read hundreds.
- GPU architecture
Current SOTA AI/LLM workloads are possible only because of GPUs. Understanding GPU architecture gives you an engineering edge.
- Understanding GPU architecture through MatMul; building intuition about GPUs
- GPU shared memory banks / microbenchmarks
- GPU programming concepts
- CUDA programming model and GPU memory management: Mark Harris's GTC talk on coalesced memory access; prefix sum (scan) on the GPU (see the sketch after this list)
- Programming Massively Parallel Processors series on YT
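A minimal NumPy sketch of the work-efficient (Blelloch) exclusive scan covered above; it is a sequential simulation under simplifying assumptions (power-of-two length, int64 data), where each while-loop level stands in for one parallel GPU step and the strided slice updates model the threads that would run concurrently.

```python
import numpy as np

def blelloch_exclusive_scan(x):
    """Work-efficient exclusive prefix sum (Blelloch scan).

    Each while-loop level is one parallel step on a GPU; the strided
    slice updates stand in for threads running concurrently.
    Assumes len(x) is a power of two to keep the sketch short.
    """
    a = x.astype(np.int64).copy()
    n = len(a)
    # Up-sweep (reduce): build partial sums up a binary tree.
    step = 1
    while step < n:
        a[2 * step - 1::2 * step] += a[step - 1::2 * step]
        step *= 2
    # Down-sweep: push prefixes back down the tree.
    a[n - 1] = 0
    step = n // 2
    while step >= 1:
        left = a[step - 1::2 * step].copy()
        a[step - 1::2 * step] = a[2 * step - 1::2 * step]
        a[2 * step - 1::2 * step] += left
        step //= 2
    return a

print(blelloch_exclusive_scan(np.arange(8)))  # [ 0  0  1  3  6 10 15 21]
```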
- Performance
- Performance metrics via nccl-tests; profiling guide for Nsight; understanding DL performance (see the roofline sketch below)
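A quick roofline-style sanity check to go with the performance material above: compute a GEMM's arithmetic intensity and compare it to the GPU's compute-to-bandwidth ratio. The peak numbers are assumptions (roughly A100-class); substitute your own card's specs.

```python
# Roofline-style sanity check: is a GEMM compute- or memory-bound?
PEAK_FLOPS = 312e12   # assumed bf16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12      # assumed HBM bandwidth, bytes/s

def arithmetic_intensity(m, n, k, bytes_per_el=2):
    """FLOPs per byte of HBM traffic for C[m,n] = A[m,k] @ B[k,n]."""
    flops = 2 * m * n * k                              # one multiply-add = 2 FLOPs
    traffic = bytes_per_el * (m * k + k * n + m * n)   # read A and B, write C
    return flops / traffic

ridge = PEAK_FLOPS / PEAK_BW  # intensity needed to saturate compute
for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:   # training GEMM vs. decode GEMV
    ai = arithmetic_intensity(*shape)
    verdict = "compute-bound" if ai > ridge else "memory-bound"
    print(f"{shape}: {ai:.1f} FLOP/B (ridge {ridge:.0f}) -> {verdict}")
```

The decode-shaped (1, 4096, 4096) case lands far below the ridge point, which is the "memory wall" argument made in the paper listed earlier.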
- Transformer
- CME 295: Transformers and LLMs basics course by Stanford University
- Transformer overview: encoder-only and decoder-only models
- BERT (insightful): BERT as a single text-diffusion step
- Memory requirements for LLMs, with four parts: activations, parameters, gradients, and optimizer states. How do LLMs handle memory? (see the estimator sketch below)
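A back-of-the-envelope sketch of those four components for unsharded mixed-precision Adam training. The 16-bytes-per-parameter layout (bf16 weights and gradients, fp32 master weights plus two Adam states) is one common convention, not a universal rule; activations are left out because they depend on batch size, sequence length, and checkpointing.

```python
def training_memory_gb(n_params, bytes_weight=2, bytes_grad=2, bytes_optim=12):
    """Rough per-GPU memory for mixed-precision Adam training, no sharding.

    Assumed layout (one common convention, not the only one):
      - bf16 weights (2 B) and bf16 gradients (2 B)
      - fp32 optimizer states: master weights + momentum + variance (12 B)
    Activations are excluded on purpose: they scale with batch size,
    sequence length, and checkpointing strategy, not just n_params.
    """
    return n_params * (bytes_weight + bytes_grad + bytes_optim) / 1e9

for n in (7e9, 70e9):
    print(f"{n/1e9:.0f}B params -> ~{training_memory_gb(n):,.0f} GB before activations")
# 7B -> ~112 GB, 70B -> ~1,120 GB: this is why ZeRO/FSDP-style sharding exists.
```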
- Attention
- Self-attention / Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), MLA (used in DeepSeek)
- FlashAttention (v1 paper, v2 paper, v3 paper), online softmax (see the sketch after this list), implementation by Tri Dao
- Ring Attention (related to Context Parallelism, CP): handles long sequence lengths; FlexAttention by PyTorch
- KV cache, FP8 KV cache, PagedAttention
- Data Parallel (DP) Attention
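The online-softmax trick linked above is the numerical heart of FlashAttention; here is a minimal NumPy sketch of the streaming pass. In the full kernel the same rescaling is also applied to the running weighted sum of V; only the softmax itself is shown here.

```python
import numpy as np

def online_softmax(scores):
    """Streaming softmax: the numerical trick behind FlashAttention.

    Maintains a running max `m` and running denominator `d`, rescaling
    `d` whenever a larger score appears, so the row can be processed
    block by block without ever materializing all the scores at once.
    """
    m, d = -np.inf, 0.0
    for s in scores:
        m_new = max(m, s)
        d = d * np.exp(m - m_new) + np.exp(s - m_new)
        m = m_new
    return np.exp(scores - m) / d

x = np.array([1.0, 3.0, 2.0, 5.0])
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()  # standard two-pass softmax
assert np.allclose(online_softmax(x), ref)
```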
- Core operations
- GEMM / MatMul, API of GEMM, GEMM as the core of AI, W4A8 GEMM kernel
- MoE (Mixture of experts)
- Embeddings (deep dive), RoPE (paper; see the sketch below)
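A minimal NumPy sketch of RoPE to go with the paper above, using the split-halves pair layout found in Llama-style implementations (the original paper interleaves adjacent dimensions); shapes and names are illustrative.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding on a (seq_len, head_dim) array.

    Splits head_dim into two halves and rotates each (x1_i, x2_i) pair
    by angle position * theta_i, so q.k dot products depend only on
    relative position.
    """
    seq_len, dim = x.shape
    half = dim // 2
    theta = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.outer(np.arange(seq_len), theta)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(16, 64)
print(rope(q).shape)  # (16, 64); each row's norm is preserved by the rotation
```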
- Quantization
- Quantization basics
- Different data type simulations
- INT8 quantization using QAT, LLM quantization with PTQ, FP8 datatype, AWQ
- Per-tensor and per-block scaling (see the sketch after this list)
- NVFP4 training, Optimizing FP4 Mixed-Precision Inference on AMD GPUs, Recent LLM quantization progress
- Quantization on CPU (GGUF, AWQ, GPTQ); the GGUF quantization method; GPTQ: post-training quantization for LLMs; OBQ: post-training quantization and pruning
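A small NumPy sketch contrasting per-tensor and per-block absmax INT8 scaling, as referenced above: a single outlier forces a coarse scale onto the whole tensor, while per-block scaling confines the damage to one block. The block size and injected outlier are arbitrary illustrations.

```python
import numpy as np

def quantize_per_tensor(w):
    """Symmetric INT8 quantization with one absmax scale for the whole tensor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_per_block(w, block=64):
    """Per-block variant: one scale per `block` values, so one outlier
    only degrades its own block instead of the whole tensor."""
    wb = w.reshape(-1, block)
    scales = np.abs(wb).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(wb / scales), -127, 127).astype(np.int8)
    return q, scales

w = np.random.randn(4096).astype(np.float32)
w[0] = 20.0  # inject a single outlier weight
q_t, s_t = quantize_per_tensor(w)
q_b, s_b = quantize_per_block(w)
err_t = np.abs(q_t.astype(np.float32) * s_t - w).mean()
err_b = np.abs((q_b.astype(np.float32) * s_b).ravel() - w).mean()
print(f"mean abs error: per-tensor {err_t:.4f} vs per-block {err_b:.4f}")
```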
- Mixed precision training (see the autocast sketch after this list)
- NVFP4, GEMV kernel for NVFP4
- Details on FP8 training
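A minimal mixed-precision training step using PyTorch's torch.autocast, assuming a CUDA GPU with bf16 support; the model size, learning rate, and toy loss are placeholders. Because bf16 shares fp32's exponent range, no loss scaler is needed (fp16 would also want torch.cuda.amp.GradScaler).

```python
import torch

# Minimal sketch: one mixed-precision training step with autocast.
model = torch.nn.Linear(1024, 1024).cuda()            # weights stay fp32
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)  # optimizer states in fp32

x = torch.randn(32, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)                    # matmul runs in bf16
    loss = out.float().pow(2).mean()  # reduce in fp32 for stability
loss.backward()                       # gradients land in the params' dtype (fp32)
opt.step()
opt.zero_grad(set_to_none=True)
```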
- Post-training
- Post-training concepts: SFT, RLHF, RLVR (see the SFT loss sketch after this list)
- Smol Training Playbook: The Secrets to Building World-Class LLMs
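A tiny sketch of the SFT objective mentioned above: next-token cross-entropy computed only on completion tokens, with prompt positions masked via the ignore index. The tensors are random stand-ins for a real model and tokenizer.

```python
import torch
import torch.nn.functional as F

# Minimal SFT loss: predict token t+1 from position t, but only score
# completion tokens; prompt positions are masked with the ignore index.
vocab, seq_len = 1000, 8
logits = torch.randn(1, seq_len, vocab)          # stand-in for model output
tokens = torch.randint(0, vocab, (1, seq_len))   # prompt + completion ids
prompt_len = 3

labels = tokens.clone()
labels[:, :prompt_len] = -100                    # ignore prompt positions
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),           # shift: position t ...
    labels[:, 1:].reshape(-1),                   # ... predicts token t+1
    ignore_index=-100,
)
print(loss.item())
```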
- Optimizations
- LLM inference optimizations; optimizations part 2
- 5D parallelism (DP, TP, PP, CP, EP, plus SP): parallelism concepts for scaling LLMs; parallelism in PyTorch
- Chunked prefill (SARATHI paper); dynamic and continuous batching
- KV cache offloading, KV cache early reuse
- Speculative decoding (basic introduction); look-ahead reasoning; papers from Google and DeepMind (see the sketch after this list)
- MoE using wide Expert Parallelism (EP)
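A toy sketch of the speculative-decoding loop referenced above, with plain callables standing in for the draft and target models. It uses the simplest greedy accept rule; the Google and DeepMind papers use a rejection-sampling rule that provably preserves the target model's output distribution.

```python
def speculative_decode(target, draft, prompt, k=4, max_new=8):
    """Toy greedy speculative decoding loop.

    `draft` and `target` are callables mapping a token sequence to the
    argmax next token (stand-ins for a small and a large model). The
    draft proposes k tokens; the target checks all k+1 positions in
    what would be a single batched forward pass; the longest agreeing
    prefix is accepted, plus one token from the target for free.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. Target "scores" every prefix at once (one parallel pass).
        verified = [target(seq + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the agreeing prefix, then the target's next token.
        n = 0
        while n < k and proposal[n] == verified[n]:
            n += 1
        seq += proposal[:n] + [verified[n]]
    return seq

# Toy "models": next token = last + 1; the draft is wrong at every 4th step.
target = lambda s: s[-1] + 1
draft = lambda s: s[-1] + (2 if len(s) % 4 == 0 else 1)
print(speculative_decode(target, draft, [0]))  # [0, 1, 2, ..., 8]
```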
- Scheduling / Routing
- P/D disaggregation; the DistServe P/D disaggregation paper (see the sketch after this list)
- KVCache-centric disaggregated architecture by MooncakeAI
- OverFill: Two-Stage Models for Efficient Language Model Decoding by Cornell University
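A conceptual sketch of P/D disaggregation, heavily simplified: a compute-bound prefill worker builds the KV cache and hands it through a queue to a memory-bandwidth-bound decode worker. All names and the fake "model" arithmetic are illustrative; real systems such as DistServe and Mooncake move KV tensors between separate GPU pools over NVLink/RDMA.

```python
from dataclasses import dataclass, field
from queue import Queue

@dataclass
class Request:
    prompt: list
    kv_cache: list = field(default_factory=list)  # stand-in for K/V tensors
    output: list = field(default_factory=list)

def prefill_worker(req):
    """Compute-bound stage: one pass over the full prompt builds the cache."""
    req.kv_cache = [tok * 2 for tok in req.prompt]  # fake per-token K/V entry
    return req

def decode_worker(req, max_new=4):
    """Memory-bandwidth-bound stage: token-by-token generation from the cache."""
    for _ in range(max_new):
        nxt = (sum(req.kv_cache) + len(req.output)) % 100  # fake next token
        req.output.append(nxt)
        req.kv_cache.append(nxt * 2)  # decode extends the cache as it goes
    return req

# The queue is the handoff point; each side scales independently,
# e.g. one prefill GPU feeding several decode GPUs.
handoff = Queue()
handoff.put(prefill_worker(Request(prompt=[1, 2, 3])))
print(decode_worker(handoff.get()).output)
```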
- AI software tools
- vLLM arch: architecture of the leading LLM serving engine.
Insights:
- MiniMax M2 uses full attention: why is full attention better than masked attention?
Practical:
- CUDA compiler & PTX, with examples
- CUTLASS: a template library that makes writing CUDA kernels easier
- Matrix transpose using CUTLASS
- SGLang inference engine architecture
- FlexAttention using CuTE DSL
- MatMul using WGMMA, GEMM with pipelining in CUTLASS
- Scaling a model, by JAX (Google) (rating 7/10)
- Smol Training Playbook by HuggingFace on training LLMs
- GPU Gems 3: if you want to dive deep into GPU programming
- (blog) OpenVINO optimizations and engineering by Intel
- (blog) Engineering posts by Colfax Research
- (blog) GPU MODE lecture notes
- (blog) Connectionism, the Thinking Machines blog: AI startup founded by Mira Murati, former CTO of OpenAI; known for tackling the nondeterminism problem in LLM inference
- Llama visualization: step-by-step view of each tensor as it is processed in Llama
Get started with these:
- SGLang: LLM serving engine originally from UC Berkeley.
- vLLM: LLM inference engine originally from UC Berkeley.
- PyTorch: Leading AI framework by Meta
- TensorFlow: AI framework by Google
- TensorRT: High performance inference library by NVIDIA
- TensorRT-LLM: LLM inference library by NVIDIA
- NCCL: High performance GPU communication library by NVIDIA
- See other NVIDIA libraries.