This summary includes three parts:
- some repositories that you can follow
- some representative researchers or labs that you can follow
- some important works in different research directions
For example, LLMSys-PaperList contains many excellent articles and keeps updating (which I believe is the most important thing for a paper list). Awesome-LLM-Inference and Awesome_LLM_Accelerate-PaperList are also worth reading.
Besides, awesome-AI-system also works very well, and you can find other repositories in its contents.
The blog "Large Transformer Model Inference Optimization" helped me a lot at the beginning.
The blog OpenAI Keynote on Building Scalable AI Infrastructure seems to be a leading guide.
Follow others' research, and find your own ideas.
It is not my intention to judge the work of these pioneers, and I understand that the limits of my knowledge will lead me to leave out many important people.
If you have a different opinion, please feel free to communicate with me through an issue.
In no particular order!!
Damn, I can't remember the names of foreigners.
Zhihao JIA: FlexFlow and other impressive works, important role in MLSys, affiliated with CMU
Tianqi CHEN: TVM, XGBoost, and other impressive works, important role in Machine Learning Systems and ML compilers, affiliated with CMU
Song HAN: many important works in efficient ML including sparsity and quantization. Btw, the class TinyML and Efficient Deep Learning Computing is highly recommended, affiliated with MIT
Zhen DONG: many important works in quantization and high-performance ML, affiliated with UCB
Tri DAO: author of FlashAttention, affiliated with Princeton
Ce ZHANG: famous in efficient MLSys, affiliated with UChicago
Ion Stoica: Alpa, Ray, Spark, et al.
SPCL: Scalable Parallel Computing Lab, affiliated with ETHz
Luo MAI: affiliated with University of Edinburgh
IPADS: focuses more on PURE systems, but also makes great progress in MLSys, affiliated with SJTU
EPCC: Emerging Parallel Computing Center, where parallel computing and MLSys are naturally combined, affiliated with SJTU
Xin JIN: FastServe and LLMCad are impressive work, affiliated with PKU
Bin CUI: important role in MLSys including DL, GNN, and MoE, affiliated with PKU
Jidong ZHAI: leading many important work in MLSys, affiliated with THU
Lingxiao MA: many important works in MLSys at top conferences, affiliated with MSRA
Cheng LI: high-performance systems and MLSys, affiliated with USTC
Xupeng MIAO: SpotServe, SpecInfer, HET, et al.
Chuan WU: some important works in distributed machine learning systems, affiliated with HKU
James CHENG: affiliated with CUHK
Kai CHEN: database works well with MLSys, affiliated with HKUST
Lei CHEN: database works well with MLSys, many papers so I recommend you to focus on his top-conference papers, affiliated with HKUST
Yang YOU: leader of Colossal-AI, affiliated with NUS
Wei WANG: work in System and MLSys, affiliated with HKUST
I hope to summarize these impressive works based on their research directions.
But my summary may not be informative enough, and I am looking forward to your additions.
Perhaps someone should write a detailed survey.
Periodically checking the "cited by" of the papers marked with ⭐ will be helpful.
Paragraphs marked with 💡 are not perfect yet.
- ⭐ Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models: evaluations help you find the bottleneck
- ⭐ Full Stack Optimization of Transformer Inference: a Survey: a survey by UCB
- ⭐ Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: worth a read
- ⭐ Deep Learning Workload Scheduling in GPU Datacenters: A Survey: survey for GPU Datacenters DL Workload Scheduling
- ⭐ Towards Efficient and Reliable LLM Serving: A Real-World Workload Study: a benchmark for LLM serving
- ⭐ LLM Inference Unveiled: Survey and Roofline Model Insights: both survey and analysis
- A Survey of Resource-efficient LLM and Multimodal Foundation Models: worth reading
- Training and Serving System of Foundation Models: A Comprehensive Survey
- Model Compression and Efficient Inference for Large Language Models: A Survey
- ⭐ Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models
- ⭐ A Survey on Efficient Inference for Large Language Models: worth reading
- Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
- ⭐ Navigating Challenges and Technical Debt in Large Language Models Deployment: important
- The CAP Principle for LLM Serving: another angle
- Demystifying Data Management for Large Language Models: talking about databases in LLM, by Xupeng MIAO, accepted by SIGMOD'24
- Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI: with code
- A Survey on Mixture of Experts
- Analyzing LLM performance: The impact of high-bandwidth memory on model inference: analysis of inference
- Inference Optimization of Foundation Models on AI Accelerators
- LLM Inference Serving: Survey of Recent Advances and Opportunities: newest
- Contemporary Model Compression on Large Language Models Inference: survey in model compression
- ⭐ Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning: brings insights for MLSys
- Resource-efficient Algorithms and Systems of Foundation Models: A Survey
- ⭐ A Survey on Inference Optimization Techniques for Mixture of Experts Models: a survey on MoE models
- Deploying Foundation Model Powered Agent Services: A Survey: survey for AI agent service
Making useful benchmarks or evaluations is helpful (a minimal timing sketch follows this list).
- MLPerf Inference Benchmark: inference github, a well-known benchmark
- llmperf: evaluate both performance and correctness, but based on ray
- The Importance of Workload Choice in Evaluating LLM Inference Systems: important angles in LLM inference systems
- Vidur: A Large-Scale Simulation Framework For LLM Inference: test the performance of LLM inference
- Metron: Holistic Performance Evaluation Framework for LLM Inference Systems: an evaluation framework
- LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale: a simulator
- LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators: inference + hardware
- Towards Efficient Large Multimodal Model Serving: a survey on multimodal serving, and a decoupled serving architecture that enables independent resource allocation and adaptive scaling for each stage
- LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference: a performance evaluation framework, can be used to estimate the time cost
- Predicting LLM Inference Latency: A Roofline-Driven ML Method: predict inference performance based on the Roofline model
- GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments: a work for predicting LLMSys performance
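As a starting point, here is a minimal sketch (my own, not from any benchmark above) of the metrics most of these works report: TTFT (time to first token), TPOT (time per output token), and throughput. `generate_stream` is a hypothetical placeholder for whatever streaming client your engine exposes.

```python
import time

def measure_request(generate_stream, prompt):
    """Measure TTFT, TPOT and throughput for one streaming request.

    `generate_stream(prompt)` is a hypothetical client that yields tokens
    one by one; swap in your engine's real streaming API.
    """
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0

    for _ in generate_stream(prompt):
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now          # prefill finished here
        n_tokens += 1

    end = time.perf_counter()
    ttft = first_token_time - start                         # time to first token
    tpot = (end - first_token_time) / max(n_tokens - 1, 1)  # time per output token
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "throughput_tok_per_s": n_tokens / (end - start),
    }

# toy usage with a fake engine that "generates" 8 tokens
if __name__ == "__main__":
    def fake_stream(prompt):
        for i in range(8):
            time.sleep(0.01)
            yield f"tok{i}"
    print(measure_request(fake_stream, "hello"))
```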
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, pdf
prior paper: Blockwise Parallel Decoding for Deep Autoregressive Models
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding: by lookahead decoding
Both frameworks use parallel decoding and deserve more detailed study (a toy verification sketch follows the list below).
There are some interesting papers about parallel decoding.
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
- ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
- APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding: how to make it auto-parallel?
In fact, I'm not so familiar with this topic. But perhaps OpenAI o1 used this...
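To make the idea concrete, here is a toy sketch (my own simplification, not code from any of the papers above) of the verification step that Medusa/lookahead-style parallel decoding relies on: guess several future tokens, score them with a single forward pass, and accept the longest prefix that matches the model's own greedy choices. `greedy_next_tokens` is a hypothetical batched model call.

```python
def verify_draft(greedy_next_tokens, context, draft):
    """Accept the longest prefix of `draft` that the base model itself would
    have produced greedily.

    `greedy_next_tokens(context, draft)` is a hypothetical single forward
    pass that returns, for each position i, the model's greedy next token
    given context + draft[:i] (one batched pass instead of len(draft)
    sequential decode steps).
    """
    model_choices = greedy_next_tokens(context, draft)  # len == len(draft)
    accepted = []
    for guess, choice in zip(draft, model_choices):
        if guess != choice:
            accepted.append(choice)   # keep the model's own token and stop
            break
        accepted.append(guess)
    return accepted

# toy usage: a "model" that always continues an arithmetic sequence
if __name__ == "__main__":
    def fake_model(context, draft):
        seq, out = list(context), []
        for g in draft:
            out.append(seq[-1] + 1)   # greedy choice: previous token + 1
            seq.append(g)             # then condition on the drafted token
        return out

    print(verify_draft(fake_model, context=[1, 2, 3], draft=[4, 5, 9, 10]))
    # -> [4, 5, 6]  (the wrong guess 9 is replaced by the model's own token 6)
```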
Spend more time on inference than on pre-training.
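As a concrete (toy) illustration of repeated sampling, here is a best-of-N sketch where `sample` and `score` are hypothetical stand-ins for an LLM call and a reward/verifier model; the papers below study how far this kind of test-time compute can be scaled.

```python
import random

def best_of_n(sample, score, prompt, n=16):
    """Repeated sampling (best-of-N): draw N candidate answers and keep the
    one a verifier/reward model likes best.

    `sample(prompt)` and `score(prompt, answer)` are hypothetical stand-ins
    for an LLM call and a reward/verifier model call.
    """
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

# toy usage: the "LLM" guesses numbers, the "verifier" prefers larger ones
if __name__ == "__main__":
    random.seed(0)
    answer = best_of_n(
        sample=lambda p: random.randint(0, 100),
        score=lambda p, a: a,
        prompt="pick a number",
        n=8,
    )
    print(answer)
```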
- ⭐ Large Language Monkeys: Scaling Inference Compute with Repeated Sampling: Starter material, apply repeated sampling
- ⭐ Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters: Starter material, scaling LLM Test-Time to improve accuracy
- Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation: seems fewer people have explored the efficiency of CoT; the two-stage method gives me some thoughts
- Fast Best-of-N Decoding via Speculative Rejection: optimize alignment in inference, accepted by NIPS'24
This topic is about GPT-o1, aka the strawberry.
- ⭐ Reverse engineering OpenAI’s o1: a leading blog for introduction in OpenAI’s o1
- ⭐ Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: base work
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models: an improvement based on CoT
- Large Language Model Guided Tree-of-Thought: also a ToT
- Let's Verify Step by Step: verify by step can be helpful
- Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models: what is Language Agent Tree Search (LATS)? accepted by ICML'24
- Critique-out-Loud Reward Models
- Generative Verifiers: Reward Modeling as Next-Token Prediction: a verifier, by DeepMind
Also known as speculative sampling, a form of model collaboration.
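Before the paper list, a minimal greedy-acceptance sketch of the draft-then-verify loop (my own simplification; the papers below use probabilistic rejection sampling to keep the target distribution exact, plus token trees, adaptive draft lengths, etc.). `draft_next` and `target_greedy` are hypothetical model calls.

```python
def speculative_decode(target_greedy, draft_next, prompt, k=4, max_new=32):
    """Greedy-acceptance sketch of draft-then-verify decoding.

    - `draft_next(tokens)`: cheap draft model, returns one next token.
    - `target_greedy(tokens, draft)`: one big-model forward pass that returns
      the target model's greedy next token at every drafted position plus one
      bonus position (len(draft) + 1 tokens in total).
    Both are hypothetical stand-ins; real systems verify with rejection
    sampling over probabilities, not greedy matching.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) draft k tokens autoregressively with the small model
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) verify them with a single target-model pass
        target = target_greedy(tokens, draft)       # length k + 1
        n_accept = 0
        for d, t in zip(draft, target):
            if d != t:
                break
            n_accept += 1
        # 3) keep accepted tokens plus one token from the target model
        tokens += draft[:n_accept] + [target[n_accept]]
    return tokens
```

The efficiency gain comes from step 2: verifying k drafted tokens costs roughly one target-model forward pass instead of k sequential ones.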
- ⭐ Accelerating Large Language Model Decoding with Speculative Sampling: opening of Speculative Decoding, by DeepMind
- ⭐ Fast inference from transformers via speculative decoding: work from a similar period as the one above, by Google, accepted by ICML'23
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification: paper under guidance of Zhihao JIA, use Tree decoding and a set of draft models
- LLMCad: Fast and Scalable On-device Large Language Model Inference: paper under guidance of Xin JIN, speculative decoding for on-device LLM inference based on tree decoding and other optimizations
- Speculative Decoding with Big Little Decoder: similar to speculative decoding, accepted in NIPS'23
- Online Speculative Decoding: update draft model online
- Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding: the trade-off analysis deserves a read
- The Synergy of Speculative Decoding and Batching in Serving Large Language Models: analysis of combining spec decoding with batching
- REST: Retrieval-Based Speculative Decoding: use retrieval for spec decoding, some familiar names in the authors list
- Cascade Speculative Drafting for Even Faster LLM Inference: by UIUC
- Multi-Candidate Speculative Decoding: multiple draft models
- ⭐ Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding: survey for Speculative Decoding
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding: a work with Yang YOU's name
- Decoding Speculative Decoding: provide some insight into the selection of draft models
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting: perhaps tree speculative decoding?
- ⭐ Speculative Streaming: Fast LLM Inference without Auxiliary Models: a promising method for speculative decoding
- Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding: accelerating spec decoding
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens: accelerate spec decoding by fusing all tokens
- Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding: using several SSMs, adaptive SSM prediction length, pipelining SSM decode and LLM verify
- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
- Optimal Block-Level Draft Verification for Accelerating Speculative Decoding
- Accelerating LLM Inference with Staged Speculative Decoding: token tree and a second stage of speculative decoding
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding: combine KV cache with spec decoding
- EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models: algorithm optimization in spec decoding
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices: any difference with specinfer?
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput: model the speculative decoding length
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding: spec decoding for long-context
- QSpec: Speculative Decoding with Complementary Quantization Schemes: spec decoding with quantization, a novel A+B
- Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement: optimization on Medusa
- The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation: use learning-free, negligible-cost draft strategies, namely N-grams obtained from the model weights and the context
- EdgeLLM: Fast On-device LLM Inference with Speculative Decoding: seems an extended work of LLMCad
- AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding: a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding: use both LLM and SLM
- Adaptive Skeleton Graph Decoding: successor of Skeleton-of-Thought
Some knowledge about data parallelism, tensor parallelism, and pipeline parallelism will help in this track.
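As a tiny refresher, here is a numpy sketch of a Megatron-style column-parallel linear layer simulated on one host: the weight's output columns are split across "devices", each shard computes its part, and the results are concatenated (an all-gather in a real system). This is my own illustration, not code from any paper below.

```python
import numpy as np

def column_parallel_linear(x, w, n_shards=2):
    """Column-parallel linear layer, simulated with list entries as "devices".

    The weight's output dimension is split across `n_shards`; each shard
    computes x @ w_shard independently and the results are concatenated
    (in a real system the concat is an all-gather over NCCL).
    """
    shards = np.split(w, n_shards, axis=1)          # split output columns
    partial = [x @ s for s in shards]               # per-device matmul
    return np.concatenate(partial, axis=-1)         # all-gather equivalent

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))
    w = rng.standard_normal((8, 16))
    assert np.allclose(column_parallel_linear(x, w), x @ w)
    print("sharded result matches the single-device matmul")
```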
- ⭐ Efficiently Scaling Transformer Inference: use model parallelism to accelerate inference, by Google, in MLSys'23
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment: a distributed inference engine that supports asymmetric partitioning of the inference computation
- InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding: Efficient Long-sequence training
- Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference: accepted by PPoPP'24
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs: full-stack approach of LLM training
- DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers: sequence parallel by Yang YOU
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism: Elastic Sequence Parallelism?
- GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism: this could be potential in inference
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models: pipeline parallelism
- QUART: Latency-Aware FaaS System for Pipelining Large Model Inference: pipeline in serving and fast expanding
- Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations: optimize sequence parallel
- CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts: optimize sequence parallel
- ⭐ PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation: pipeline parallelism and speculation, accepted by SC'24
- HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment: algorithm analysis for resource allocation, parallel strategy, and KV transfer in disaggregated LLM systems
- ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput: explores design spaces to suggest architectures that meet the requirements of both vendors and users
- Seesaw: High-throughput LLM Inference via Model Re-sharding: dynamic model re-sharding, facilitates the dynamic reconfiguration of parallelization strategies across prefill-decode stages, accepted by MLSYS'25
- PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training: fill the bubbles with other GPU workload
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models: overlap comm with comp, similar to Liger
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning: accepted by ASPLOS'24
- T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives: many work about overlap in LLM, accepted by ASPLOS'24
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion: Fine-grained decomposition, perhaps provide some experiment result
- Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference: modify the model design for fast decoding, based on comm-comp overlapping
- NanoFlow: Towards Optimal Large Language Model Serving Throughput: overlapping based on nano-batches, with some interesting engineering implementation
- Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping: overlapping, provided by Deepspeed team
- PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving: overlap communication with model-weights/KV-cache prefetch
- Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning: use compilation to schedule overlap, accepted by ASPLOS'25
An enduring topic in efficient machine learning.
We mainly focus on semi-structured and structured pruning because they can accelerate computing.
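For intuition, a toy numpy sketch of 2:4 (N:M) semi-structured pruning, the pattern that NVIDIA sparse tensor cores can accelerate: in every group of 4 weights, keep the 2 with the largest magnitude. This is my own illustration; the works below also care about how to recover accuracy after pruning.

```python
import numpy as np

def prune_2_to_4(w):
    """Zero out the 2 smallest-magnitude weights in every group of 4
    (along the last dimension), i.e. the 2:4 pattern accelerated by
    sparse tensor cores. Assumes the last dim is divisible by 4.
    """
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest |w| in each group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

if __name__ == "__main__":
    w = np.array([[0.1, -2.0, 0.3, 4.0, -0.2, 0.05, 1.5, -3.0]])
    print(prune_2_to_4(w))
    # every consecutive group of 4 keeps only its 2 largest-magnitude entries
```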
- ⭐ Accelerating Sparse Deep Neural Networks: use N:M sparsity to fully utilize the hardware for acceleration, by Nvidia
- ⭐ Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time: interesting paper in using sparsity, under the guidance of Tri DAO and Ce ZHANG, accepted in ICML'23
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
- Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism: accepted by PPoPP'23
- ⭐ PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation: a novel way to deal with dynamic sparsity, may be used for GNN and MoE, accepted by SOSP'23
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving: seems a follow-up work of Deja Vu, also focuses on KV-Cache
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference: sparsity in FFN
- ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models: a simple and effective sparsification method named "ProSparse"
- Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters: work for PowerInfer
- Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations: pruning for LLM
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention: inference framework based on sparse attention, by Microsoft
- ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models: use ReLU to improve sparsity, just like PowerInfer
- CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation: algorithm optimization that can utilize sparsity to accelerate inference
- Star Attention: Efficient LLM Inference over Long Sequences: a two-phase block-sparse approximation
- Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries: use sparse coding over universal dictionaries to compress KV cache, which is novel
- SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters: an algorithm that replaces a layer with the previous adjacent layer plus recovery parameters (based on finetuning), to decrease memory overhead
- Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking: accepted by MLSYS'25
Low-precision for memory and computing efficiency.
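A minimal sketch of symmetric per-row absmax INT8 weight quantization with a reference dequantize-then-matmul (my own illustration; the papers below add outlier handling, activation/KV-cache quantization, group-wise scales, and fused kernels).

```python
import numpy as np

def quantize_rows_int8(w):
    """Symmetric per-row absmax quantization: w ≈ q * scale, q in int8."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequant_matmul(x, q, scale):
    """Reference (slow) dequantize-then-matmul; real kernels fuse this."""
    return x @ (q.astype(np.float32) * scale).T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((16, 64)).astype(np.float32)   # [out, in]
    x = rng.standard_normal((2, 64)).astype(np.float32)
    q, s = quantize_rows_int8(w)
    err = np.abs(x @ w.T - dequant_matmul(x, q, s)).max()
    print(f"int8 weight memory: {q.nbytes} bytes, max abs error: {err:.4f}")
```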
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization
- ⭐ LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: by UW
- ⭐ SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models: paper under guidance of Song HAN
- ⭐ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration: paper under guidance of Song HAN
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving: paper under guidance of Tianqi CHEN; quantization itself is not important, designing how to quantize is important; in review for MLSys'24
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
- Understanding the Impact of Post-Training Quantization on Large Language Models: tech report will help
- ⭐ LLM-FP4: 4-Bit Floating-Point Quantized Transformers: by HKUST, accepted in EMNLP'23
- ⭐ Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization: by SJTU, accepted in DAC'24
- INT4 Weight + FP8 KV-Cache: optimization for LLM inference: INT4 weight + FP8 KV-Cache + continuous batching
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization: quant KV cache
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference: simple and crude optimization work
- LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization: for heterogeneous clusters and adaptive quantization, under guidance of Chuan WU, accepted by PPoPP'24 (poster)
- IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact: use pivot token
- QAQ: Quality Adaptive Quantization for LLM KV Cache
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving: quantization in inference, under guidance of Song HAN
- Does compressing activations help model parallel training?: analysis of compression (including pruning and quantization) in MP training, accepted by MLSys'24
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression: compress KV cache with quantization
- Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs: with targeted activate function
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design: FPx quantization, accepted by ATC'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: combine quantization with MoE
- PQCache: Product Quantization-based KVCache for Long Context LLM Inference: apply quantization and Maximum Inner-Product Search for KV Cache compression
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs: provide efficient kernels for lookup quantization
- Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation: a computation optimization for Low-Precision
- Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs: a computation optimization for 6-bit LLM
- Mixture of Experts with Mixture of Precisions for Tuning Quality of Service: quantization on MoE models
- Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference: compress the KV Cache
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models: quantization matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents
- Progressive Mixed-Precision Decoding for Efficient LLM Inference: gradual lowering of precision deeper in the generated sequence, together with a spectrum of precision-switching schedulers
- COMET: Towards Practical W4A4KV4 LLMs Serving: provides a quantization algorithm, quantization kernels, and an SM schedule method
- MixQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction: quantization with outliers, optimization on AWQ, accepted by SC'24
- Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference: low-bit compression to accelerate communication
- Unifying KV Cache Compression for Large Language Models with LeanKV: combine quantization and sparsity to compress KV cache
- MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design: mix quantization, effectively assigning the larger bit-width to output features that need it most to achieve good accuracy with low memory consumption
- KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference: KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference
- HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference: quantization to decrease kvc transfer overhead in disaggregation and eliminate kv dequantization
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models: Mixed-precision Auto-Regressive LINear kernels, accepted by PPoPP'25
- MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators: augments highly quantized MoEs with a mixture of low-rank compensators, provide 3-bit tensorcore kernels, accepted by MLSYS'25
Perhaps the most important way of improving throughput in LLM inference.
This blog Dissecting Batching Effects in GPT Inference helped me a lot at the beginning.
Update 2023/12/12: I'd like to use Continuous Batching in place of the Dynamic Batching I used before. The name Dynamic Batching is more likely to be used in Triton.
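A toy sketch of iteration-level (continuous) batching: finished requests leave the batch and waiting requests join at every decode step, instead of waiting for a whole static batch to finish. `step` is a hypothetical engine call; real schedulers (ORCA, vLLM, Sarathi) also handle prefill, memory limits, and preemption.

```python
from collections import deque

def continuous_batching(step, waiting, max_batch=8):
    """Iteration-level scheduling sketch.

    `waiting` is a queue of requests; each request is a dict with a `tokens`
    list and a `done` flag. `step(batch)` is a hypothetical engine call that
    appends one new token to every request in the batch and may set `done`.
    """
    waiting = deque(waiting)
    running, finished = [], []
    while running or waiting:
        # admit new requests at *every* iteration, not once per full batch
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        step(running)                          # one decode iteration
        still_running = []
        for req in running:
            (finished if req["done"] else still_running).append(req)
        running = still_running
    return finished

# toy usage: each request needs a different number of decode steps
if __name__ == "__main__":
    def fake_step(batch):
        for r in batch:
            r["tokens"].append(len(r["tokens"]))
            r["done"] = len(r["tokens"]) >= r["target_len"]

    reqs = [{"tokens": [], "done": False, "target_len": n} for n in (2, 5, 3)]
    done = continuous_batching(fake_step, reqs, max_batch=2)
    print([len(r["tokens"]) for r in done])   # -> [2, 5, 3], in completion order
```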
- ⭐ Orca: A Distributed Serving System for Transformer-Based Generative Models: continuous batch processing without redundant computing, accepted in OSDI'22
- Fast Distributed Inference Serving for Large Language Models: considering Job Completion Time(JCT) in LLM serving, paper under guidance of Xin JIN
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline: schedule based on response length prediction by LLM, paper under guidance of Yang YOU
- S3: Increasing GPU Utilization during Generative Inference for Higher Throughput: idea similar to above, by Harvard University
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills: chunking the prefill phase to reduce pipeline bubbles, by MSR India
- Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference: accepted by HiPC'23
- Handling heavy-tailed input of transformer inference on GPUs: accepted by ICS'22
- CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system: Some form of inference service
- TCB: Accelerating Transformer Inference Services with Request Concatenation: perhaps similar to ByteTransformer, accepted by ICPP'22
- Fairness in Serving Large Language Models: under guidance of Ion Stoica, accepted by OSDI'24
- Characterizing and understanding deep neural network batching systems on GPUs: benchmarking is important
- Hydragen: High-Throughput LLM Inference with Shared Prefixes
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts: think about the memory access of KV cache
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve: follow-up work of sarathi
- Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction: predict length
- LiveMind: Low-latency Large Language Models with Simultaneous Inference: perform inferences with incomplete prompts, to take advantage of streaming prompt
- A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length: theoretical analysis of latency
- ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models: seems similar to ORCA or bytetransformer?
- BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching: optimization on ORCA, dynamic re-batching
- EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving: A fusion monster with a variety of optimization techniques
- ⭐ AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality: what's Redundancy
This part includes some impressive works that optimize LLM computation by observing the underlying computing properties, such as FlashAttention.
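To see why IO-aware attention helps, here is a pure-numpy sketch of the online (streaming) softmax underlying FlashAttention-style kernels: K/V are processed block by block with a running max and running sum, so the full attention matrix is never materialized. This is a reference illustration of the trick, not a fast kernel.

```python
import numpy as np

def online_softmax_attention(q, K, V, block=64):
    """Single-query attention computed over K/V blocks with a running
    (max, denominator, weighted-V) state. Pure numpy reference."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running softmax denominator
    acc = np.zeros_like(V[0], dtype=np.float64)   # running weighted-V numerator
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = (k_blk @ q) / np.sqrt(d)              # scores for this block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)            # rescale the old state
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal(64)
    K, V = rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
    s = (K @ q) / np.sqrt(64)
    ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
    assert np.allclose(online_softmax_attention(q, K, V), ref)
    print("streaming softmax matches the reference attention output")
```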
- ⭐ FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: one of the most important work these years, both simple and easy to use, by Tri DAO
- ⭐ FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning: you'd better not ignore it
- ⭐ Flash-Decoding for long-context inference: you'd better not ignore it, too
- ⭐ Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity: successor to FlashAttention in inference, accepted by VLDB'24
- ⭐ FlashDecoding++: Faster Large Language Model Inference on GPUs: worth reading, FLashDecoding follow-up
- SubGen: Token Generation in Sublinear Time and Memory
- DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers: modification in self-attention
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- Flex Attention: A Programming Model for Generating Optimized Attention Kernels: auto-generated attention kernel
- Splitwise: Efficient generative LLM inference using phase splitting: splitting prefill and decode in a map-reduce style, by UW and Microsoft
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving: also split the prefill and decode, accepted by OSDI'24
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads: seems a combination of SARATHI and Splitwise
- ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference: similar to splitwise, accepted by ASPLOS'24
- Splitwiser: Efficient LLM Inference with Constrained Resources
- ToEx: Accelerating Generation Stage of Transformer-based Language Models via Token-adaptive Early Exit: Token-adaptive Early Exit
- Automatic Task Parallelization of Dataflow Graphs in ML/DL models
- MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures: compilation optimization on the computation graph
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference: optimize attention kernel in mix-batching
- Focus: High-Performant and Customizable Attention Engine for LLM Serving: flexible attention engine, advised by Tianqi CHEN and accepted by MLSYS'25
This part is inspired by PagedAttention of vLLM. There are many top-conference papers discussing memory management for DL computing on GPUs.
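A toy sketch of the block-table bookkeeping behind paged KV caches: blocks come from a shared pool and each sequence holds a list of block IDs, so memory is allocated on demand instead of being reserved for the maximum length. Only the bookkeeping is shown; storing and indexing the actual K/V tensors is the kernel's job.

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of vLLM's PagedAttention."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # shared physical pool
        self.block_tables = {}                         # seq_id -> [block ids]
        self.lengths = {}                              # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token of `seq_id`, allocating a new
        block only when the last block is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:              # last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted; preempt or swap")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

if __name__ == "__main__":
    kv = PagedKVCache(num_blocks=4, block_size=4)
    for _ in range(6):
        kv.append_token("req0")                        # 6 tokens -> 2 blocks
    print(kv.block_tables["req0"], len(kv.free_blocks))  # 2 used, 2 free
    kv.free("req0")
    print(len(kv.free_blocks))                         # all 4 back in the pool
```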
- ⭐ Efficient Memory Management for Large Language Model Serving with PagedAttention: memory page management for the KV-Cache in Attention-type models, accepted by SOSP'23 (many papers cite the vLLM project instead of this paper, which makes it harder to track its "cited by")
- ⭐ AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs: cache management for inference, accepted by MLSys'23
- Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs: block-based data layout, accepted by TACO'October-2023
- AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems: a unique observation that there is rich similarity in attention computation across inference sequences
- BPIPE: memory-balanced pipeline parallelism for training large language models: memory balance perhaps can work well in inference, by SNU, accepted by ICML'23
- Improving Large Language Model Throughput with Efficient LongTerm Memory Management: perhaps a new view
- CacheGen: Fast Context Loading for Language Model Applications
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models: considers the memory consumption in fine-tuning
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference: compress KV Cache
- LLM as a System Service on Mobile Devices: LLM as a service on mobile devices
- DistMind: Efficient Resource Disaggregation for Deep Learning Workloads: by Xin JIN, accepted by ToN'Jan24
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching: sparsity in KV Cache, accepted by ISCA'24
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving: a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention: improve PagedAttention
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models: only computes and caches the KVs of a small number of layers
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models: compress KV cache
- CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion: very popular idea recently
- Block Transformer: Global-to-Local Language Modeling for Fast Inference: build KV Cache blocks from many tokens' KV Cache
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool: KV Cache management in P/D disaggregation architecture
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention: multi-round chat and memory management, accepted by ATC'24
- Stateful Large Language Model Serving with Pensieve: similar to CachedAttention
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving: P/D disaggregation architecture and KV Cache management
- P/D-Serve: Serving Disaggregated Large Language Model at Scale: a P/D based system, with D2D access optimization
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management: offload KV Cache
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption: a survey on optimizing KV Cache
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving: tensor management especially for LLM inference
- Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation: remove unimportant tokens in KV Cache
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving: compression and streaming transfer of KV Cache, accepted by SIGCOMM'24
- Compute Or Load KV Cache? Why Not Both?: recompute and load together for long context
- LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management: manage KV Cache by layers
- Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching: compress KV cache and multi-level memory
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models: better prefix caching
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference: low-rank KV cache and dynamically rebuilt KV cache
- ⭐ VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration: the first work I have seen that optimizes KV cache in vision models
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction: KV cache page eviction and recall, accepted by NIPS'24
- SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation: optimization on ZeRO? redesigns the data flow of heterogeneous hardware and sharded model training to minimize the excessive communication overhead, accepted by NIPS'24
- ⭐ KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management: memory management for KV cache and parameters, seems a novel work considering weight migration
- SYMPHONY: Improving Memory Management for LLM Inference Workloads: dynamically migrates K,V caches to enable fine-grained scheduling of inference requests
- Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management: efficiently migrate requests and their KV cache among GPUs
- Efficient LLM Inference with Activation Checkpointing and Hybrid Caching: recompute+cache for KV cache management, only recomputes attention (no projection)
- Memory Offloading for Large Language Model Inference with Latency SLO Guarantees: offload KV cache to CPU memory
- Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving: sparse attention is hot recently; dynamic KV cache budget and efficient KV cache loading from CPU
- Efficient and scalable huge embedding model training via distributed cache management: cache based on staleness and skewed popularity distributions
- BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference: different KV heads have different importance, then offload and compress
- Fast State Restoration in LLM Serving with HCache: cache for offloading KV cache to CPU, accepted by EuroSys'25
note: some papers about prefix sharing are not in this section
- LLM Query Scheduling with Prefix Reuse and Latency Constraints: balancing prefix reuse and fairness in query scheduling
- Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library: implement some APIs to reduce the shared memory footprint, accepted in HPC Asia'23
- Benchmarking and Dissecting the Nvidia Hopper GPU Architecture: help us understand GPUs
- SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving: optimizing energy consumption by lowering GPU frequency
- Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference: similar to cutlass, optimization on intel GPU
- Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels: for Ascend accelerators (perhaps also works for NVIDIA?)
Heterogeneous scenarios or single PCs are becoming increasingly important.
Optimizing the computation on CPUs or SSDs calls for different methods.
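A toy sketch of the overlap trick many of the systems below rely on: while layer i is being computed, a background thread prefetches layer i+1's weights from slow memory (CPU RAM or SSD). `load` and `compute` are hypothetical stand-ins; real systems use pinned memory, CUDA streams, and careful scheduling instead of Python threads.

```python
import threading
import numpy as np

def run_offloaded(n_layers, x, load, compute):
    """Compute layer i while a background thread prefetches layer i+1's
    weights, the basic compute/IO overlap used by offloading systems.

    `load(i)` fetches layer i's weights from slow memory (CPU RAM / SSD) and
    `compute(x, w)` applies one layer; both are hypothetical stand-ins.
    """
    current = load(0)
    for i in range(n_layers):
        prefetched, worker = {}, None
        if i + 1 < n_layers:
            worker = threading.Thread(
                target=lambda j=i + 1: prefetched.update(w=load(j)))
            worker.start()                 # overlap the next load with compute
        x = compute(x, current)            # "GPU" work for layer i
        if worker is not None:
            worker.join()
            current = prefetched["w"]
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((32, 32)) for _ in range(4)]
    y = run_offloaded(
        n_layers=len(weights),
        x=rng.standard_normal((1, 32)),
        load=lambda i: weights[i],              # pretend this reads from SSD
        compute=lambda x, w: np.tanh(x @ w),
    )
    print(y.shape)    # (1, 32)
```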
- Efficient LLM Inference on CPUs: LLMs with quantization on CPUs, by Intel, accepted by NIPS'23
- Inference Performance Optimization for Large Language Models on CPUs: xFasterTransformer, LLM inference optimization on CPUs, by Intel
- Distributed Inference Performance Optimization for LLMs on CPUs: similar work to the above, by Intel
- Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference: inference on CPUs based on advanced hardware
- TURNIP: A "Nondeterministic" GPU Runtime with CPU RAM Offload: free to run operations such as GPU kernel calls in many different orders
- Improving Throughput-oriented Generative Inference with CPUs: cooperation of CPUs and GPUs, accepted by APSys'23
- Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs: execute the operators on the CPU and GPU in parallel, by SJTU
- EdgeNN: Efficient Neural Network Inference for CPU-GPU Integrated Edge Devices: inference on edge devices, accepted by ICDE'23
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: by SJTU IPADS
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory: by Apple
- Efficient LLM inference solution on Intel GPU: Intel GPUs are interesting
- FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines: efficient serving with a CPU-GPU system
- Efficient and Economic Large Language Model Inference with Attention Offloading: similar to FastDecode
- Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference: similar to FastDecode: CPU for attention and GPU for the rest
- Petals: Collaborative Inference and Fine-tuning of Large Models: looks like heterogeneous resources are being utilized
- NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
- ⭐ A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors: use CPUs for DL, accepted by ASPLOS'24
- LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control: based on offloading
- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge: computation on CPUs with quantization
- TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading: how to use SSDs?
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference: offload KV Cache to CSDs (Computational Storage Drives)
- TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference: some ideas on using CPUs
- Improving Throughput-oriented LLM Inference with CPU Computations: pipelining in CPU-GPU inference
- Understanding Performance Implications of LLM Inference on CPUs: analysis of using CPUs for inference
- GPUs, CPUs, and... NICs: Rethinking the Network's Role in Serving Complex AI Pipelines: NICs can be important, especially in communication
- Pie: Pooling CPU Memory for LLM Inference: use CPU memory to enlarge batch size and improve throughput, by Ion Stoica
- NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference: offload KV cache and attention to CPU for larger batch sizes, similar to FastDecode, by Ion Stoica
- Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems: more like inference on personal devices
- Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation: use recomputation and transfer to re-produce the KV cache; can use their runtime and split parallelism
Inspired by AI PCs, this opens up a new area.
Edge systems are now included too.
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU: inference a 30B model with a 16GB GPU, accepted by ICML'23
- LLM as a System Service on Mobile Devices: an intro for LLM on private devices
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: based on sparsity in NN Layers
- ⭐ LLM for Mobile: An Initial Roadmap: a road map
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone: work on smartphone
- Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM: on edge devices, accepted by MICRO'24
- ⭐ HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators: features on mobile SoCs, tensor partition strategy, to do Heterogeneous AI inference
- PICE: A Semantic-Driven Progressive Inference System for LLM Serving in Cloud-Edge Networks: cloud(LLM)-edge(SmallLM) collaboration
- FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference: offloading based framework, asynchronous prefetching, balanced memory locking, and flexible tensor preservation
- Fast On-device LLM Inference with NPUs: chunked prefill, offload outlier to CPU/GPU, schedule computation to NPU/CPU/GPU, accepted by ASPLOS'25
- FlexInfer: Flexible LLM Inference with CPU Computations: offload kvc and weights to CPU, accepted by MLSYS'25
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs: decentralized system on consumer-level GPUs, though there will be some problems
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet: some techniques in this paper will be instructive
- ⭐ HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices: heterogeneous parallel computing using CPUs and GPUs
- Metis: Fast Automatic Distributed Training on Heterogeneous GPUs: accepted by ATC'24
- Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs: we can get a performance model for heterogeneous GPU clusters and learn from the algorithm analysis
- Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity: making heterogeneity-aware GPU provisioning decisions for LLM serving
In this part, researchers provide algorithm-based methods to optimize LLM inference.
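Several entries below (H2O, Scissorhands, Keyformer) build on the "heavy hitter" observation: a few tokens receive most of the attention mass. Here is a toy sketch of budget-based KV eviction in that spirit (my own simplification, not the exact policy of any of these papers).

```python
import numpy as np

def evict_heavy_hitters(acc_scores, budget, keep_recent=4):
    """Return indices of KV entries to keep under a cache budget.

    `acc_scores[i]` is the attention mass token i has accumulated over past
    decode steps. Always keep the `keep_recent` newest tokens and fill the
    rest of the budget with the highest-scoring "heavy hitter" tokens.
    """
    assert budget >= keep_recent
    n = len(acc_scores)
    if n <= budget:
        return np.arange(n)
    recent = np.arange(n - keep_recent, n)
    older = np.arange(n - keep_recent)
    n_heavy = budget - keep_recent
    heavy = older[np.argsort(acc_scores[older])[-n_heavy:]]
    return np.sort(np.concatenate([heavy, recent]))

if __name__ == "__main__":
    scores = np.array([5.0, 0.1, 3.2, 0.05, 0.2, 2.9, 0.01, 0.3])
    print(evict_heavy_hitters(scores, budget=5, keep_recent=2))
    # -> keeps the two newest tokens plus the three strongest older ones
```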
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models: accepted by NIPS'23
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time: consider the different importance of tokens in KV Cache, similar to H2O
- ⭐ SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference: skipping may be a useful method, like spec decoding
- Inference with Reference: Lossless Acceleration of Large Language Models: also a potential optimization
- Efficient Streaming Language Models with Attention Sinks: streaming LLM for infinite sequence lengths, by MIT and under guidance of Song HAN
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference: also important tokens, just like H2O, accepted by MLSys'24
- Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache: an optimization to H2O, accepted by MLSys'24
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval: use approximate nearest neighbor search to search the most relevant KV cache
- CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs: based on observation: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention: sparse attention
- SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation: algorithm optimization for less KV Cache
- Activation Sequence Caching: High-Throughput and Memory-Efficient Generative Inference with a Single GPU: use characterization results to optimize KV Cache management
- ⭐ DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale: you must know DeepSpeed
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- DeepSpeed Model Implementations for Inference (MII)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs: developed by ByteDance, accepted by IPDPS'23
- TurboTransformers: an efficient GPU serving system for transformer models: by Tencent Inc, accepted by PPoPP'21
- Accelerating Generative AI with PyTorch II: GPT, Fast: a blog in PyTorch, use only PyTorch code, gpt-fast
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving: based on FlexFlow
- FlashInfer: Kernel Library for LLM Serving
- Efficiently Programming Large Language Models using SGLang: we can get some optimization from here
- Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models: different parallel, by Tencent
LLM server providers will focus on this part. Engineering practices are just as important as algorithm optimization.
- ⭐ AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving: accepted by OSDI'23
- ⭐ STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining: elasticity will be important in the future, accepted by ASPLOS'23
- INFaaS: Automated Model-less Inference Serving: accepted by ATC'21
- Tabi: An Efficient Multi-Level Inference System for Large Language Models: under the guidance of Kai CHEN, accepted by EuroSys'23
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance: cost is what the service provider cares about most
- FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
- Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning: accepted by NSDI'23
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud: model ensembling, accepted by NSDI'22
- SLA-Driven ML Inference Framework for Clouds with Heterogeneous Accelerators: accepted by MLSys'22
- FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference: accepted by ICPP'23
- Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving
- BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching: accepted by SC'20
- MArk: exploiting cloud services for cost-effective, SLO-aware machine learning inference serving: accepted by ATC'19
- ⭐ MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters: challenges and solutions in real-world scenarios, accepted by NSDI'22
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads: under the guidance of Ion Stoica
- Learned Best-Effort LLM Serving: a best-effort serving system by UCB
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences: accepted by OSDI'22, enables microsecond-scale kernel preemption and controlled concurrent execution in GPU scheduling
- PipeSwitch: fast pipelined context switching for deep learning applications: PipeSwitch, a system that enables unused cycles of an inference application to be filled by training or other inference applications, accepted by OSDI'20
- ⭐ Paella: Low-latency Model Serving with Software-defined GPU Scheduling: how tasks are scheduled to GPUs, accepted by SOSP'23
- OTAS: An Elastic Transformer Serving System via Token Adaptation: elasticity in serving while considering SLOs
- DeltaZip: Multi-Tenant Language Model Serving via Delta Compression: multi-tenancy is interesting
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models: finds different problems in serving LLMs
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access: accepted by EuroSys'23
- Towards Pareto Optimal Throughput in Small Language Model Serving: small language model serving
- MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services: idea of QoE
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving: similar to FlexLLM
- LLMServingSim: A Simulation Infrastructure for LLM Inference Serving Systems: provides some features about LLM serving
- Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving: improvements to ORCA (SLS) and FastServe (ILS)
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems: consider serving efficiency from the energy view
- Power-aware Deep Learning Model Serving with μ-Serve: consider energy
- Eloquent: A More Robust Transmission Scheme for LLM Token Streaming: a new token transmission scheme, useful in chatbots
- Responsive ML inference in multi-tenanted environments using AQUA: serves several LLMs by time-sharing GPU cycles and offloading context to other GPUs in multi-tenant environments
- Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning: effect of hyper-parameters in the inference engine
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: request scheduling
- Efficient LLM Scheduling by Learning to Rank: rank requests based on output length prediction and schedule accordingly
- UELLM: A Unified and Efficient Approach for LLM Inference Serving: serving optimization in MaaS clouds
- One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving: scheduling the requests
- ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving: harvest stranded GPU resources for offline LLM inference tasks
- LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services: accepted by SC'24
- Revisiting SLO and Goodput Metrics in LLM Serving: check the SLO and goodput metrics in LLM serving
- Hops: Fine-grained heterogeneous sensing, efficient and fair Deep Learning cluster scheduling system: schedule tasks in a multi-tenant deep learning (DL) cluster, accepted by SoCC'24
- ⭐ Ensuring Fair LLM Serving Amid Diverse Applications: ensures fair LLM access across diverse applications, with a Copilot trace analysis
- BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching: exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing
- BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching: similar to BlendServe
- iServe: An Intent-based Serving System for LLMs: use a cost model to dynamically set the deployment configuration
- TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms: seems a practical engineering work; takes temperature and power consumption into account
- ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments: a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments, and fluctuating online conditions
- ⭐ A System for Microserving of LLMs: seems an idea and industrial practice that makes sense
- DeepFlow: Serverless Large Language Model Serving at Scale: provide fine-grained LLM service
- ⭐ Towards Swift Serverless LLM Cold Starts with ParaServe: pipeline parallelism with dynamically adjusted parallelism strategies, to accelerate cold starts
- λScale: Enabling Fast Scaling for Serverless Large Language Model Inference: a serverless inference system that achieves fast model scaling via fast model multicast, inference execution during model transmission, and dynamically constructed execution pipelines
- Medusa: Accelerating Serverless LLM Inference with Materialization: targets the cold start of serverless LLM serving, solving the problems of profiling available KV cache blocks and CUDA graph capture, accepted by ASPLOS'25
- Enabling Elastic Model Serving with MultiWorld: optimizing collective communication lib for LLM inference
- Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks
- AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive: communication strategy adapted at runtime, ICDCS'24
- Crux: GPU-Efficient Communication Scheduling for Deep Learning Training: a communication scheduler that aims to maximize GPU computation utilization by mitigating the communication contention among DLT jobs, SIGCOMM'24
- TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections: by Luo MAI, similar to SpotServe?
- SpotServe: Serving Generative Large Language Models on Preemptible Instances: by Xupeng MIAO and under guidance of Zhihao JIA
- Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances: by team of SpotServe
- FaPES: Enabling Efficient Elastic Scaling for Serverless Machine Learning Platforms: a FaaS-oriented Performance-aware Elastic Scaling system to enable efficient resource allocation in serverless platforms for ML jobs, accepted by SoCC'24
- Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale: resource allocation at cluster and data center scale
- Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows: scheduler for latency-sensitive request
- Llumnix: Dynamic Scheduling for Large Language Model Serving: scheduling across multiple instances may be helpful for me now
- Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths: solve Dynamic Input Lengths by multi-instance and request scheduling
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: scheduling based on a output length predictor
- Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs: request scheduling in cluster and on instance
- Fast Inference for Augmented Large Language Models: schedule for Augmented LLM
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling: a hodgepodge of prediction-based scheduling, memory management, and quantization
- The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving: cost model in request scheduling
- Queue Management for SLO-Oriented Large Language Model Serving: scheduling for requests with different models and different SLO requirements
- FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving: fairness and request switch
- HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location: request co-location to maximize serving throughput and prevent starvation, without compromising online serving latency
- Locality-aware Fair Scheduling in LLM Serving
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition: shares prefixes and optimizes the KV cache
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters: the beginning of LoRA serving, under the guidance of Ion Stoica, accepted by MLSys'24 (a minimal multi-LoRA batching sketch follows this list section)
- Dynamic LoRA Serving System for Offline Context Learning: successor of S-LoRA
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference: serving LoRA is becoming more and more important
- Punica: Multi-Tenant LoRA Serving: accepted by MLSys'24
- Petals: Collaborative Inference and Fine-tuning of Large Models
- LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design: maybe useful, kernel optimization
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving: accepted by OSDI'24
- Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU: optimize SGMV kernels
- V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM: LoRA for vision models, and optimize LoRA kernels, accepted by EuroSys'25
- Efficient Multi-task LLM Quantization and Serving for Multiple LoRA Adapters: facilitates the sharing of a single quantized model for multiple LoRA adapters, accepted by NIPS'24
- Comparative Analysis and Optimization of LoRA Adapter Co-serving for Large Language Models: more like a survey for LoRA serving
- DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs: compresses model deltas to serve multiple full-parameter fine-tuned models (maybe not LoRA fine-tuning?)
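Most of the LoRA-serving papers above batch requests that target different adapters over one shared base model. A minimal sketch of that idea, assuming illustrative adapter names and toy shapes (not the S-LoRA/Punica kernels):

```python
# Minimal sketch (not the S-LoRA/Punica implementation): batched inference where
# each request in the batch uses its own LoRA adapter. The base GEMM is shared;
# the low-rank deltas are applied per request. Adapter names/shapes are illustrative.
import numpy as np

d_in, d_out, rank = 1024, 1024, 16
W = np.random.randn(d_in, d_out) * 0.01          # shared base weight

# A registry of adapters: adapter_id -> (A, B) with delta = x @ A @ B
adapters = {
    "lora_math": (np.random.randn(d_in, rank) * 0.01, np.random.randn(rank, d_out) * 0.01),
    "lora_code": (np.random.randn(d_in, rank) * 0.01, np.random.randn(rank, d_out) * 0.01),
}

def lora_batched_forward(x, adapter_ids):
    """x: [batch, d_in]; adapter_ids: per-request adapter choice."""
    y = x @ W                                     # one shared GEMM for the whole batch
    for i, aid in enumerate(adapter_ids):         # per-request low-rank correction
        A, B = adapters[aid]
        y[i] += (x[i] @ A) @ B                    # rank-r update, cheap compared to x @ W
    return y

batch = np.random.randn(2, d_in)
out = lora_batched_forward(batch, ["lora_math", "lora_code"])
print(out.shape)  # (2, 1024)
```

Real systems replace the per-request loop with a gathered kernel (e.g., SGMV) and page adapters between host and GPU memory; the sketch only shows the dataflow.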
For LoRA but not serving
- ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin: potential new style of LoRA
- Higher Layers Need More LoRA Experts
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- LoRA Meets Dropout under a Unified Framework: Analyze LoRA algorithmically
- HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning: algorithm optimization for LoRA
- SBoRA: Low-Rank Adaptation with Regional Weight Updates: an algorithm optimization for LoRA
- A Survey on LoRA of Large Language Models: survey of LoRAs, including parallel LoRA computing and Multi-LoRA, github
- mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs: can study the LoRA-aware pipeline parallelism scheme, github
- MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts: LoRA based MoE, github
- GongBu: Easily Fine-tuning LLMs for Domain-specific Adaptation: LLM fine-tuning tools
- Adapters Selector: Cross-domains and Multi-tasks LoRA Modules Integration Usage Method: selects several LoRAs for a given piece of content
- SplitLLM: Hierarchical Split Learning for Large Language Model over Wireless Network: split learning(?); trains LoRA weights in a wireless-network environment, storing the LoRA weights on edge servers?
- Revolutionizing Large Model Fine-Tuning: The Role of LoRA in Parameter-Efficient Adaptation: a survey, can provide some reference
- HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression: optimizes fine-tuning memory overhead via quantization, accepted by MLSys'25
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving
- Latency-Guaranteed Co-Location of Inference and Training for Reducing Data Center Expenses: place training and inference together, control the inference latency to the desired SLO, while maximizing the throughput of the training jobs co-located on the same GPUs, accepted by ICDCS'24
Long-context inference is a hot topic recently.
- Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference: like an update to H2O or DejaVu, et al.; each attention head gets a different memory budget
- Context Parallelism for Scalable Million-Token Inference
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection: selects important KV cache entries to take part in the attention computation (see the sketch below)
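The selection-based long-context papers above share one mechanism: score the cached keys against the current query and attend only over the most important positions. A toy sketch of that mechanism, with an assumed top-k criterion rather than any specific paper's selection rule:

```python
# Minimal sketch of token-level KV cache selection for long-context decoding:
# score cached keys against the current query and attend only over the top-k.
# This illustrates the general idea, not any specific paper's selection rule.
import numpy as np

def selective_attention(q, K, V, k_keep=256):
    """q: [d]; K, V: [seq, d]; keep only the k_keep highest-scoring positions."""
    scores = K @ q / np.sqrt(q.shape[0])          # similarity of each cached key to the query
    if K.shape[0] > k_keep:
        keep = np.argpartition(scores, -k_keep)[-k_keep:]   # indices of important tokens
        scores, K_sel, V_sel = scores[keep], K[keep], V[keep]
    else:
        K_sel, V_sel = K, V
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_sel                               # attention output over the selected subset

d, seq = 64, 8192
q = np.random.randn(d)
K = np.random.randn(seq, d)
V = np.random.randn(seq, d)
print(selective_attention(q, K, V).shape)          # (64,)
```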
Processing different ML workloads in a cluster.
- PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters: serve multiple different loads in GPU cluster, accepted by SC'24
- PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption: why Encryption in LLM inference? by IPADS, accepted by ASPLOS'25
- Topology-aware Preemptive Scheduling for Co-located LLM Workloads: schedule different workloads
- ⭐ Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models: retrieval will be helpful, but how to use it?
- Generative Dense Retrieval: Memory Can Be a Burden: accepted by EACL'24
- ⭐ Accelerating Retrieval-Augmented Language Model Serving with Speculation: also a paper for RaLM
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation: improves RAG inference with caching, under the guidance of Xin JIN
- FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research
- Accelerating Retrieval-Augmented Language Model Serving with Speculation: help understand RaLM
- NinjaLLM: Fast, Scalable and Cost-effective RAG using Amazon SageMaker and AWS Trainium and Inferentia2
- Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting: RAG with speculative decoding, different draft models for different RAG
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion: optimize KV cache reuse(prefix cache)
- RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation: trade-off between latency and quality
- Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation: combines RAG with prefix caching (see the KV-reuse sketch below)
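Several of the RAG-serving papers above (RAGCache, CacheBlend, Cache-Craft) reuse the KV cache of previously seen prefixes or chunks. A minimal sketch of prefix-hash lookup at block granularity, with placeholder data structures instead of real KV tensors:

```python
# Minimal sketch of prefix KV cache reuse (the idea behind prefix/chunk caches in
# RAG serving): hash successive token prefixes, reuse any cached KV blocks, and
# only run prefill on the uncovered suffix. Data structures are illustrative.
import hashlib

kv_store = {}   # prefix-hash -> KV tensors for that prefix (placeholder: a string)

def prefix_key(tokens):
    return hashlib.sha256(bytes(str(tokens), "utf-8")).hexdigest()

def prefill_with_reuse(tokens, block=16):
    """Find the longest cached prefix (at block granularity), compute the rest."""
    reused = 0
    for end in range(len(tokens) - len(tokens) % block, 0, -block):
        if prefix_key(tokens[:end]) in kv_store:
            reused = end
            break
    # "Compute" KV for the uncovered suffix and cache every new block boundary.
    for end in range(reused + block, len(tokens) + 1, block):
        kv_store[prefix_key(tokens[:end])] = f"KV[0:{end}]"
    return reused, len(tokens) - reused            # tokens reused vs. recomputed

doc_chunk = list(range(100))                       # e.g. a retrieved document chunk
print(prefill_with_reuse(doc_chunk + [1, 2, 3]))   # first call: nothing reused
print(prefill_with_reuse(doc_chunk + [7, 8, 9]))   # second call shares the document prefix
```

Reuse here happens only at block boundaries; the suffix beyond the last cached block is always recomputed, which is the gap that chunk-level and position-independent caching papers try to close.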
Here are two repositories with some papers on MoE: Papers: MoE/Ensemble, and MoE papers to read. A minimal top-k gating and expert-residency sketch follows this list.
- ⭐ DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: accepted by ICML'22
- Accelerating Distributed MoE Training and Inference with Lina: both training and inference, accepted by ATC'23
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts: accepted by MLSys'23
- Tutel: Adaptive Mixture-of-Experts at Scale: accepted by MLSys'23
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference: accepted by ISCA'24
- Optimizing Mixture of Experts using Dynamic Recompilations: under the guidance of Zhihao JIA
- Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping: expert swapping is interesting
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference: some hot optimizations for inference, accepted by NIPS'24
- Exploiting Transformer Activation Sparsity with Dynamic Inference
- SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
- Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production: accepted by ACL'22
- Fast Inference of Mixture-of-Experts Language Models with Offloading: combines MoE with offloading
- ⭐ MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving: under the guidance of Luo MAI, provides some features and designs for MoE inference
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement: trains MoE with a new schedule plan, may also work for inference
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models: quantized experts and expert management
- Toward Inference-optimal Mixture-of-Expert Large Language Models: some analysis of training MoE based on inference cost
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules: communication optimization in MoE, accepted by InfoCom'24
- SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models: based on offloading, accepted by MLSys'24
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy: introduces some features of MoE, accepted by ICLR'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: also introduces some features of MoE
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models: an introduction paper
- Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies: all-to-all communication, HPDC'24
- Scattered Mixture-of-Experts Implementation: ScatterMoE, an implementation of sparse MoE
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts: the shortcut connection looks more like an algorithm-level optimization, and it provides opportunities for overlapping
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model: an open-source work whose inference is based on expert parallelism
- SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget: MoE expert offloading, at the cost of reduced accuracy
- ProMoE: Fast MoE-based LLM Serving using Proactive Caching: an optimization of Pre-gated MoE, by IPADS
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design: a pre-gating router decoupled from the MoE backbone facilitates system-friendly pre-computation and lookahead scheduling, NIPS'24
- MoEsaic: Shared Mixture of Experts: shares experts among different MoE instances; "MoE's modular architecture lets users compose their model from popular off-the-shelf experts" is a new scenario
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference: uses quantization to decrease the loading overhead of uncached experts, on edge devices
- ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference: prediction- and offloading-based optimization
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs: uses an offloading pipeline to accelerate MoE inference on a single GPU
- ⭐ MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems: benchmarking for MoE systems
- ⭐ Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection: damn! I had considered this before :( The key insight is that expert importance varies significantly across tokens and inference phases; it utilizes this to solve the all-activation problem
- ⭐ EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference: GEMM implementation optimization and all-to-all communication overlap
- ⭐ Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling: optimizes the all-to-all order and co-locates experts from different models
- MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing: utilizes expert dependencies to optimize GPU load balance and all-to-all latency
- fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving: fine-grained expert offloading, prefetching, and caching
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts: fine-grained task scheduling and computation/all-to-all overlap
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs: offloads MoE weights to the CPU by layer, accepted by ASPLOS'25
- ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling: schedules computation and communication in MoE training, perhaps useful for MoE inference; accepted by EuroSys'24
- ST-MoE: Designing Stable and Transferable Sparse Expert Models: an early work on MoE
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models: an algorithm-level change to MoE
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping: computation-communication overlapping, accepted by MLSys'24
- Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training: training with offloading, ICML'24
- MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules: dedicated schedules for MP+EP+ESP MoE training, may also work for inference
- Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing: expert load stabilizes in the middle and late stages of training, but this may not work as well for inference
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization: parallelization strategies for MoE, accepted by ATC'23
- APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes: fine-tunes MoE models using the CPU plus some algorithmic insights, accepted by SC'24
- Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing: predicts the expert workload to optimize training
- FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models: there isn't much novel technology(?), accepted by ASPLOS'25
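Many of the offloading and prefetching papers above revolve around the same loop: a router picks top-k experts per token, and the system must make those experts resident on the GPU in time. A toy sketch of that loop, assuming a naive LRU residency policy and toy expert shapes:

```python
# Minimal sketch of top-k MoE routing with a naive "is this expert resident?" check,
# illustrating why the offloading/prefetching papers above track expert activations.
# Shapes, the LRU policy, and the residency handling are all illustrative assumptions.
import numpy as np
from collections import OrderedDict

n_experts, d, top_k, gpu_slots = 8, 64, 2, 4
router_w = np.random.randn(d, n_experts) * 0.1
experts = [np.random.randn(d, d) * 0.02 for _ in range(n_experts)]  # toy expert FFNs (one matrix each)
resident = OrderedDict()                                            # expert_id -> True, LRU order

def ensure_resident(eid):
    """Fake offloading: evict the least-recently-used expert if the GPU 'slots' are full."""
    if eid in resident:
        resident.move_to_end(eid)
        return
    if len(resident) >= gpu_slots:
        resident.popitem(last=False)   # evict LRU expert (would trigger a CPU->GPU copy in practice)
    resident[eid] = True

def moe_forward(x):
    """x: [d]. Route to top-k experts and combine their outputs by gate weight."""
    logits = x @ router_w
    topk = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[topk]); gates /= gates.sum()
    y = np.zeros_like(x)
    for g, eid in zip(gates, topk):
        ensure_resident(eid)           # offloading systems try to hide or avoid this step
        y += g * (experts[eid] @ x)
    return y

print(moe_forward(np.random.randn(d)).shape)   # (64,)
```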
- MOSEL: Inference Serving Using Dynamic Modality Selection: improving system throughput by 3.6x with an accuracy guarantee and shortening job completion times by 11x
- Generative AI Beyond LLMs: System Implications of Multi-Modal Generation: by META
- Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations: by Google
- Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference: optimization for diffusion models by cache
- DISTMM: Accelerating distributed multimodal model training: helpful although it is made for training, accepted by NSDI'24
- Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training: distributed multimodal training
- Efficiently serving large multimodal models using EPD Disaggregation
- MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving: position-independent caching, with both reuse and recompute, may lead to performance loss
- Characterizing and Efficiently Accelerating Multimodal Generation Model Inference: some insights
- DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models: disaggregation in multimodal training, under the guidance of Xin JIN
- Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management: efficient multimodal model training
- Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models: serving Diffusion models, accepted by NSDI'24
- DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines: accepted by MLSys'24
- SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules: more papers in diffusion models
- PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving: algorithm-based framework
What is this? Maybe multiple LLMs?
- Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems: a new scenario, by Stanford
- ALTO: An Efficient Network Orchestrator for Compound AI Systems: also new to me, by Stanford
- Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling: accuracy scaling is interesting, accepted by ASPLOS'24
- MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving: multiple LLMs
- ROUTERBENCH: A Benchmark for Multi-LLM Routing System: but what is multi-LLM?
- Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference: prompt KV cache reuse, accepted by MLSys'24
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving: similar to BlockLLM?
- Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution: for LLM-based Applications
- RouteLLM: Learning to Route LLMs with Preference Data: uses multiple LLMs for efficient serving (see the routing sketch after this list)
- USHER: Holistic Interference Avoidance for Resource Optimized ML Inference: inference several models simultaneously
- CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory: a new scenario, Collaboration-of-Experts instead of Mixture-of-Experts, which provides some new opportunities, accepted by ASPLOS'25
- Teola: Towards End-to-End Optimization of LLM-based Applications: end-to-end optimization
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable: accepted by OSDI'24
- Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications: many LLM apps share GPU, accepted by EuroSys'24
- Autellix: An Efficient Serving Engine for LLM Agents as General Programs: multi-agent serving has something in common with LLM applications; scheduling and preemption
- Fast Inference for Augmented Large Language Models: seems like a subclass of multi-agent
- Characterization of Large Language Model Development in the Datacenter: fault-tolerant serving in the future?
- Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement: Fault Tolerance in MoE training
- Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training: checkpointing in MoE
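The multi-LLM routing papers above (e.g., RouteLLM, Expert Router) share a simple serving pattern: a router estimates request difficulty and dispatches to a cheap or an expensive model. A minimal sketch with a placeholder difficulty heuristic, not any paper's learned router:

```python
# Minimal sketch of multi-LLM routing: a router scores each request and dispatches
# easy ones to a cheap model and hard ones to an expensive one. The difficulty
# heuristic and model endpoints are placeholders, not any paper's actual router.
from dataclasses import dataclass

@dataclass
class ModelEndpoint:
    name: str
    cost_per_1k_tokens: float

SMALL = ModelEndpoint("small-llm", 0.1)
LARGE = ModelEndpoint("large-llm", 1.0)

def difficulty(prompt: str) -> float:
    """Placeholder difficulty score; real routers learn this from preference data."""
    hard_markers = ("prove", "derive", "multi-step", "explain why")
    return 0.2 * len(prompt) / 1000 + sum(m in prompt.lower() for m in hard_markers) * 0.5

def route(prompt: str, threshold: float = 0.5) -> ModelEndpoint:
    return LARGE if difficulty(prompt) >= threshold else SMALL

for p in ["What is 2+2?", "Prove that the scheduler above is starvation-free."]:
    print(p[:30], "->", route(p).name)
```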
It is usually related to CPU-GPU heterogeneity and GPU power consumption.
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving: early exits, accepted by SOSP'24 (see the sketch after this list)
- Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation: early exits and some system optimization, accepted by SOSP'24
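The last two entries build on early exits: an exit head after each layer checks prediction confidence and stops computation when it is high enough. A toy sketch of the mechanism, with random stand-in layers and heads:

```python
# Minimal sketch of early-exit inference: after each layer, a lightweight exit head
# checks prediction confidence; if it is high enough we skip the remaining layers.
# Layers, exit heads, and the confidence threshold are toy stand-ins.
import numpy as np

n_layers, d, n_classes = 12, 64, 10
layers = [np.random.randn(d, d) * 0.05 for _ in range(n_layers)]
exit_heads = [np.random.randn(d, n_classes) * 0.05 for _ in range(n_layers)]

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def early_exit_forward(x, threshold=0.9):
    """Run layers until an exit head is confident enough; return (prediction, layers used)."""
    h = x
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        h = np.tanh(layer @ h)
        probs = softmax(head.T @ h)
        if probs.max() >= threshold:               # confident: exit early, saving the remaining layers
            return int(probs.argmax()), i + 1
    return int(probs.argmax()), n_layers            # no early exit triggered

pred, used = early_exit_forward(np.random.randn(d))
print(f"predicted class {pred} after {used}/{n_layers} layers")
```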
Wise men learn from others.
- Orca 2: Teaching Small Language Models How to Reason
- FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference: optimization for retrieval-augmented language model
- Optimizing Dynamic Neural Networks with Brainstorm: this idea has the potential to go further, accepted by OSDI'23
- Ring Attention with Blockwise Transformers for Near-Infinite Context: Ring Attention?
- Reducing Activation Recomputation in Large Transformer Models: by NVIDIA
- Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models: an interesting performance metric, accepted by NIPS'23
- FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication: accepted by SIGMOD'23
- Efficient Multi-GPU Graph Processing with Remote Work Stealing: accepted by ICDE'23
- ARK: GPU-driven Code Execution for Distributed Deep Learning: accepted by NSDI'23
- Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs: accepted by MLSys'22
- Golgi: Performance-Aware, Resource-Efficient Function Scheduling for Serverless Computing: Scheduling for Serverless Computing
- FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters: expand to other ML models instead of LLM
- Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication
- FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing
- Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM: efficient SpMM, accepted by ASPLOS'24
- GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching: GPU memory pool, accepted by ASPLOS'24
- QuickLLaMA: Query-aware Inference Acceleration for Large Language Models: an inference-friendly LLaMA architecture
- HybridFlow: A Flexible and Efficient RLHF Framework: framework for RLHF
- Marconi: Prefix Caching for the Era of Hybrid LLMs: prefix caching for new model architectures that combine attention with SSMs
I'd like to create a separate area for data flows. It's just my preference.
- ⭐ FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks: dataflow in inference
- Pathways: Asynchronous Distributed Dataflow for ML: accepted by MLSys'22
- VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware: accepted by MLSys'22
How about data pre-processing overhead in training?
Just my preference.
- Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication
- GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism
- PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks: accepted by IPDPS'24
- NPA: Improving Large-scale Graph Neural Networks with Non-parametric Attention: SIGMOD'24
- Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression: compress node features in graph, accepted by VLDB'24
- Mega: More Efficient Graph Attention for GNNs: optimize graph attention efficiency, ICDCS'24
- TORCHGT: A Holistic System for Large-Scale Graph Transformer Training: graph transformer model
Just my preference, too.