This summary includes three parts:
- some repositories that you can follow
- some representative researchers or labs that you can follow
- some important works in different research directions
For example, LLMSys-PaperList contains many excellent articles and keeps updating (which I believe is the most important thing for a paper list). Awesome-LLM-Inference and Awesome_LLM_Accelerate-PaperList are also worth reading.
Besides, awesome-AI-system also works very well, and you can find other repositories in its contents.
The blog "Large Transformer Model Inference Optimization" helped me a lot at the beginning.
The blog OpenAI Keynote on Building Scalable AI Infrastructure seems to be a leading guide.
Follow others' research, and find your own ideas.
It is not my intention to judge the work of these pioneers, and I understand that the limits of my knowledge will lead me to leave out many important people.
If you have a different opinion, please feel free to communicate with me through an issue.
In no particular order!!
Damn, I can't remember the names of foreigners.
Zhihao JIA: FlexFlow and other impressive works, important role in MLSys, affiliated with CMU
Tianqi CHEN: TVM, XGBoost, and other impressive works, important role in Machine Learning Systems and ML compilers, affiliated with CMU
Song HAN: many important works in efficient ML including sparsity and quantization. Btw, the class TinyML and Efficient Deep Learning Computing is highly recommended, affiliated with MIT
Zhen DONG: many important works in quantization and high-performance ML, affiliated with UCB
Tri DAO: author of FlashAttention, affiliated with Princeton
Ce ZHANG: famous in efficient MLSys, affiliated with UChicago
Ion Stoica: Alpa, Ray, Spark, et al.
SPCL: Scalable Parallel Computing Lab, affiliated with ETHz
Luo MAI: affiliated with University of Edinburgh
IPADS: focuses more on PURE systems, but also makes great progress in MLSys, affiliated with SJTU
EPCC: Emerging Parallel Computing Center, where parallel computing and MLSys are naturally combined, affiliated with SJTU
Xin JIN: FastServe and LLMCad are impressive work, affiliated with PKU
Bin CUI: important role in MLSys including DL, GNN, and MoE, affiliated with PKU
Jidong ZHAI: leading many important work in MLSys, affiliated with THU
Lingxiao MA: many important works in MLSys at top conferences, affiliated with MSRA
Cheng LI: high-performance systems and MLSys, affiliated with USTC
Xupeng MIAO: SpotServe, SpecInfer, HET, et al.
Chuan WU: some important works in distributed machine learning systems, affiliated with HKU
James CHENG: affiliated with CUHK
Kai CHEN: database works well with MLSys, affiliated with HKUST
Lei CHEN: database works well with MLSys, many papers so I recommend you to focus on his top-conference papers, affiliated with HKUST
Yang YOU: leader of Colossal-AI, affiliated with NUS
Wei WANG: work in System and MLSys, affiliated with HKUST
I hope to summarize these impressive works based on their research directions.
But my summary may not be informative enough, and I am looking forward to your additions.
Perhaps someone should write a detailed survey.
Periodically checking the "cited by" of the papers marked with ⭐ will be helpful.
Paragraphs marked with 💡 are not perfect yet.
- ⭐ Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models: evaluations help you find the bottleneck
- ⭐ Full Stack Optimization of Transformer Inference: a Survey: a survey by UCB
- ⭐ Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: worth a read
- ⭐ Deep Learning Workload Scheduling in GPU Datacenters: A Survey: survey for GPU Datacenters DL Workload Scheduling
- ⭐ Towards Efficient and Reliable LLM Serving: A Real-World Workload Study: a benchmark for LLM serving
- ⭐ LLM Inference Unveiled: Survey and Roofline Model Insights: both survey and analysis
- A Survey of Resource-efficient LLM and Multimodal Foundation Models: worth reading
- Training and Serving System of Foundation Models: A Comprehensive Survey
- Model Compression and Efficient Inference for Large Language Models: A Survey
- ⭐ Towards Coarse-to-Fine Evaluation of Inference Efficiency for Large Language Models
- ⭐ A Survey on Efficient Inference for Large Language Models: worth reading
- Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models
- ⭐ Navigating Challenges and Technical Debt in Large Language Models Deployment: important
- The CAP Principle for LLM Serving: another angle
- Demystifying Data Management for Large Language Models: talking about databases in LLM, by Xupeng MIAO, accepted by SIGMOD'24
- Benchmarking LLM Inference Backends: vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI: with code
- A Survey on Mixture of Experts
- Analyzing LLM performance: The impact of high-bandwidth memory on model inference: analysis of inference
- Inference Optimization of Foundation Models on AI Accelerators
- LLM Inference Serving: Survey of Recent Advances and Opportunities: newest
- Contemporary Model Compression on Large Language Models Inference: survey in model compression
- ⭐ Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning: brings insights for MLSys
- Resource-efficient Algorithms and Systems of Foundation Models: A Survey
- ⭐ A Survey on Inference Optimization Techniques for Mixture of Experts Models: a survey on MoE models
- Deploying Foundation Model Powered Agent Services: A Survey: survey for AI agent service
Making useful benchmarks or evaluations is helpful (a minimal timing sketch follows this list).
- MLPerf Inference Benchmark: inference github, a well-known benchmark
- llmperf: evaluate both performance and correctness, but based on ray
- The Importance of Workload Choice in Evaluating LLM Inference Systems: important angles in LLM inference systems
- Vidur: A Large-Scale Simulation Framework For LLM Inference: test the performance of LLM inference
- Metron: Holistic Performance Evaluation Framework for LLM Inference Systems: an evaluation framework
- LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale: a simulator
- LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators: inference + hardware
- Towards Efficient Large Multimodal Model Serving: a survey on multimodal serving, and a decoupled serving architecture that enables independent resource allocation and adaptive scaling for each stage
- LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference: a performance evaluation framework, can be used to estimate the time cost
- Predicting LLM Inference Latency: A Roofline-Driven ML Method: predict inference performance based on the Roofline model
- GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments: a work for predicting LLMSys performance
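As a starting point, here is a minimal sketch (my own, not from any benchmark above) of the metrics most of these works report: TTFT (time to first token), TPOT (time per output token), and throughput. `generate_stream` is a hypothetical placeholder for whatever streaming client your engine exposes.

```python
import time

def measure_request(generate_stream, prompt):
    """Measure TTFT, TPOT and throughput for one streaming request.

    `generate_stream(prompt)` is a hypothetical client that yields tokens
    one by one; swap in your engine's real streaming API.
    """
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0

    for _ in generate_stream(prompt):
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now          # prefill finished here
        n_tokens += 1

    end = time.perf_counter()
    ttft = first_token_time - start                         # time to first token
    tpot = (end - first_token_time) / max(n_tokens - 1, 1)  # time per output token
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "throughput_tok_per_s": n_tokens / (end - start),
    }

# toy usage with a fake engine that "generates" 8 tokens
if __name__ == "__main__":
    def fake_stream(prompt):
        for i in range(8):
            time.sleep(0.01)
            yield f"tok{i}"
    print(measure_request(fake_stream, "hello"))
```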
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, pdf
prior paper: Blockwise Parallel Decoding for Deep Autoregressive Models
Break the Sequential Dependency of LLM Inference Using Lookahead Decoding: by lookahead decoding
Both frameworks use parallel decoding and deserve more detailed study (a toy verification sketch follows the list below).
There are some interesting papers about parallel decoding.
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
- ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
- APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding: how to make it auto-parallel?
In fact, I'm not so familiar with this topic. But perhaps OpenAI o1 used this...
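To make the idea concrete, here is a toy sketch (my own simplification, not code from any of the papers above) of the verification step that Medusa/lookahead-style parallel decoding relies on: guess several future tokens, score them with a single forward pass, and accept the longest prefix that matches the model's own greedy choices. `greedy_next_tokens` is a hypothetical batched model call.

```python
def verify_draft(greedy_next_tokens, context, draft):
    """Accept the longest prefix of `draft` that the base model itself would
    have produced greedily.

    `greedy_next_tokens(context, draft)` is a hypothetical single forward
    pass that returns, for each position i, the model's greedy next token
    given context + draft[:i] (one batched pass instead of len(draft)
    sequential decode steps).
    """
    model_choices = greedy_next_tokens(context, draft)  # len == len(draft)
    accepted = []
    for guess, choice in zip(draft, model_choices):
        if guess != choice:
            accepted.append(choice)   # keep the model's own token and stop
            break
        accepted.append(guess)
    return accepted

# toy usage: a "model" that always continues an arithmetic sequence
if __name__ == "__main__":
    def fake_model(context, draft):
        seq, out = list(context), []
        for g in draft:
            out.append(seq[-1] + 1)   # greedy choice: previous token + 1
            seq.append(g)             # then condition on the drafted token
        return out

    print(verify_draft(fake_model, context=[1, 2, 3], draft=[4, 5, 9, 10]))
    # -> [4, 5, 6]  (the wrong guess 9 is replaced by the model's own token 6)
```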
Spend more time on inference than on pre-training.
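As a concrete (toy) illustration of repeated sampling, here is a best-of-N sketch where `sample` and `score` are hypothetical stand-ins for an LLM call and a reward/verifier model; the papers below study how far this kind of test-time compute can be scaled.

```python
import random

def best_of_n(sample, score, prompt, n=16):
    """Repeated sampling (best-of-N): draw N candidate answers and keep the
    one a verifier/reward model likes best.

    `sample(prompt)` and `score(prompt, answer)` are hypothetical stand-ins
    for an LLM call and a reward/verifier model call.
    """
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

# toy usage: the "LLM" guesses numbers, the "verifier" prefers larger ones
if __name__ == "__main__":
    random.seed(0)
    answer = best_of_n(
        sample=lambda p: random.randint(0, 100),
        score=lambda p, a: a,
        prompt="pick a number",
        n=8,
    )
    print(answer)
```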
- ⭐ Large Language Monkeys: Scaling Inference Compute with Repeated Sampling: Starter material, apply repeated sampling
- ⭐ Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters: Starter material, scaling LLM Test-Time to improve accuracy
- Strategic Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation: seems fewer people have explored the efficiency of CoT; the two-stage method gives me some thoughts
- Fast Best-of-N Decoding via Speculative Rejection: optimize alignment in inference, accepted by NIPS'24
This topic is about GPT-o1, aka the strawberry.
- ⭐ Reverse engineering OpenAI’s o1: a leading blog for introduction in OpenAI’s o1
- ⭐ Chain-of-Thought Prompting Elicits Reasoning in Large Language Models: base work
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models: an improvement based on CoT
- Large Language Model Guided Tree-of-Thought: also a ToT
- Let's Verify Step by Step: verify by step can be helpful
- Language Agent Tree Search Unifies Reasoning Acting and Planning in Language Models: what is Language Agent Tree Search (LATS)? accepted by ICML'24
- Critique-out-Loud Reward Models
- Generative Verifiers: Reward Modeling as Next-Token Prediction: a verifier, by DeepMind
Also known as speculative sampling, a form of model collaboration.
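Before the paper list, a minimal greedy-acceptance sketch of the draft-then-verify loop (my own simplification; the papers below use probabilistic rejection sampling to keep the target distribution exact, plus token trees, adaptive draft lengths, etc.). `draft_next` and `target_greedy` are hypothetical model calls.

```python
def speculative_decode(target_greedy, draft_next, prompt, k=4, max_new=32):
    """Greedy-acceptance sketch of draft-then-verify decoding.

    - `draft_next(tokens)`: cheap draft model, returns one next token.
    - `target_greedy(tokens, draft)`: one big-model forward pass that returns
      the target model's greedy next token at every drafted position plus one
      bonus position (len(draft) + 1 tokens in total).
    Both are hypothetical stand-ins; real systems verify with rejection
    sampling over probabilities, not greedy matching.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) draft k tokens autoregressively with the small model
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2) verify them with a single target-model pass
        target = target_greedy(tokens, draft)       # length k + 1
        n_accept = 0
        for d, t in zip(draft, target):
            if d != t:
                break
            n_accept += 1
        # 3) keep accepted tokens plus one token from the target model
        tokens += draft[:n_accept] + [target[n_accept]]
    return tokens
```

The efficiency gain comes from step 2: verifying k drafted tokens costs roughly one target-model forward pass instead of k sequential ones.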
- ⭐ Accelerating Large Language Model Decoding with Speculative Sampling: opening of Speculative Decoding, by DeepMind
- ⭐ Fast inference from transformers via speculative decoding: work from a similar period as the one above, by Google, accepted by ICML'23
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification: paper under guidance of Zhihao JIA, use Tree decoding and a set of draft models
- LLMCad: Fast and Scalable On-device Large Language Model Inference: paper under guidance of Xin JIN, speculative decoding for on-device LLM inference based on tree decoding and other optimizations
- Speculative Decoding with Big Little Decoder: similar to speculative decoding, accepted in NIPS'23
- Online Speculative Decoding: update draft model online
- Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding: the trade-off analysis deserves a read
- The Synergy of Speculative Decoding and Batching in Serving Large Language Models: analysis of combining spec decoding with batching
- REST: Retrieval-Based Speculative Decoding: use retrieval for spec decoding, some familiar names in the authors list
- Cascade Speculative Drafting for Even Faster LLM Inference: by UIUC
- Multi-Candidate Speculative Decoding: multiple draft models
- ⭐ Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding: survey for Speculative Decoding
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- GliDe with a CaPE: A Low-Hassle Method to Accelerate Speculative Decoding: a work with Yang YOU's name
- Decoding Speculative Decoding: provide some insight into the selection of draft models
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting: perhaps tree speculative decoding?
- ⭐ Speculative Streaming: Fast LLM Inference without Auxiliary Models: a promising method for speculative decoding
- Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding: accelerating spec decoding
- Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens: accelerate spec decoding by fusing all tokens
- Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding: using several SSMs, adaptive SSM prediction length, pipelining SSM decode and LLM verify
- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
- Optimal Block-Level Draft Verification for Accelerating Speculative Decoding
- Accelerating LLM Inference with Staged Speculative Decoding: token tree and a second stage of speculative decoding
- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding: combine KV cache with spec decoding
- EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models: algorithm optimization in spec decoding
- SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices: any difference with specinfer?
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput: model the speculative decoding length
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding: spec decoding for long-context
- QSpec: Speculative Decoding with Complementary Quantization Schemes: spec decoding with quantization, a novel A+B
- Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement: optimization on Medusa
- The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation: use learning-free, negligible-cost draft strategies, namely N-grams obtained from the model weights and the context
- EdgeLLM: Fast On-device LLM Inference with Speculative Decoding: seems an extended work of LLMCad
- AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding: a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding: use both LLM and SLM
- Adaptive Skeleton Graph Decoding: successor of Skeleton-of-Thought
Some knowledge about data parallelism, tensor parallelism, and pipeline parallelism will help in this track.
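As a tiny refresher, here is a numpy sketch of a Megatron-style column-parallel linear layer simulated on one host: the weight's output columns are split across "devices", each shard computes its part, and the results are concatenated (an all-gather in a real system). This is my own illustration, not code from any paper below.

```python
import numpy as np

def column_parallel_linear(x, w, n_shards=2):
    """Column-parallel linear layer, simulated with list entries as "devices".

    The weight's output dimension is split across `n_shards`; each shard
    computes x @ w_shard independently and the results are concatenated
    (in a real system the concat is an all-gather over NCCL).
    """
    shards = np.split(w, n_shards, axis=1)          # split output columns
    partial = [x @ s for s in shards]               # per-device matmul
    return np.concatenate(partial, axis=-1)         # all-gather equivalent

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))
    w = rng.standard_normal((8, 16))
    assert np.allclose(column_parallel_linear(x, w), x @ w)
    print("sharded result matches the single-device matmul")
```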
- ⭐ Efficiently Scaling Transformer Inference: use model parallelism to accelerate inference, by Google, in MLSys'23
- HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment: a distributed inference engine that supports asymmetric partitioning of the inference computation
- InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding: Efficient Long-sequence training
- Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference: accepted by PPoPP'24
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs: full-stack approach of LLM training
- DSP: Dynamic Sequence Parallelism for Multi-Dimensional Transformers: sequence parallel by Yang YOU
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism: Elastic Sequence Parallelism?
- GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism: this could be potential in inference
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models: pipeline parallelism
- QUART: Latency-Aware FaaS System for Pipelining Large Model Inference: pipeline in serving and fast expanding
- Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations: optimize sequence parallel
- CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts: optimize sequence parallel
- ⭐ PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation: pipeline parallelism and speculation, accepted by SC'24
- HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment: algorithm analysis for resource allocation, parallel strategy, and KV transfer in disaggregated LLM systems
- ADOR: A Design Exploration Framework for LLM Serving with Enhanced Latency and Throughput: explores design spaces to suggest architectures that meet the requirements of both vendors and users
- Seesaw: High-throughput LLM Inference via Model Re-sharding: dynamic model re-sharding, facilitates the dynamic reconfiguration of parallelization strategies across prefill-decode stages, accepted by MLSYS'25
- PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training: fill the bubbles with other GPU workload
- Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models: overlap comm with comp, similar to Liger
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning: accepted by ASPLOS'24
- T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives: many work about overlap in LLM, accepted by ASPLOS'24
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion: Fine-grained decomposition, perhaps provide some experiment result
- Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference: modify the model design for fast decoding, based on comm-comp overlapping
- NanoFlow: Towards Optimal Large Language Model Serving Throughput: overlapping based on nano-batches, with some interesting engineering implementation
- Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping: overlapping, provided by Deepspeed team
- PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving: overlap communication with model-weights/KV-cache prefetch
- Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning: use compilation to schedule overlap, accepted by ASPLOS'25
An enduring topic in efficient machine learning.
We mainly focus on semi-structured and structured pruning because they can accelerate computing.
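For intuition, a toy numpy sketch of 2:4 (N:M) semi-structured pruning, the pattern that NVIDIA sparse tensor cores can accelerate: in every group of 4 weights, keep the 2 with the largest magnitude. This is my own illustration; the works below also care about how to recover accuracy after pruning.

```python
import numpy as np

def prune_2_to_4(w):
    """Zero out the 2 smallest-magnitude weights in every group of 4
    (along the last dimension), i.e. the 2:4 pattern accelerated by
    sparse tensor cores. Assumes the last dim is divisible by 4.
    """
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest |w| in each group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return (groups * mask).reshape(w.shape)

if __name__ == "__main__":
    w = np.array([[0.1, -2.0, 0.3, 4.0, -0.2, 0.05, 1.5, -3.0]])
    print(prune_2_to_4(w))
    # every consecutive group of 4 keeps only its 2 largest-magnitude entries
```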
- ⭐ Accelerating Sparse Deep Neural Networks: use N:M sparsity to fully utilize the hardware for acceleration, by Nvidia
- ⭐ Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time: interesting paper in using sparsity, under the guidance of Tri DAO and Ce ZHANG, accepted in ICML'23
- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
- Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism: accepted by PPoPP'23
- ⭐ PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation: a novel way to deal with dynamic sparsity, may be used for GNN and MoE, accepted by SOSP'23
- DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving: seems a follow-up work of Deja Vu, also focuses on KV-Cache
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference: sparsity in FFN
- ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models: a simple and effective sparsification method named "ProSparse"
- Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters: work for PowerInfer
- Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations: pruning for LLM
- MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention: inference framework based on sparse attention, by Microsoft
- ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models: use ReLU to improve sparsity, just like PowerInfer
- CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation: algorithm optimization that can utilize sparsity to accelerate inference
- Star Attention: Efficient LLM Inference over Long Sequences: a two-phase block-sparse approximation
- Lexico: Extreme KV Cache Compression via Sparse Coding over Universal Dictionaries: use sparse coding over universal dictionaries to compress KV cache, which is novel
- SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters: an algorithm that replaces a layer with the previous adjacent layer plus recovery parameters (based on finetuning), to decrease memory overhead
- Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking: accepted by MLSYS'25
Low-precision for memory and computing efficiency.
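A minimal sketch of symmetric per-row absmax INT8 weight quantization with a reference dequantize-then-matmul (my own illustration; the papers below add outlier handling, activation/KV-cache quantization, group-wise scales, and fused kernels).

```python
import numpy as np

def quantize_rows_int8(w):
    """Symmetric per-row absmax quantization: w ≈ q * scale, q in int8."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequant_matmul(x, q, scale):
    """Reference (slow) dequantize-then-matmul; real kernels fuse this."""
    return x @ (q.astype(np.float32) * scale).T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((16, 64)).astype(np.float32)   # [out, in]
    x = rng.standard_normal((2, 64)).astype(np.float32)
    q, s = quantize_rows_int8(w)
    err = np.abs(x @ w.T - dequant_matmul(x, q, s)).max()
    print(f"int8 weight memory: {q.nbytes} bytes, max abs error: {err:.4f}")
```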
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization
- ⭐ LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: by UW
- ⭐ SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models: paper under guidance of Song HAN
- ⭐ AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration: paper under guidance of Song HAN
- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving: paper under guidance of Tianqi CHEN; quantization itself is not important, designing how to quantize is important; in review for MLSys'24
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
- Understanding the Impact of Post-Training Quantization on Large Language Models: tech report will help
- ⭐ LLM-FP4: 4-Bit Floating-Point Quantized Transformers: by HKUST, accepted in EMNLP'23
- ⭐ Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization: by SJTU, accepted in DAC'24
- INT4 Weight + FP8 KV-Cache: optimization for LLM inference: INT4 weight + FP8 KV-Cache + continuous batching
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization: quant KV cache
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference: simple and crude optimization work
- LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization: for heterogeneous clusters and adaptive quantization, under guidance of Chuan WU, accepted by PPoPP'24 (poster)
- IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact: use pivot token
- QAQ: Quality Adaptive Quantization for LLM KV Cache
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving: quantization in inference, under guidance of Song HAN
- Does compressing activations help model parallel training?: analysis of compression (including pruning and quantization) in MP training, accepted by MLSys'24
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression: compress KV cache with quantization
- Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs: with targeted activate function
- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design: FPx quantization, accepted by ATC'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: combine quantization with MoE
- PQCache: Product Quantization-based KVCache for Long Context LLM Inference: apply quantization and Maximum Inner-Product Search for KV Cache compression
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs: provide efficient kernels for lookup quantization
- Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation: a computation optimization for Low-Precision
- Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs: a computation optimization for 6-bit LLM
- Mixture of Experts with Mixture of Precisions for Tuning Quality of Service: quantization on MoE models
- Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference: compress the KV Cache
- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models: quantization matrix multiplication of arbitrary precision combinations based on BTC (Binary TensorCore) equivalents
- Progressive Mixed-Precision Decoding for Efficient LLM Inference: gradual lowering of precision deeper in the generated sequence, together with a spectrum of precision-switching schedulers
- COMET: Towards Practical W4A4KV4 LLMs Serving: provides a quantization algorithm, quantization kernels, and an SM schedule method
- MixQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction: quantization with outliers, optimization on AWQ, accepted by SC'24
- Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference: low-bit compression to accelerate communication
- Unifying KV Cache Compression for Large Language Models with LeanKV: combine quantization and sparsity to compress KV cache
- MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design: mix quantization, effectively assigning the larger bit-width to output features that need it most to achieve good accuracy with low memory consumption
- KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference: KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference
- HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference: quantization to decrease kvc transfer overhead in disaggregation and eliminate kv dequantization
- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models: Mixed-precision Auto-Regressive LINear kernels, accepted by PPoPP'25
- MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators: augments highly quantized MoEs with a mixture of low-rank compensators, provide 3-bit tensorcore kernels, accepted by MLSYS'25
Perhaps the most important way of improving throughput in LLM inference.
This blog Dissecting Batching Effects in GPT Inference helped me a lot at the beginning.
Update 2023/12/12: I'd like to use Continuous Batching in place of the Dynamic Batching I used before. The name Dynamic Batching is more likely to be used in Triton.
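A toy sketch of iteration-level (continuous) batching: finished requests leave the batch and waiting requests join at every decode step, instead of waiting for a whole static batch to finish. `step` is a hypothetical engine call; real schedulers (ORCA, vLLM, Sarathi) also handle prefill, memory limits, and preemption.

```python
from collections import deque

def continuous_batching(step, waiting, max_batch=8):
    """Iteration-level scheduling sketch.

    `waiting` is a queue of requests; each request is a dict with a `tokens`
    list and a `done` flag. `step(batch)` is a hypothetical engine call that
    appends one new token to every request in the batch and may set `done`.
    """
    waiting = deque(waiting)
    running, finished = [], []
    while running or waiting:
        # admit new requests at *every* iteration, not once per full batch
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        step(running)                          # one decode iteration
        still_running = []
        for req in running:
            (finished if req["done"] else still_running).append(req)
        running = still_running
    return finished

# toy usage: each request needs a different number of decode steps
if __name__ == "__main__":
    def fake_step(batch):
        for r in batch:
            r["tokens"].append(len(r["tokens"]))
            r["done"] = len(r["tokens"]) >= r["target_len"]

    reqs = [{"tokens": [], "done": False, "target_len": n} for n in (2, 5, 3)]
    done = continuous_batching(fake_step, reqs, max_batch=2)
    print([len(r["tokens"]) for r in done])   # -> [2, 5, 3], in completion order
```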
- ⭐ Orca: A Distributed Serving System for Transformer-Based Generative Models: continuous batch processing without redundant computing, accepted in OSDI'22
- Fast Distributed Inference Serving for Large Language Models: considering Job Completion Time(JCT) in LLM serving, paper under guidance of Xin JIN
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline: schedule based on response length prediction by LLM, paper under guidance of Yang YOU
- S3: Increasing GPU Utilization during Generative Inference for Higher Throughput: idea similar to above, by Harvard University
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills: chunking the prefill phase to reduce pipeline bubbles, by MSR India
- Flover: A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference: accepted by HiPC'23
- Handling heavy-tailed input of transformer inference on GPUs: accepted by ICS'22
- CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system: Some form of inference service
- TCB: Accelerating Transformer Inference Services with Request Concatenation: perhaps similar to ByteTransformer, accepted by ICPP'22
- Fairness in Serving Large Language Models: under guidance of Ion Stoica, accepted by OSDI'24
- Characterizing and understanding deep neural network batching systems on GPUs: benchmarking is important
- Hydragen: High-Throughput LLM Inference with Shared Prefixes
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts: think about the memory access of KV cache
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve: follow-up work of sarathi
- Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction: predict length
- LiveMind: Low-latency Large Language Models with Simultaneous Inference: perform inferences with incomplete prompts, to take advantage of streaming prompt
- A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length: theoretical analysis of latency
- ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG
- Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models: seems similar to ORCA or bytetransformer?
- BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching: optimization on ORCA, dynamic re-batching
- EcoServe: Maximizing Multi-Resource Utilization with SLO Guarantees in LLM Serving: A fusion monster with a variety of optimization techniques
- ⭐ AcceLLM: Accelerating LLM Inference using Redundancy for Load Balancing and Data Locality: what's Redundancy
This part includes some impressive works that optimize LLM computation by observing the underlying computing properties, such as FlashAttention.
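To see why IO-aware attention helps, here is a pure-numpy sketch of the online (streaming) softmax underlying FlashAttention-style kernels: K/V are processed block by block with a running max and running sum, so the full attention matrix is never materialized. This is a reference illustration of the trick, not a fast kernel.

```python
import numpy as np

def online_softmax_attention(q, K, V, block=64):
    """Single-query attention computed over K/V blocks with a running
    (max, denominator, weighted-V) state. Pure numpy reference."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running softmax denominator
    acc = np.zeros_like(V[0], dtype=np.float64)   # running weighted-V numerator
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = (k_blk @ q) / np.sqrt(d)              # scores for this block
        m_new = max(m, s.max())
        correction = np.exp(m - m_new)            # rescale the old state
        p = np.exp(s - m_new)
        l = l * correction + p.sum()
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / l

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal(64)
    K, V = rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
    s = (K @ q) / np.sqrt(64)
    ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
    assert np.allclose(online_softmax_attention(q, K, V), ref)
    print("streaming softmax matches the reference attention output")
```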
- ⭐ FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: one of the most important work these years, both simple and easy to use, by Tri DAO
- ⭐ FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning: you'd better not ignore it
- ⭐ Flash-Decoding for long-context inference: you'd better not ignore it, too
- ⭐ Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity: successor to FlashAttention in inference, accepted by VLDB'24
- ⭐ FlashDecoding++: Faster Large Language Model Inference on GPUs: worth reading, FLashDecoding follow-up
- SubGen: Token Generation in Sublinear Time and Memory
- DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers: modification in self-attention
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
- Flex Attention: A Programming Model for Generating Optimized Attention Kernels: auto-generated attention kernel
- Splitwise: Efficient generative LLM inference using phase splitting: splitting prefill and decode in a map-reduce style, by UW and Microsoft
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving: also split the prefill and decode, accepted by OSDI'24
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads: seems a combination of SARATHI and Splitwise
- ExeGPT: Constraint-Aware Resource Scheduling for LLM Inference: similar to splitwise, accepted by ASPLOS'24
- Splitwiser: Efficient LLM Inference with Constrained Resources
- ToEx: Accelerating Generation Stage of Transformer-based Language Models via Token-adaptive Early Exit: Token-adaptive Early Exit
- Automatic Task Parallelization of Dataflow Graphs in ML/DL models
- MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures: compilation optimization on the computation graph
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference: optimize attention kernel in mix-batching
- Focus: High-Performant and Customizable Attention Engine for LLM Serving: flexible attention engine, advised by Tianqi CHEN and accepted by MLSYS'25
This part is inspired by PagedAttention of vLLM. There are many top-conference papers discussing memory management for DL computing on GPUs.
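A toy sketch of the block-table bookkeeping behind paged KV caches: blocks come from a shared pool and each sequence holds a list of block IDs, so memory is allocated on demand instead of being reserved for the maximum length. Only the bookkeeping is shown; storing and indexing the actual K/V tensors is the kernel's job.

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of vLLM's PagedAttention."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))     # shared physical pool
        self.block_tables = {}                         # seq_id -> [block ids]
        self.lengths = {}                              # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve space for one more token of `seq_id`, allocating a new
        block only when the last block is full."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:              # last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted; preempt or swap")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

if __name__ == "__main__":
    kv = PagedKVCache(num_blocks=4, block_size=4)
    for _ in range(6):
        kv.append_token("req0")                        # 6 tokens -> 2 blocks
    print(kv.block_tables["req0"], len(kv.free_blocks))  # 2 used, 2 free
    kv.free("req0")
    print(len(kv.free_blocks))                         # all 4 back in the pool
```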
- ⭐ Efficient Memory Management for Large Language Model Serving with PagedAttention: memory page management for the KV-Cache in Attention-type models, accepted by SOSP'23 (many papers cite the vLLM project instead of this paper, which makes it harder to track its "cited by")
- ⭐ AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs: cache management for inference, accepted by MLSys'23
- Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs: block-based data layout, accepted by TACO'October-2023
- AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems: a unique observation that there is rich similarity in attention computation across inference sequences
- BPIPE: memory-balanced pipeline parallelism for training large language models: memory balance perhaps can work well in inference, by SNU, accepted by ICML'23
- Improving Large Language Model Throughput with Efficient LongTerm Memory Management: perhaps a new view
- CacheGen: Fast Context Loading for Language Model Applications
- Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models: considers the memory consumption in fine-tuning
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference: compress KV Cache
- LLM as a System Service on Mobile Devices: LLM as a service on mobile devices
- DistMind: Efficient Resource Disaggregation for Deep Learning Workloads: by Xin JIN, accepted by ToN'Jan24
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching: sparsity in KV Cache, accepted by ISCA'24
- AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving: a hierarchical KV caching system that leverages cost-effective memory/storage mediums to save KV caches for all requests
- vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention: improve PagedAttention
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models: only computes and caches the KVs of a small number of layers
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models: compress KV cache
- CacheBlend: Fast Large Language Model Serving with Cached Knowledge Fusion: very popular idea recently
- Block Transformer: Global-to-Local Language Modeling for Fast Inference: build KV Cache blocks from many tokens' KV Cache
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool: KV Cache management in P/D disaggregation architecture
- Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention: multi-round chat and memory management, accepted by ATC'24
- Stateful Large Language Model Serving with Pensieve: similar to CachedAttention
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving: P/D disaggregation architecture and KV Cache management
- P/D-Serve: Serving Disaggregated Large Language Model at Scale: a P/D based system, with D2D access optimization
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management: offload KV Cache
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption: a survey on optimizing KV Cache
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving: tensor management especially for LLM inference
- Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation: remove unimportant tokens in KV Cache
- CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving: compression and streaming transfer of KV Cache, accepted by SIGCOMM'24
- Compute Or Load KV Cache? Why Not Both?: recompute and load together for long context
- LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management: manage KV Cache by layers
- Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching: compress KV cache and multi-level memory
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models: better prefix caching
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference: low-rank KV cache and dynamically rebuilt KV cache
- ⭐ VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration: the first work I have seen that optimizes KV cache in vision models
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction: KV cache page eviction and recall, accepted by NIPS'24
- SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation: optimization on ZeRO? redesigns the data flow of heterogeneous hardware and sharded model training to minimize the excessive communication overhead, accepted by NIPS'24
- ⭐ KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management: memory management for KV cache and parameters, seems a novel work considering weight migration
- SYMPHONY: Improving Memory Management for LLM Inference Workloads: dynamically migrates K,V caches to enable fine-grained scheduling of inference requests
- Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management: efficiently migrate requests and their KV cache among GPUs
- Efficient LLM Inference with Activation Checkpointing and Hybrid Caching: recompute+cache for KV cache management, only recomputes attention (no projection)
- Memory Offloading for Large Language Model Inference with Latency SLO Guarantees: offload KV cache to CPU memory
- Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving: sparse attention is hot recently; dynamic KV cache budget and efficient KV cache loading from CPU
- Efficient and scalable huge embedding model training via distributed cache management: cache based on staleness and skewed popularity distributions
- BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference: different KV heads have different importance, then offload and compress
- Fast State Restoration in LLM Serving with HCache: cache for offloading KV cache to CPU, accepted by EuroSys'25
note: some papers about prefix sharing are not in this section
- LLM Query Scheduling with Prefix Reuse and Latency Constraints: balancing prefix reuse and fairness in query scheduling
- Reducing shared memory footprint to leverage high throughput on Tensor Cores and its flexible API extension library: implement some APIs to reduce the shared memory footprint, accepted in HPC Asia'23
- Benchmarking and Dissecting the Nvidia Hopper GPU Architecture: help us understand GPUs
- SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving: optimizing energy consumption by lowering GPU frequency
- Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference: similar to cutlass, optimization on intel GPU
- Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels: for Ascend accelerators (perhaps also works for NVIDIA?)
Heterogeneous scenarios or single PCs are becoming increasingly important.
Optimizing the computation on CPUs or SSDs calls for different methods.
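A toy sketch of the overlap trick many of the systems below rely on: while layer i is being computed, a background thread prefetches layer i+1's weights from slow memory (CPU RAM or SSD). `load` and `compute` are hypothetical stand-ins; real systems use pinned memory, CUDA streams, and careful scheduling instead of Python threads.

```python
import threading
import numpy as np

def run_offloaded(n_layers, x, load, compute):
    """Compute layer i while a background thread prefetches layer i+1's
    weights, the basic compute/IO overlap used by offloading systems.

    `load(i)` fetches layer i's weights from slow memory (CPU RAM / SSD) and
    `compute(x, w)` applies one layer; both are hypothetical stand-ins.
    """
    current = load(0)
    for i in range(n_layers):
        prefetched, worker = {}, None
        if i + 1 < n_layers:
            worker = threading.Thread(
                target=lambda j=i + 1: prefetched.update(w=load(j)))
            worker.start()                 # overlap the next load with compute
        x = compute(x, current)            # "GPU" work for layer i
        if worker is not None:
            worker.join()
            current = prefetched["w"]
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((32, 32)) for _ in range(4)]
    y = run_offloaded(
        n_layers=len(weights),
        x=rng.standard_normal((1, 32)),
        load=lambda i: weights[i],              # pretend this reads from SSD
        compute=lambda x, w: np.tanh(x @ w),
    )
    print(y.shape)    # (1, 32)
```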
- Efficient LLM Inference on CPUs: LLMs with quantization on CPUs, by Intel, accepted by NIPS'23
- Inference Performance Optimization for Large Language Models on CPUs: xFasterTransformer, LLM inference optimization on CPUs, by Intel
- Distributed Inference Performance Optimization for LLMs on CPUs: similar work to the above, by Intel
- Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference: inference on CPUs based on advanced hardware
- TURNIP: A "Nondeterministic" GPU Runtime with CPU RAM Offload: free to run operations such as GPU kernel calls in many different orders
- Improving Throughput-oriented Generative Inference with CPUs: cooperation of CPUs and GPUs, accepted by APSys'23
- Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs: execute the operators on the CPU and GPU in parallel, by SJTU
- EdgeNN: Efficient Neural Network Inference for CPU-GPU Integrated Edge Devices: inference on edge devices, accepted by ICDE'23
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: by SJTU IPADS
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory: by Apple
- Efficient LLM inference solution on Intel GPU: Intel GPUs are interesting
- FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines: efficient serving with a CPU-GPU system
- Efficient and Economic Large Language Model Inference with Attention Offloading: similar to FastDecode
- Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference: similar to FastDecode: CPU for attention and GPU for the rest
- Petals: Collaborative Inference and Fine-tuning of Large Models: looks like heterogeneous resources are being utilized
- NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention
- ⭐ A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors: use CPUs for DL, accepted by ASPLOS'24
- LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control: based on offloading
- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge: computation on CPUs with quantization
- TBA: Faster Large Language Model Training Using SSD-Based Activation Offloading: how to use SSDs?
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference: offload KV Cache to CSDs (Computational Storage Drives)
- TwinPilots: A New Computing Paradigm for GPU-CPU Parallel LLM Inference: some ideas on using CPUs
- Improving Throughput-oriented LLM Inference with CPU Computations: pipelining in CPU-GPU inference
- Understanding Performance Implications of LLM Inference on CPUs: analysis of using CPUs for inference
- GPUs, CPUs, and... NICs: Rethinking the Network's Role in Serving Complex AI Pipelines: NICs can be important, especially in communication
- Pie: Pooling CPU Memory for LLM Inference: use CPU memory to enlarge batch size and improve throughput, by Ion Stoica
- NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference: offload KV cache and attention to CPU for larger batch sizes, similar to FastDecode, by Ion Stoica
- Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems: more like inference on personal devices
- Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation: use recomputation and transfer to re-produce the KV cache; can use their runtime and split parallelism
Inspired by AI PCs, this opens up a new area.
Edge systems are now included too.
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU: inference a 30B model with a 16GB GPU, accepted by ICML'23
- LLM as a System Service on Mobile Devices: an intro for LLM on private devices
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU: based on sparsity in NN Layers
- ⭐ LLM for Mobile: An Initial Roadmap: a road map
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone: work on smartphone
- Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM: on edge devices, accepted by MICRO'24
- ⭐ HeteroLLM: Accelerating Large Language Model Inference on Mobile SoCs with Heterogeneous AI Accelerators: features on mobile SoCs, tensor partition strategy, to do Heterogeneous AI inference
- PICE: A Semantic-Driven Progressive Inference System for LLM Serving in Cloud-Edge Networks: cloud(LLM)-edge(SmallLM) collaboration
- FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference: offloading based framework, asynchronous prefetching, balanced memory locking, and flexible tensor preservation
- Fast On-device LLM Inference with NPUs: chunked prefill, offload outlier to CPU/GPU, schedule computation to NPU/CPU/GPU, accepted by ASPLOS'25
- FlexInfer: Flexible LLM Inference with CPU Computations: offload kvc and weights to CPU, accepted by MLSYS'25
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs: decentralized system on consumer-level GPUs, though there will be some problems
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet: some techniques in this paper will be instructive
- ⭐ HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices: heterogeneous parallel computing using CPUs and GPUs
- Metis: Fast Automatic Distributed Training on Heterogeneous GPUs: accepted by ATC'24
- Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs: we can get a performance model for heterogeneous GPU clusters and learn from the algorithm analysis
- Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity: making heterogeneity-aware GPU provisioning decisions for LLM serving
In this part, researchers provide algorithm-based methods to optimize LLM inference.
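Several entries below (H2O, Scissorhands, Keyformer) build on the "heavy hitter" observation: a few tokens receive most of the attention mass. Here is a toy sketch of budget-based KV eviction in that spirit (my own simplification, not the exact policy of any of these papers).

```python
import numpy as np

def evict_heavy_hitters(acc_scores, budget, keep_recent=4):
    """Return indices of KV entries to keep under a cache budget.

    `acc_scores[i]` is the attention mass token i has accumulated over past
    decode steps. Always keep the `keep_recent` newest tokens and fill the
    rest of the budget with the highest-scoring "heavy hitter" tokens.
    """
    assert budget >= keep_recent
    n = len(acc_scores)
    if n <= budget:
        return np.arange(n)
    recent = np.arange(n - keep_recent, n)
    older = np.arange(n - keep_recent)
    n_heavy = budget - keep_recent
    heavy = older[np.argsort(acc_scores[older])[-n_heavy:]]
    return np.sort(np.concatenate([heavy, recent]))

if __name__ == "__main__":
    scores = np.array([5.0, 0.1, 3.2, 0.05, 0.2, 2.9, 0.01, 0.3])
    print(evict_heavy_hitters(scores, budget=5, keep_recent=2))
    # -> keeps the two newest tokens plus the three strongest older ones
```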
- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models: accepted by NIPS'23
- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time: consider the different importance of tokens in KV Cache, similar to H2O
- ⭐ SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference: skipping may be a useful method, like spec decoding
- Inference with Reference: Lossless Acceleration of Large Language Models: also a potential optimization
- Efficient Streaming Language Models with Attention Sinks: streaming LLM for infinite sequence lengths, by MIT and under guidance of Song HAN
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference: also important tokens, just like H2O, accepted by MLSys'24
- Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache: an optimization to H2O, accepted by MLSys'24
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval: use approximate nearest neighbor search to search the most relevant KV cache
- CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs: based on observation: adjacent query tokens tend to focus on similar subsets of the past Key-Value (KV) cache
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention: sparse attention
- SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation: algorithm optimization for less KV Cache
- Activation Sequence Caching: High-Throughput and Memory-Efficient Generative Inference with a Single GPU: use characterization results to optimize KV Cache management
- ⭐ DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale: you must know DeepSpeed
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
- DeepSpeed Model Implementations for Inference (MII)
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs: developed by ByteDance, accepted by IPDPS'23
- TurboTransformers: an efficient GPU serving system for transformer models: by Tencent Inc, accepted by PPoPP'21
- Accelerating Generative AI with PyTorch II: GPT, Fast: a blog in PyTorch, use only PyTorch code, gpt-fast
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving: based on FlexFlow
- FlashInfer: Kernel Library for LLM Serving
- Efficiently Programming Large Language Models using SGLang: we can get some optimization from here
- Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models: different parallel, by Tencent
LLM server providers will focus on this part. Engineering practices are just as important as algorithm optimization.
- ⭐ AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving: accepted by OSDI'23
- ⭐ STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining: elasticity will be important in the future, accepted by ASPLOS'23
- INFaaS: Automated Model-less Inference Serving: accepted by ATC'21
- Tabi: An Efficient Multi-Level Inference System for Large Language Models: under the guidance of Kai CHEN, accepted by EuroSys'23
- FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance: cost is what the service provider cares about most
- FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
- Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning: accepted by NSDI'23
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud: model ensembling, accepted by NSDI'22
- SLA-Driven ML Inference Framework for Clouds with Heterogeneous Accelerators: accepted by MLSys'22
- FaST-GShare: Enabling Efficient Spatio-Temporal GPU Sharing in Serverless Computing for Deep Learning Inference: accepted by ICPP'23
- Flashpoint: A Low-latency Serverless Platform for Deep Learning Inference Serving
- BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching: accepted by SC'20
- MArk: exploiting cloud services for cost-effective, SLO-aware machine learning inference serving: accepted by ATC'19
- ⭐ MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters: challenges and solutions in real-world scenarios, accepted by NSDI'22
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads: under the guidance of Ion Stoica
- Learned Best-Effort LLM Serving: a best-effort serving system by UCB
- Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences: accepted by OSDI'22, enables microsecond-scale kernel preemption and controlled concurrent execution in GPU scheduling
- PipeSwitch: fast pipelined context switching for deep learning applications: PipeSwitch, a system that enables unused cycles of an inference application to be filled by training or other inference applications, accepted by OSDI'20
- ⭐ Paella: Low-latency Model Serving with Software-defined GPU Scheduling: how tasks are scheduled to GPUs, accepted by SOSP'23
- OTAS: An Elastic Transformer Serving System via Token Adaptation: elasticity in serving while considering SLOs
- DeltaZip: Multi-Tenant Language Model Serving via Delta Compression: multi-tenancy is interesting
- ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models: finds different problems in serving LLMs
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access: accepted by EuroSys'23
- Towards Pareto Optimal Throughput in Small Language Model Serving: small language model serving
- MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
- Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services: idea of QoE
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving: similar to FlexLLM
- LLMServingSim: A Simulation Infrastructure for LLM Inference Serving Systems: provides some features about LLM serving
- Slice-Level Scheduling for High Throughput and Load Balanced LLM Serving: improvements to ORCA (SLS) and FastServe (ILS)
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems: consider serving efficiency from the energy view
- Power-aware Deep Learning Model Serving with μ-Serve: consider energy
- Eloquent: A More Robust Transmission Scheme for LLM Token Streaming: a new token transmission scheme, useful in chatbots
- Responsive ML inference in multi-tenanted environments using AQUA: serves several LLMs by time-sharing GPU cycles and offloading context to other GPUs in multi-tenant environments
- Towards SLO-Optimized LLM Serving via Automatic Inference Engine Tuning: effect of hyper-parameters in the inference engine
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: request scheduling
- Efficient LLM Scheduling by Learning to Rank: rank requests based on output length prediction and schedule accordingly
- UELLM: A Unified and Efficient Approach for LLM Inference Serving: serving optimization in MaaS clouds
- One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving: scheduling the requests
- ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving: harvest stranded GPU resources for offline LLM inference tasks
- LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services: accepted by SC'24
- Revisiting SLO and Goodput Metrics in LLM Serving: check the SLO and goodput metrics in LLM serving
- Hops: Fine-grained heterogeneous sensing, efficient and fair Deep Learning cluster scheduling system: schedule tasks in a multi-tenant deep learning (DL) cluster, accepted by SoCC'24
- ⭐ Ensuring Fair LLM Serving Amid Diverse Applications: ensures fair LLM access across diverse applications, with a Copilot trace analysis
- BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching: exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing
- BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching: similar to BlendServe
- iServe: An Intent-based Serving System for LLMs: use a cost model to dynamically set the deployment configuration
- TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms: seems a practical engineering work; takes temperature and power consumption into account
- ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments: a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments, and fluctuating online conditions
- ⭐ A System for Microserving of LLMs: seems an idea and industrial practice that makes sense
- DeepFlow: Serverless Large Language Model Serving at Scale: provide fine-grained LLM service
- ⭐ Towards Swift Serverless LLM Cold Starts with ParaServe: pipeline parallelism with dynamically adjusted parallelism strategies, to accelerate cold starts
- λScale: Enabling Fast Scaling for Serverless Large Language Model Inference: a serverless inference system that achieves fast model scaling via fast model multicast, inference execution during model transmission, and dynamically constructed execution pipelines
- Medusa: Accelerating Serverless LLM Inference with Materialization: targets the cold start of serverless LLM serving, solving the problems of profiling available KV cache blocks and CUDA graph capture, accepted by ASPLOS'25
- Enabling Elastic Model Serving with MultiWorld: optimizing collective communication lib for LLM inference
- Flexible Scheduling of Network and Computing Resources for Distributed AI Tasks
- AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive: communication strategy adapted at runtime, ICDCS'24
- Crux: GPU-Efficient Communication Scheduling for Deep Learning Training: a communication scheduler that aims to maximize GPU computation utilization by mitigating the communication contention among DLT jobs, SIGCOMM'24
- TENPLEX: Changing Resources of Deep Learning Jobs using Parallelizable Tensor Collections: by Luo MAI, similar to SpotServe?
- SpotServe: Serving Generative Large Language Models on Preemptible Instances: by Xupeng MIAO and under guidance of Zhihao JIA
- Parcae: Proactive, Liveput-Optimized DNN Training on Preemptible Instances: by team of SpotServe
- FaPES: Enabling Efficient Elastic Scaling for Serverless Machine Learning Platforms: a FaaS-oriented Performance-aware Elastic Scaling system to enable efficient resource allocation in serverless platforms for ML jobs, accepted by SoCC'24
- Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale: resource allocation at cluster and data center scale
- Compass: A Decentralized Scheduler for Latency-Sensitive ML Workflows: scheduler for latency-sensitive request
- Llumnix: Dynamic Scheduling for Large Language Model Serving: scheduling across multiple instances may be helpful for me now
- Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths: solve Dynamic Input Lengths by multi-instance and request scheduling
- Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling: scheduling based on a output length predictor
- Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs: request scheduling in cluster and on instance
- Fast Inference for Augmented Large Language Models: schedule for Augmented LLM
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling: a hodgepodge of prediction-based scheduling, memory management, and quantization
- The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving: cost model in request scheduling
- Queue Management for SLO-Oriented Large Language Model Serving: scheduling for requests with different models and different SLO requirements
- FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving: fairness and request switch
- HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location: request co-location to maximize serving throughput and prevent starvation, without compromising online serving latency
- Locality-aware Fair Scheduling in LLM Serving
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition: shares prefixes and optimizes the KV cache
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters: the beginning of LoRA serving, under the guidance of Ion Stoica, accepted by MLSys'24 (a minimal multi-LoRA batching sketch follows this list section)
- Dynamic LoRA Serving System for Offline Context Learning: successor of S-LoRA
- CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference: serving LoRA is becoming more and more important
- Punica: Multi-Tenant LoRA Serving: accepted by MLSys'24
- Petals: Collaborative Inference and Fine-tuning of Large Models
- LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design: maybe useful, kernel optimization
- dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving: accepted by OSDI'24
- Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU: optimize SGMV kernels
- V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM: LoRA for vision models, and optimize LoRA kernels, accepted by EuroSys'25
- Efficient Multi-task LLM Quantization and Serving for Multiple LoRA Adapters: facilitates the sharing of a single quantized model for multiple LoRA adapters, accepted by NIPS'24
- Comparative Analysis and Optimization of LoRA Adapter Co-serving for Large Language Models: more like a survey for LoRA serving
- DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs: compresses model deltas to serve multiple full-parameter fine-tuned models (maybe not LoRA fine-tuning?)
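Most of the LoRA-serving papers above batch requests that target different adapters over one shared base model. A minimal sketch of that idea, assuming illustrative adapter names and toy shapes (not the S-LoRA/Punica kernels):

```python
# Minimal sketch (not the S-LoRA/Punica implementation): batched inference where
# each request in the batch uses its own LoRA adapter. The base GEMM is shared;
# the low-rank deltas are applied per request. Adapter names/shapes are illustrative.
import numpy as np

d_in, d_out, rank = 1024, 1024, 16
W = np.random.randn(d_in, d_out) * 0.01          # shared base weight

# A registry of adapters: adapter_id -> (A, B) with delta = x @ A @ B
adapters = {
    "lora_math": (np.random.randn(d_in, rank) * 0.01, np.random.randn(rank, d_out) * 0.01),
    "lora_code": (np.random.randn(d_in, rank) * 0.01, np.random.randn(rank, d_out) * 0.01),
}

def lora_batched_forward(x, adapter_ids):
    """x: [batch, d_in]; adapter_ids: per-request adapter choice."""
    y = x @ W                                     # one shared GEMM for the whole batch
    for i, aid in enumerate(adapter_ids):         # per-request low-rank correction
        A, B = adapters[aid]
        y[i] += (x[i] @ A) @ B                    # rank-r update, cheap compared to x @ W
    return y

batch = np.random.randn(2, d_in)
out = lora_batched_forward(batch, ["lora_math", "lora_code"])
print(out.shape)  # (2, 1024)
```

Real systems replace the per-request loop with a gathered kernel (e.g., SGMV) and page adapters between host and GPU memory; the sketch only shows the dataflow.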
For LoRA but not serving
- ASPEN: High-Throughput LoRA Fine-Tuning of Large Language Models with a Single GPU
- LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin: potential new style of LoRA
- Higher Layers Need More LoRA Experts
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
- FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning: how to find novel questions?
- LoRA Meets Dropout under a Unified Framework: Analyze LoRA algorithmically
- HydraLoRA: An Asymmetric LoRA Architecture for Efficient Fine-Tuning: algorithm optimization for LoRA
- SBoRA: Low-Rank Adaptation with Regional Weight Updates: an algorithm optimization for LoRA
- A Survey on LoRA of Large Language Models: survey of LoRAs, including parallel LoRA computing and Multi-LoRA, github
- mLoRA: Fine-Tuning LoRA Adapters via Highly-Efficient Pipeline Parallelism in Multiple GPUs: can study the LoRA-aware pipeline parallelism scheme, github
- MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts: LoRA based MoE, github
- GongBu: Easily Fine-tuning LLMs for Domain-specific Adaptation: LLM fine-tuning tools
- Adapters Selector: Cross-domains and Multi-tasks LoRA Modules Integration Usage Method: selects several LoRAs for a given piece of content
- SplitLLM: Hierarchical Split Learning for Large Language Model over Wireless Network: split learning(?); trains LoRA weights in a wireless-network environment, storing the LoRA weights on edge servers?
- Revolutionizing Large Model Fine-Tuning: The Role of LoRA in Parameter-Efficient Adaptation: a survey, can provide some reference
- HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression: optimizes fine-tuning memory overhead via quantization, accepted by MLSys'25
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving
- Latency-Guaranteed Co-Location of Inference and Training for Reducing Data Center Expenses: place training and inference together, control the inference latency to the desired SLO, while maximizing the throughput of the training jobs co-located on the same GPUs, accepted by ICDCS'24
Long-context inference is a hot topic recently.
- Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference: like an update to H2O or DejaVu, et al.; each attention head gets a different memory budget
- Context Parallelism for Scalable Million-Token Inference
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection: selects important KV cache entries to take part in the attention computation (see the sketch below)
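The selection-based long-context papers above share one mechanism: score the cached keys against the current query and attend only over the most important positions. A toy sketch of that mechanism, with an assumed top-k criterion rather than any specific paper's selection rule:

```python
# Minimal sketch of token-level KV cache selection for long-context decoding:
# score cached keys against the current query and attend only over the top-k.
# This illustrates the general idea, not any specific paper's selection rule.
import numpy as np

def selective_attention(q, K, V, k_keep=256):
    """q: [d]; K, V: [seq, d]; keep only the k_keep highest-scoring positions."""
    scores = K @ q / np.sqrt(q.shape[0])          # similarity of each cached key to the query
    if K.shape[0] > k_keep:
        keep = np.argpartition(scores, -k_keep)[-k_keep:]   # indices of important tokens
        scores, K_sel, V_sel = scores[keep], K[keep], V[keep]
    else:
        K_sel, V_sel = K, V
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_sel                               # attention output over the selected subset

d, seq = 64, 8192
q = np.random.randn(d)
K = np.random.randn(seq, d)
V = np.random.randn(seq, d)
print(selective_attention(q, K, V).shape)          # (64,)
```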
Processing different ML workloads in a cluster.
- PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters: serve multiple different loads in GPU cluster, accepted by SC'24
- PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption: why Encryption in LLM inference? by IPADS, accepted by ASPLOS'25
- Topology-aware Preemptive Scheduling for Co-located LLM Workloads: schedule different workloads
- ⭐ Chameleon: a heterogeneous and disaggregated accelerator system for retrieval-augmented language models: retrieval will be helpful, but how to use it?
- Generative Dense Retrieval: Memory Can Be a Burden: accepted by EACL'24
- ⭐ Accelerating Retrieval-Augmented Language Model Serving with Speculation: also a paper for RaLM
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation: improves RAG inference with caching, under the guidance of Xin JIN
- FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research
- Accelerating Retrieval-Augmented Language Model Serving with Speculation: help understand RaLM
- NinjaLLM: Fast, Scalable and Cost-effective RAG using Amazon SageMaker and AWS Trainium and Inferentia2
- Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting: RAG with speculative decoding, different draft models for different RAG
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion: optimize KV cache reuse(prefix cache)
- RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation: trade-off between latency and quality
- Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation: combines RAG with prefix caching (see the KV-reuse sketch below)
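Several of the RAG-serving papers above (RAGCache, CacheBlend, Cache-Craft) reuse the KV cache of previously seen prefixes or chunks. A minimal sketch of prefix-hash lookup at block granularity, with placeholder data structures instead of real KV tensors:

```python
# Minimal sketch of prefix KV cache reuse (the idea behind prefix/chunk caches in
# RAG serving): hash successive token prefixes, reuse any cached KV blocks, and
# only run prefill on the uncovered suffix. Data structures are illustrative.
import hashlib

kv_store = {}   # prefix-hash -> KV tensors for that prefix (placeholder: a string)

def prefix_key(tokens):
    return hashlib.sha256(bytes(str(tokens), "utf-8")).hexdigest()

def prefill_with_reuse(tokens, block=16):
    """Find the longest cached prefix (at block granularity), compute the rest."""
    reused = 0
    for end in range(len(tokens) - len(tokens) % block, 0, -block):
        if prefix_key(tokens[:end]) in kv_store:
            reused = end
            break
    # "Compute" KV for the uncovered suffix and cache every new block boundary.
    for end in range(reused + block, len(tokens) + 1, block):
        kv_store[prefix_key(tokens[:end])] = f"KV[0:{end}]"
    return reused, len(tokens) - reused            # tokens reused vs. recomputed

doc_chunk = list(range(100))                       # e.g. a retrieved document chunk
print(prefill_with_reuse(doc_chunk + [1, 2, 3]))   # first call: nothing reused
print(prefill_with_reuse(doc_chunk + [7, 8, 9]))   # second call shares the document prefix
```

Reuse here happens only at block boundaries; the suffix beyond the last cached block is always recomputed, which is the gap that chunk-level and position-independent caching papers try to close.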
Here are two repositories with some papers on MoE: Papers: MoE/Ensemble, and MoE papers to read. A minimal top-k gating and expert-residency sketch follows this list.
- ⭐ DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale: accepted by ICML'22
- Accelerating Distributed MoE Training and Inference with Lina: both training and inference, accepted by ATC'23
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts: accepted by MLSys'23
- Tutel: Adaptive Mixture-of-Experts at Scale: accepted by MLSys'23
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference: accepted by ISCA'24
- Optimizing Mixture of Experts using Dynamic Recompilations: under the guidance of Zhihao JIA
- Serving MoE Models on Resource-constrained Edge Devices via Dynamic Expert Swapping: expert swapping is interesting
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference: some hot optimizations for inference, accepted by NIPS'24
- Exploiting Transformer Activation Sparsity with Dynamic Inference
- SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System
- Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production: accepted by ACL'22
- Fast Inference of Mixture-of-Experts Language Models with Offloading: combines MoE with offloading
- ⭐ MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving: under the guidance of Luo MAI, provides some features and designs for MoE inference
- Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
- FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement: trains MoE with a new schedule plan, may also work for inference
- Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models: quantized experts and expert management
- Toward Inference-optimal Mixture-of-Expert Large Language Models: some analysis of training MoE based on inference cost
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules: communication optimization in MoE, accepted by InfoCom'24
- SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models: based on offloading, accepted by MLSys'24
- Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy: introduces some features of MoE, accepted by ICLR'24
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework: also introduces some features of MoE
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models: an introduction paper
- Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies: all-to-all communication, HPDC'24
- Scattered Mixture-of-Experts Implementation: ScatterMoE, an implementation of sparse MoE
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts: the shortcut connection looks more like an algorithm-level optimization, and it provides opportunities for overlapping
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model: an open-source work whose inference is based on expert parallelism
- SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget: MoE expert offloading, at the cost of reduced accuracy
- ProMoE: Fast MoE-based LLM Serving using Proactive Caching: an optimization of Pre-gated MoE, by IPADS
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design: a pre-gating router decoupled from the MoE backbone facilitates system-friendly pre-computation and lookahead scheduling, NIPS'24
- MoEsaic: Shared Mixture of Experts: shares experts among different MoE instances; "MoE's modular architecture lets users compose their model from popular off-the-shelf experts" is a new scenario
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference: uses quantization to decrease the loading overhead of uncached experts, on edge devices
- ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference: prediction- and offloading-based optimization
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs: uses an offloading pipeline to accelerate MoE inference on a single GPU
- ⭐ MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems: benchmarking for MoE systems
- ⭐ Lynx: Enabling Efficient MoE Inference through Dynamic Batch-Aware Expert Selection: damn! I had considered this before :( The key insight is that expert importance varies significantly across tokens and inference phases; it utilizes this to solve the all-activation problem
- ⭐ EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference: GEMM implementation optimization and all-to-all communication overlap
- ⭐ Optimizing Mixture-of-Experts Inference Time Combining Model Deployment and Communication Scheduling: optimizes the all-to-all order and co-locates experts from different models
- MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing: utilizes expert dependencies to optimize GPU load balance and all-to-all latency
- fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving: fine-grained expert offloading, prefetching, and caching
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts: fine-grained task scheduling and computation/all-to-all overlap
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs: offloads MoE weights to the CPU by layer, accepted by ASPLOS'25
- ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling: schedules computation and communication in MoE training, perhaps useful for MoE inference; accepted by EuroSys'24
- ST-MoE: Designing Stable and Transferable Sparse Expert Models: an early work on MoE
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models: an algorithm-level change to MoE
- Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping: computation-communication overlapping, accepted by MLSys'24
- Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training: training with offloading, ICML'24
- MPMoE: Memory Efficient MoE for Pre-Trained Models With Adaptive Pipeline Parallelism
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules: dedicated schedules for MP+EP+ESP MoE training, may also work for inference
- Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing: expert load stabilizes in the middle and late stages of training, but this may not work as well for inference
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Offline and Online Parallelization: parallelization strategies for MoE, accepted by ATC'23
- APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes: fine-tunes MoE models using the CPU plus some algorithmic insights, accepted by SC'24
- Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing: predicts the expert workload to optimize training
- FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models: there isn't much novel technology(?), accepted by ASPLOS'25
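Many of the offloading and prefetching papers above revolve around the same loop: a router picks top-k experts per token, and the system must make those experts resident on the GPU in time. A toy sketch of that loop, assuming a naive LRU residency policy and toy expert shapes:

```python
# Minimal sketch of top-k MoE routing with a naive "is this expert resident?" check,
# illustrating why the offloading/prefetching papers above track expert activations.
# Shapes, the LRU policy, and the residency handling are all illustrative assumptions.
import numpy as np
from collections import OrderedDict

n_experts, d, top_k, gpu_slots = 8, 64, 2, 4
router_w = np.random.randn(d, n_experts) * 0.1
experts = [np.random.randn(d, d) * 0.02 for _ in range(n_experts)]  # toy expert FFNs (one matrix each)
resident = OrderedDict()                                            # expert_id -> True, LRU order

def ensure_resident(eid):
    """Fake offloading: evict the least-recently-used expert if the GPU 'slots' are full."""
    if eid in resident:
        resident.move_to_end(eid)
        return
    if len(resident) >= gpu_slots:
        resident.popitem(last=False)   # evict LRU expert (would trigger a CPU->GPU copy in practice)
    resident[eid] = True

def moe_forward(x):
    """x: [d]. Route to top-k experts and combine their outputs by gate weight."""
    logits = x @ router_w
    topk = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[topk]); gates /= gates.sum()
    y = np.zeros_like(x)
    for g, eid in zip(gates, topk):
        ensure_resident(eid)           # offloading systems try to hide or avoid this step
        y += g * (experts[eid] @ x)
    return y

print(moe_forward(np.random.randn(d)).shape)   # (64,)
```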
- MOSEL: Inference Serving Using Dynamic Modality Selection: improving system throughput by 3.6x with an accuracy guarantee and shortening job completion times by 11x
- Generative AI Beyond LLMs: System Implications of Multi-Modal Generation: by META
- Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations: by Google
- Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference: optimization for diffusion models by cache
- DISTMM: Accelerating distributed multimodal model training: helpful although it is made for training, accepted by NSDI'24
- Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training: distributed multimodal training
- Efficiently serving large multimodal models using EPD Disaggregation
- MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving: position-independent caching, with both reuse and recompute, may lead to performance loss
- Characterizing and Efficiently Accelerating Multimodal Generation Model Inference: some insights
- DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models: disaggregation in multimodal training, under the guidance of Xin JIN
- Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management: efficient multimodal model training
- Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models: serving Diffusion models, accepted by NSDI'24
- DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines: accepted by MLSys'24
- SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules: more papers in diffusion models
- PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving: algorithm-based framework
What is this? Maybe multiple LLMs?
- Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems: a new scenario, by Stanford
- ALTO: An Efficient Network Orchestrator for Compound AI Systems: also new to me, by Stanford
- Proteus: A High-Throughput Inference-Serving System with Accuracy Scaling: accuracy scaling is interesting, accepted by ASPLOS'24
- MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving: multiple LLMs
- ROUTERBENCH: A Benchmark for Multi-LLM Routing System: but what is multi-LLM?
- Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification
- BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference: prompt KV cache reuse, accepted by MLSys'24
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving: similar to BlockLLM?
- Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution: for LLM-based Applications
- RouteLLM: Learning to Route LLMs with Preference Data: uses multiple LLMs for efficient serving (see the routing sketch after this list)
- USHER: Holistic Interference Avoidance for Resource Optimized ML Inference: inference several models simultaneously
- CoServe: Efficient Collaboration-of-Experts (CoE) Model Inference with Limited Memory: a new scenario, Collaboration-of-Experts instead of Mixture-of-Experts, which provides some new opportunities, accepted by ASPLOS'25
- Teola: Towards End-to-End Optimization of LLM-based Applications: end-to-end optimization
- Parrot: Efficient Serving of LLM-based Applications with Semantic Variable: accepted by OSDI'24
- Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications: many LLM apps share GPU, accepted by EuroSys'24
- Autellix: An Efficient Serving Engine for LLM Agents as General Programs: multi-agent serving has something in common with LLM applications; scheduling and preemption
- Fast Inference for Augmented Large Language Models: seems like a subclass of multi-agent
- Characterization of Large Language Model Development in the Datacenter: fault-tolerant serving in the future?
- Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement: Fault Tolerance in MoE training
- Partial Experts Checkpoint: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training: checkpointing in MoE
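The multi-LLM routing papers above (e.g., RouteLLM, Expert Router) share a simple serving pattern: a router estimates request difficulty and dispatches to a cheap or an expensive model. A minimal sketch with a placeholder difficulty heuristic, not any paper's learned router:

```python
# Minimal sketch of multi-LLM routing: a router scores each request and dispatches
# easy ones to a cheap model and hard ones to an expensive one. The difficulty
# heuristic and model endpoints are placeholders, not any paper's actual router.
from dataclasses import dataclass

@dataclass
class ModelEndpoint:
    name: str
    cost_per_1k_tokens: float

SMALL = ModelEndpoint("small-llm", 0.1)
LARGE = ModelEndpoint("large-llm", 1.0)

def difficulty(prompt: str) -> float:
    """Placeholder difficulty score; real routers learn this from preference data."""
    hard_markers = ("prove", "derive", "multi-step", "explain why")
    return 0.2 * len(prompt) / 1000 + sum(m in prompt.lower() for m in hard_markers) * 0.5

def route(prompt: str, threshold: float = 0.5) -> ModelEndpoint:
    return LARGE if difficulty(prompt) >= threshold else SMALL

for p in ["What is 2+2?", "Prove that the scheduler above is starvation-free."]:
    print(p[:30], "->", route(p).name)
```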
It is usually related to CPU-GPU heterogeneity and GPU power consumption.
- DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency
- Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems
- Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving: early exits, accepted by SOSP'24 (see the sketch after this list)
- Improving DNN Inference Throughput Using Practical, Per-Input Compute Adaptation: early exits and some system optimization, accepted by SOSP'24
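The last two entries build on early exits: an exit head after each layer checks prediction confidence and stops computation when it is high enough. A toy sketch of the mechanism, with random stand-in layers and heads:

```python
# Minimal sketch of early-exit inference: after each layer, a lightweight exit head
# checks prediction confidence; if it is high enough we skip the remaining layers.
# Layers, exit heads, and the confidence threshold are toy stand-ins.
import numpy as np

n_layers, d, n_classes = 12, 64, 10
layers = [np.random.randn(d, d) * 0.05 for _ in range(n_layers)]
exit_heads = [np.random.randn(d, n_classes) * 0.05 for _ in range(n_layers)]

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def early_exit_forward(x, threshold=0.9):
    """Run layers until an exit head is confident enough; return (prediction, layers used)."""
    h = x
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        h = np.tanh(layer @ h)
        probs = softmax(head.T @ h)
        if probs.max() >= threshold:               # confident: exit early, saving the remaining layers
            return int(probs.argmax()), i + 1
    return int(probs.argmax()), n_layers            # no early exit triggered

pred, used = early_exit_forward(np.random.randn(d))
print(f"predicted class {pred} after {used}/{n_layers} layers")
```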
Wise men learn from others.
- Orca 2: Teaching Small Language Models How to Reason
- FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference: optimization for retrieval-augmented language model
- Optimizing Dynamic Neural Networks with Brainstorm: this idea has the potential to go further, accepted by OSDI'23
- Ring Attention with Blockwise Transformers for Near-Infinite Context: Ring Attention?
- Reducing Activation Recomputation in Large Transformer Models: by NVIDIA
- Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models: an interesting performance metric, accepted by NIPS'23
- FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication: accepted by SIGMOD'23
- Efficient Multi-GPU Graph Processing with Remote Work Stealing: accepted by ICDE'23
- ARK: GPU-driven Code Execution for Distributed Deep Learning: accepted by NSDI'23
- Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs: accepted by MLSys'22
- Golgi: Performance-Aware, Resource-Efficient Function Scheduling for Serverless Computing: Scheduling for Serverless Computing
- FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters: expand to other ML models instead of LLM
- Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication
- FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing
- Two-Face: Combining Collective and One-Sided Communication for Efficient Distributed SpMM: efficient SpMM, accepted by ASPLOS'24
- GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching: GPU memory pool, accepted by ASPLOS'24
- QuickLLaMA: Query-aware Inference Acceleration for Large Language Models: an inference-friendly LLaMA architecture
- HybridFlow: A Flexible and Efficient RLHF Framework: framework for RLHF
- Marconi: Prefix Caching for the Era of Hybrid LLMs: prefix caching for new model architectures that combine attention with SSMs
I'd like to create a separate area for data flows. It's just my preference.
- ⭐ FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks: dataflow in inference
- Pathways: Asynchronous Distributed Dataflow for ML: accepted by MLSys'22
- VirtualFlow: Decoupling Deep Learning Models from the Underlying Hardware: accepted by MLSys'22
How about data pre-processing overhead in training?
Just my preference.
- Boosting Distributed Full-graph GNN Training with Asynchronous One-bit Communication
- GNNPipe: Scaling Deep GNN Training with Pipelined Model Parallelism
- PckGNN: Optimizing Aggregation Operators with Packing Strategies in Graph Neural Networks: accepted by IPDPS'24
- NPA: Improving Large-scale Graph Neural Networks with Non-parametric Attention: SIGMOD'24
- Eliminating Data Processing Bottlenecks in GNN Training over Large Graphs via Two-level Feature Compression: compress node features in graph, accepted by VLDB'24
- Mega: More Efficient Graph Attention for GNNs: optimize graph attention efficiency, ICDCS'24
- TORCHGT: A Holistic System for Large-Scale Graph Transformer Training: graph transformer model
Just my preference, too.