Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

arXiv preprint arXiv:2402.02057

Published On 2024/2/3

Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound, resulting in high latency and significant waste of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding often require a draft model (e.g., speculative decoding), which is nontrivial to obtain and unable to generalize. In this paper, we introduce Lookahead decoding, an exact, parallel decoding algorithm that accelerates LLM decoding without needing auxiliary models or data stores. It allows trading per-step log(FLOPs) to reduce the total number of decoding steps, is more parallelizable on single or multiple modern accelerators, and is compatible with concurrent memory-efficient attention (e.g., FlashAttention). Our implementation of Lookahead decoding can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks. Our code is available at https://github.com/hao-ai-lab/LookaheadDecoding
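The abstract's two key claims, that the method is "exact" and that it trades extra per-step work for fewer decoding steps, can be illustrated with a toy guess-and-verify loop. This is a minimal sketch and not the authors' implementation: `next_token` is a hypothetical deterministic stand-in for greedy decoding, `oracle_propose` is a hypothetical candidate source used only to show the best case, and how Lookahead decoding actually produces its candidates is outside the scope of this sketch.

```python
from typing import Callable, List, Sequence, Tuple


def next_token(context: Sequence[int]) -> int:
    """Toy deterministic 'model' (assumption): stands in for greedy argmax decoding."""
    return (sum(context) * 31 + len(context)) % 50


def greedy_decode(prompt: Sequence[int], n_new: int) -> Tuple[List[int], int]:
    """Plain autoregressive decoding: one token per step."""
    out = list(prompt)
    steps = 0
    while len(out) < len(prompt) + n_new:
        out.append(next_token(out))
        steps += 1
    return out, steps


def verified_decode(
    prompt: Sequence[int],
    n_new: int,
    propose: Callable[[Sequence[int]], Sequence[int]],
) -> Tuple[List[int], int]:
    """Guess-and-verify decoding: a guessed token is kept only if it equals
    what greedy decoding would have produced, so the output is unchanged."""
    out = list(prompt)
    target = len(prompt) + n_new
    steps = 0
    while len(out) < target:
        guess = list(propose(out))
        steps += 1  # counted as one step; a real system would batch these checks
        ctx = list(out)
        accepted = [next_token(ctx)]  # every step yields at least one token
        for g in guess:
            if g != accepted[-1]:
                break  # first wrong guess invalidates the rest
            ctx.append(g)
            accepted.append(next_token(ctx))
        out.extend(accepted)
    return out[:target], steps


def oracle_propose(context: Sequence[int], k: int = 3) -> List[int]:
    """Hypothetical best-case proposer: always guesses the next k tokens correctly."""
    ctx = list(context)
    guess = []
    for _ in range(k):
        token = next_token(ctx)
        guess.append(token)
        ctx.append(token)
    return guess
```

With wrong guesses the loop degrades to ordinary decoding (one token per step); with correct guesses it emits several tokens per step while producing the same sequence as `greedy_decode`.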

Authors

Ion Stoica

University of California, Berkeley

H-Index

153

Research Interests

Cloud Computing

Networking

Distributed Systems

Big Data

Hao Zhang

Carnegie Mellon University

H-Index

Research Interests

Machine Learning

Systems

Computer Vision

Other Articles from authors

Hao Zhang

Carnegie Mellon University

arXiv preprint arXiv:2401.09670

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

DistServe improves the performance of large language model (LLM) serving by disaggregating the prefill and decoding computation. Existing LLM serving systems colocate the two phases and batch the computation of prefill and decoding across all users and requests. We find that this strategy not only leads to strong prefill-decoding interference but also couples the resource allocation and parallelism plans for both phases. LLM applications often emphasize individual latency for each phase: time to first token (TTFT) for the prefill phase and time per output token (TPOT) of each request for the decoding phase. In the presence of stringent latency requirements, existing systems have to prioritize one latency over the other, or over-provision compute resources to meet both. DistServe assigns prefill and decoding computation to different GPUs, hence eliminating prefill-decoding interference. Given the application's TTFT and TPOT requirements, DistServe co-optimizes the resource allocation and parallelism strategy tailored for each phase. DistServe also places the two phases according to the serving cluster's bandwidth to minimize the communication caused by disaggregation. As a result, DistServe significantly improves LLM serving performance in terms of the maximum rate that can be served within both TTFT and TPOT constraints on each GPU. Our evaluations show that on various popular LLMs, applications, and latency requirements, DistServe can serve 4.48x more requests or 10.2x tighter SLO, compared to state-of-the-art systems, while staying within latency constraints for > 90% of requests.
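The serving criterion in this abstract (requests must satisfy both a TTFT and a TPOT bound, for more than 90% of requests) can be written as a small check. This is an illustrative sketch only: the `Request` type, the function names, and the example numbers are assumptions introduced here, not part of DistServe.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Request:
    """Measured latencies for one served request (hypothetical record type)."""
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # time per output token, in seconds


def slo_attainment(requests: Sequence[Request], ttft_slo_s: float, tpot_slo_s: float) -> float:
    """Fraction of requests that meet BOTH the TTFT and the TPOT bound."""
    if not requests:
        return 0.0
    ok = sum(1 for r in requests if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s)
    return ok / len(requests)


def meets_target(
    requests: Sequence[Request],
    ttft_slo_s: float,
    tpot_slo_s: float,
    target: float = 0.9,  # the "> 90% of requests" threshold from the abstract
) -> bool:
    """True when strictly more than `target` of the requests are within both bounds."""
    return slo_attainment(requests, ttft_slo_s, tpot_slo_s) > target


# Example with made-up measurements: the last request misses the TPOT bound.
sample = [
    Request(ttft_s=0.20, tpot_s=0.04),
    Request(ttft_s=0.25, tpot_s=0.05),
    Request(ttft_s=0.30, tpot_s=0.03),
    Request(ttft_s=0.22, tpot_s=0.12),
]
attainment = slo_attainment(sample, ttft_slo_s=0.4, tpot_slo_s=0.1)
```

Under this reading, the rate a system can sustain is the highest request rate at which `meets_target` still holds, which is why a request that is fast on one metric but slow on the other does not count.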

2024/1/18

depyf: Open the Opaque Box of PyTorch Compiler for Machine Learning Researchers

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Can't Be Late: Optimizing Spot Instance Savings under Deadlines

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

ZKML: An Optimizing System for ML Inference in Zero-Knowledge Proofs

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving

Cloudcast: High-Throughput, Cost-Aware Overlay Multicast in the Cloud

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

GoEX: Perspectives and Designs Towards a Runtime for Autonomous LLM Applications

Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems

Trustless Audits without Revealing Data or Models

Toward Inference-optimal Mixture-of-Expert Large Language Models

APIServe: Efficient API Support for Large-Language Model Inferencing
