Ion Stoica

University of California, Berkeley

H-index: 153

North America-United States

Professor Information

University

University of California, Berkeley

Position

Professor of Computer Science

Citations (all)

148,602

Citations (since 2020)

44,618

Cited by

120,965

h-index (all)

153

h-index (since 2020)

86

i10-index (all)

407

i10-index (since 2020)

296


Research Interests

Cloud Computing

Networking

Distributed Systems

Big Data

Top articles of Ion Stoica

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Autoregressive decoding of large language models (LLMs) is memory bandwidth bound, resulting in high latency and significant waste of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding often require a draft model (e.g., speculative decoding), which is nontrivial to obtain and unable to generalize. In this paper, we introduce Lookahead decoding, an exact, parallel decoding algorithm that accelerates LLM decoding without needing auxiliary models or data stores. It allows trading per-step log(FLOPs) to reduce the number of total decoding steps, is more parallelizable on single or multiple modern accelerators, and is compatible with concurrent memory-efficient attention (e.g., FlashAttention). Our implementation of Lookahead decoding can speed up autoregressive decoding by up to 1.8x on MT-bench and 4x with strong scaling on multiple GPUs in code completion tasks. Our code is available at https://github.com/hao-ai-lab/LookaheadDecoding. (A toy sketch of the underlying guess-and-verify idea follows this entry.)

Authors

Yichao Fu,Peter Bailis,Ion Stoica,Hao Zhang

Journal

arXiv preprint arXiv:2402.02057

Published Date

2024/2/3
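
The abstract above centers on a draft-free guess-and-verify loop. Below is a toy Python sketch of that general principle only, not the paper's algorithm: real Lookahead decoding builds its candidate n-grams with parallel Jacobi-style lookahead on the accelerator, whereas this sketch harvests n-grams from already-generated text and uses a made-up deterministic "model" (toy_model_next and every other name here is invented for illustration). It does preserve the exactness property: the output matches plain greedy decoding while several tokens can be accepted per step.

def toy_model_next(token_ids):
    # Stand-in for an LLM's greedy next-token rule: deterministic and cheap,
    # so the exactness of the accelerated output is easy to check.
    return (token_ids[-1] + 1) % 7

def generate(prompt, max_new_tokens=32, ngram=3):
    out = list(prompt)
    pool = {}          # token -> most recent n-gram observed to follow it
    steps = 0
    while len(out) < len(prompt) + max_new_tokens:
        steps += 1
        guess = pool.get(out[-1], [])
        accepted, prefix = [], list(out)
        for g in guess:                        # verify guesses against the model's own greedy choices
            if toy_model_next(prefix) != g:
                break
            accepted.append(g)
            prefix.append(g)
        nxt = toy_model_next(out + accepted)   # the one token ordinary decoding would emit
        out.extend(accepted + [nxt])
        for i in range(max(0, len(out) - 2 * ngram), len(out) - ngram):
            pool[out[i]] = out[i + 1:i + 1 + ngram]   # refresh the n-gram pool from fresh output
    return out, steps

tokens, steps = generate([1, 2, 3])
print(f"{len(tokens) - 3} tokens in {steps} steps (vs. one token per step autoregressively)")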

depyf: Open the Opaque Box of PyTorch Compiler for Machine Learning Researchers

PyTorch 2.x introduces a compiler designed to accelerate deep learning programs. However, harnessing the PyTorch compiler to its full potential can be challenging for machine learning researchers. The compiler operates at the Python bytecode level, making it appear as an opaque box. To address this, we introduce depyf, a tool designed to demystify the inner workings of the PyTorch compiler. depyf decompiles bytecode generated by PyTorch back into equivalent source code, and establishes connections between in-memory code objects and their on-disk source code counterparts. This feature enables users to step through the source code line by line using debuggers, thus enhancing their understanding of the underlying processes. Notably, depyf is non-intrusive and user-friendly, primarily relying on two convenient context managers for its core functionality. The project is openly available at https://github.com/thuml/depyf and is recognized as a PyTorch ecosystem project (https://pytorch.org/ecosystem/). (A short usage sketch follows this entry.)

Authors

Kaichao You,Runsheng Bai,Meng Cao,Jianmin Wang,Ion Stoica,Mingsheng Long

Journal

arXiv preprint arXiv:2403.13839

Published Date

2024/3/14
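
A minimal usage sketch, assuming the two context managers mentioned in the abstract are exposed as depyf.prepare_debug and depyf.debug (check the README at https://github.com/thuml/depyf for the authoritative API); the dump directory name and the toy compiled function are made up.

import torch
import depyf

@torch.compile
def add_relu(x, y):
    return torch.relu(x + y)

# Dump decompiled source for everything torch.compile generates while this
# context is active, so the generated code can be read and diffed on disk.
with depyf.prepare_debug("./depyf_dump"):
    add_relu(torch.randn(8), torch.randn(8))

# Link in-memory code objects to the dumped files so a debugger can step
# through the compiler-generated code line by line.
with depyf.debug():
    add_relu(torch.randn(8), torch.randn(8))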

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity

Large language models (LLMs) are increasingly integrated into many online services. However, a major challenge in deploying LLMs is their high cost, due primarily to the use of expensive GPU instances. To address this problem, we find that the significant heterogeneity of GPU types presents an opportunity to increase GPU cost efficiency and reduce deployment costs. The broad and growing market of GPUs creates a diverse option space with varying costs and hardware specifications. Within this space, we show that there is not a linear relationship between GPU cost and performance, and identify three key LLM service characteristics that significantly affect which GPU type is the most cost-effective: model request size, request rate, and latency service-level objective (SLO). We then present Mélange, a framework for navigating the diversity of GPUs and LLM service specifications to derive the most cost-efficient set of GPUs for a given LLM service. We frame the task of GPU selection as a cost-aware bin-packing problem, where GPUs are bins with a capacity and cost, and items are request slices defined by a request size and rate. Upon solution, Mélange derives the minimal-cost GPU allocation that adheres to a configurable latency SLO. Our evaluations across both real-world and synthetic datasets demonstrate that Mélange can reduce deployment costs by up to 77% as compared to utilizing only a single GPU type, highlighting the importance of making heterogeneity-aware GPU provisioning decisions for LLM serving. Our source code is publicly available at https://github.com/tyler-griggs/melange-release. (A simplified sketch of the bin-packing framing follows this entry.)

Authors

Tyler Griggs,Xiaoxuan Liu,Jiaxiang Yu,Doyoung Kim,Wei-Lin Chiang,Alvin Cheung,Ion Stoica

Journal

arXiv preprint arXiv:2404.14527

Published Date

2024/4/22
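
To make the bin-packing framing concrete, here is a deliberately simplified Python sketch. All GPU names, prices, and capacities are invented, and unlike Mélange's joint cost-aware bin-packing, each request bucket is assigned independently to whichever GPU type serves it most cheaply; the point is only to show why a heterogeneous mix can undercut any single GPU type.

import math

HOURLY_PRICE = {"gpu_small": 1.00, "gpu_large": 4.00}   # $/hour (hypothetical)

# Max requests/second one GPU can sustain per bucket while meeting the latency
# SLO (buckets stand in for request sizes; the numbers are illustrative only).
CAPACITY = {
    "short_prompts": {"gpu_small": 20.0, "gpu_large": 50.0},
    "long_prompts":  {"gpu_small": 0.5,  "gpu_large": 12.0},
}

def provision(demand):
    """demand: bucket -> requests/second. Returns per-bucket GPU choice and total $/hour."""
    plan, total_cost = {}, 0.0
    for bucket, rate in demand.items():
        best = min(CAPACITY[bucket],
                   key=lambda g: HOURLY_PRICE[g] / CAPACITY[bucket][g])  # $ per unit throughput
        count = math.ceil(rate / CAPACITY[bucket][best])
        plan[bucket] = (best, count)
        total_cost += count * HOURLY_PRICE[best]
    return plan, total_cost

print(provision({"short_prompts": 90.0, "long_prompts": 5.0}))
# Mixed fleet: $9/hour here, versus $15/hour with only gpu_small or $12/hour
# with only gpu_large for the same workload.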

Can't Be Late: Optimizing Spot Instance Savings under Deadlines

Cloud providers offer spot instances alongside on-demand instances to optimize resource utilization. While economically appealing, spot instances' preemptible nature makes them ill-suited for deadline-sensitive jobs. To allow jobs to meet deadlines while leveraging spot instances, we propose a simple idea: use on-demand instances judiciously as a backup resource. However, due to the unpredictable spot instance availability, determining when to switch between spot and on-demand to minimize cost requires careful policy design. In this paper, we first provide an in-depth characterization of spot instances (e.g., availability, pricing, duration), and develop a basic theoretical model to examine the worst- and average-case behaviors of baseline policies (e.g., greedy). The model serves as a foundation to motivate our design of a simple and effective policy, Uniform Progress, which is parameter-free and requires no … (A toy simulation of this idea follows this entry.)

Authors

Zhanghao Wu,Wei-Lin Chiang,Ziming Mao,Zongheng Yang,Eric Friedman,Scott Shenker,Ion Stoica

Journal

NSDI 2024

Published Date

2024
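
The following toy simulation paraphrases the idea as stated in the abstract: run on spot whenever it is available, and fall back to on-demand only when completed work drops below a straight "uniform progress" line from zero at submission to the full job at the deadline. The availability model, prices, and the assumption of instant, free switching are all invented for illustration and are not the paper's model.

import random

def simulate(total_work=100.0, deadline=200.0, dt=1.0,
             spot_price=0.3, ondemand_price=1.0, seed=0):
    rng = random.Random(seed)
    done, t, cost, spot_up = 0.0, 0.0, 0.0, True
    while done < total_work and t < deadline:
        if rng.random() < 0.05:         # toy spot market: availability flips occasionally
            spot_up = not spot_up
        # Work we should have finished by the next step if progress were spread
        # uniformly from submission to the deadline.
        target = total_work * min(1.0, (t + dt) / deadline)
        if spot_up:
            done += dt                  # spot is available: always use it
            cost += spot_price * dt
        elif done < target:
            done += dt                  # behind the uniform line: pay for on-demand
            cost += ondemand_price * dt
        # otherwise: ahead of the line with no spot capacity, so just wait
        t += dt
    return done >= total_work, round(cost, 1), t

print(simulate())   # (met_deadline, cost, finish_time)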

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code-related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and February 2024. We have evaluated 9 base LLMs and 20 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and models. (A small sketch of the date-based contamination control follows this entry.)

Authors

Naman Jain,King Han,Alex Gu,Wen-Ding Li,Fanjia Yan,Tianjun Zhang,Sida Wang,Armando Solar-Lezama,Koushik Sen,Ion Stoica

Journal

arXiv preprint arXiv:2403.07974

Published Date

2024/3/12
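
The contamination control described above boils down to only scoring a model on problems released after its training data cutoff. A minimal sketch of that filter is below; the field names, dates, and schema are hypothetical, and LiveCodeBench's released toolkit defines the real one.

from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    platform: str        # e.g. "LeetCode", "AtCoder", or "CodeForces"
    title: str
    release_date: date

def contamination_free(problems, model_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > model_cutoff]

problems = [
    Problem("LeetCode", "old-problem", date(2023, 4, 1)),
    Problem("AtCoder", "new-problem", date(2023, 9, 15)),
]
print(contamination_free(problems, model_cutoff=date(2023, 8, 31)))  # only "new-problem" survives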

ZKML: An Optimizing System for ML Inference in Zero-Knowledge Proofs

Machine learning (ML) is increasingly used behind closed systems and APIs to make important decisions. For example, social media uses ML-based recommendation algorithms to decide what to show users, and millions of people pay to use ChatGPT for information every day. Because ML is deployed behind these closed systems, there are increasing calls for transparency, such as releasing model weights. However, these service providers have legitimate reasons not to release this information, including for privacy and trade secrets. To bridge this gap, recent work has proposed using zero-knowledge proofs (specifically a form called ZK-SNARKs) for certifying computation with private models but has only been applied to unrealistically small models. In this work, we present the first framework, ZKML, to produce ZK-SNARKs for realistic ML models, including state-of-the-art vision models, a distilled GPT-2, and the …

Authors

Bing-Jyue Chen,Suppakit Waiwitlikhit,Ion Stoica,Daniel Kang

Published Date

2024/4/22

Cloudcast: High-Throughput, Cost-Aware Overlay Multicast in the Cloud

Bulk data replication across multiple cloud regions and providers is essential for large organizations to support data analytics, disaster recovery, and geo-distributed model serving. However, data multicast in the cloud can be expensive due to network egress costs and slow due to cloud network constraints. In this paper, we study the design of high-throughput, cost-optimized overlay multicast for bulk cloud data replication that exploits trends in modern provider pricing models along with techniques like ephemeral waypoints to minimize cloud networking costs. (A toy cost comparison illustrating the waypoint idea follows this entry.)

Authors

Sarah Wooders,Shu Liu,Paras Jain,Xiangxi Mo,Joseph E Gonzalez,Vincent Liu,Ion Stoica

Published Date

2024
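
A toy illustration of why relaying through an intermediate region ("ephemeral waypoint") can cut multicast cost: with a made-up per-GB egress price matrix, one copy sent to a cheap waypoint and fanned out from there beats sending a full copy directly to every destination. Region names and prices are invented and are not real provider prices or Cloudcast's planner.

EGRESS = {  # $/GB from first region to second region (hypothetical prices)
    ("src", "dst1"): 0.12, ("src", "dst2"): 0.12, ("src", "way"): 0.02,
    ("way", "dst1"): 0.05, ("way", "dst2"): 0.05,
}

def direct_cost(gb, dests):
    # Source sends a full copy straight to every destination.
    return gb * sum(EGRESS[("src", d)] for d in dests)

def waypoint_cost(gb, dests, waypoint="way"):
    # Source sends one copy to the waypoint, which fans out to the destinations.
    return gb * (EGRESS[("src", waypoint)] + sum(EGRESS[(waypoint, d)] for d in dests))

gb, dests = 1000, ["dst1", "dst2"]
print(direct_cost(gb, dests), waypoint_cost(gb, dests))   # roughly $240 direct vs $120 via the waypoint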

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating their alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at https://chat.lmsys.org. (A minimal ranking sketch follows this entry.)

Authors

Wei-Lin Chiang,Lianmin Zheng,Ying Sheng,Anastasios Nikolas Angelopoulos,Tianle Li,Dacheng Li,Hao Zhang,Banghua Zhu,Michael Jordan,Joseph E Gonzalez,Ion Stoica

Journal

arXiv preprint arXiv:2403.04132

Published Date

2024/3/7
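
As a concrete example of turning crowdsourced pairwise votes into a ranking, here is a minimal Bradley-Terry fit; the Arena's published analyses describe a Bradley-Terry style model, but with substantially more statistical machinery than this sketch, and the vote counts and model names below are invented for illustration.

from collections import defaultdict

def bradley_terry(wins, iters=200):
    """wins[(a, b)] = number of battles that a won against b."""
    models = sorted({m for pair in wins for m in pair})
    n = defaultdict(int)                        # total battles per unordered pair
    for (a, b), w in wins.items():
        n[frozenset((a, b))] += w
    p = {m: 1.0 for m in models}                # strength parameters
    for _ in range(iters):                      # minorization-maximization updates
        new_p = {}
        for i in models:
            total_wins = sum(w for (a, b), w in wins.items() if a == i)
            denom = sum(n[frozenset((i, j))] / (p[i] + p[j])
                        for j in models if j != i and n[frozenset((i, j))] > 0)
            new_p[i] = total_wins / denom if denom else p[i]
        scale = len(models) / sum(new_p.values())
        p = {m: v * scale for m, v in new_p.items()}
    return dict(sorted(p.items(), key=lambda kv: -kv[1]))

votes = {("model_a", "model_b"): 70, ("model_b", "model_a"): 30,
         ("model_a", "model_c"): 60, ("model_c", "model_a"): 40,
         ("model_b", "model_c"): 55, ("model_c", "model_b"): 45}
print(bradley_terry(votes))   # higher score = preferred more often in head-to-head votes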

Professor FAQs

What is Ion Stoica's h-index at University of California, Berkeley?

Ion Stoica's h-index at University of California, Berkeley is 153 overall and 86 since 2020.

What are Ion Stoica's research interests?

Ion Stoica's research interests are Cloud Computing, Networking, Distributed Systems, and Big Data.

What is Ion Stoica's total number of citations?

Ion Stoica has 148,602 citations in total.
