Atlas Wang

Biography

Atlas Wang is a tenured Associate Professor and holds the Temple Foundation Endowed Faculty Fellowship #7 in the ECE Department at The University of Texas at Austin, where he leads the VITA group (https://vita-group.github.io/). Since May 2024, he has been on leave from UT Austin to serve as the full-time Research Director for XTX Markets, heading its new AI Lab in New York City. Dr. Wang’s core research mission is to leverage, understand, and expand the role of low dimensionality in machine learning and optimization, with impact spanning many important topics such as efficiency and trust in large language models (LLMs) as well as generative vision. He has won numerous research awards, and previously served as a research director for Picsart (2022-2024) and a visiting researcher at Amazon (2021-2022).

Efficient Generative Inference by Heavy Hitters and Beyond

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in applications such as dialogue systems and story writing. However, their deployment remains cost-prohibitive, particularly due to the extensive memory requirements of long-content generation. This talk will present approaches for reducing the memory footprint and improving the throughput of LLM inference, focusing on management of the KV cache, a key component stored in GPU memory whose size scales with sequence length and batch size. I will first introduce the Heavy Hitter Oracle (H2O), a novel KV cache eviction policy that significantly reduces memory usage by dynamically retaining a balance of recent tokens and “Heavy Hitter” (H2) tokens, i.e., the tokens that contribute most to the accumulated attention scores. I will discuss how the H2O approach, formulated as a dynamic submodular optimization problem, outperforms existing inference systems. I will then address the compound effects of combining sparsification and quantization on the KV cache, proposing the Q-Hitter framework. Q-Hitter enhances H2O by identifying tokens that are not only pivotal according to accumulated attention scores but also more amenable to low-bit quantization; this joint consideration of attention importance and quantization friendliness yields substantial memory savings and throughput improvements. The talk will conclude with a review of more recent efforts on efficient LLM inference and future opportunities.
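
To make the heavy-hitter idea concrete, below is a minimal NumPy sketch of one H2O-style eviction step. It is not the authors' implementation; the function name, budget split, and toy numbers are illustrative assumptions. The cache keeps a window of the most recent tokens plus the older tokens with the largest accumulated attention scores, and evicts the rest.

# Illustrative sketch (not the authors' code) of an H2O-style KV cache
# eviction step: keep the most recent tokens plus the "heavy hitter"
# tokens with the largest accumulated attention scores.
import numpy as np

def h2o_keep_indices(acc_attention: np.ndarray, recent_budget: int, heavy_budget: int) -> np.ndarray:
    """Return indices of cached tokens to retain.

    acc_attention[i] is the attention score token i has accumulated
    (summed over the queries generated so far).
    """
    seq_len = acc_attention.shape[0]
    if seq_len <= recent_budget + heavy_budget:
        return np.arange(seq_len)  # cache still fits the budget, evict nothing

    # Always keep the most recent tokens (local context).
    recent = np.arange(seq_len - recent_budget, seq_len)

    # Among the older tokens, keep those with the highest
    # accumulated attention scores ("heavy hitters").
    older = np.arange(seq_len - recent_budget)
    heavy = older[np.argsort(acc_attention[older])[-heavy_budget:]]

    return np.sort(np.concatenate([heavy, recent]))

# Toy usage: a 12-token cache reduced to 4 recent + 3 heavy-hitter tokens.
scores = np.array([5.0, 0.1, 3.2, 0.2, 0.1, 4.1, 0.3, 0.2, 1.0, 0.1, 0.2, 0.1])
print(h2o_keep_indices(scores, recent_budget=4, heavy_budget=3))  # [0 2 5 8 9 10 11]

In this toy run, tokens 0, 2, and 5 survive as heavy hitters alongside the four most recent positions, illustrating how a small, attention-informed budget can stand in for the full cache.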