Biography
Prof. Dr. Atlas Wang
Atlas Wang is a tenured Associate Professor in the ECE Department at The University of Texas at Austin, where he holds the Temple Foundation Endowed Faculty Fellowship #7. He leads the VITA group (https://vita-group.github.io/).
Efficient Generative Inference by Heavy Hitters and Beyond
Abstract
Large Language Models (LLMs) have demonstrated impressive capabilities in applications such as dialogue systems and story writing. However, their deployment remains cost-prohibitive, particularly due to the extensive memory requirements of long-content generation. This talk will present approaches for reducing the memory footprint and improving the throughput of LLMs, focusing on the management of the KV cache, a key component stored in GPU memory that grows with sequence length and batch size. I will first introduce the Heavy Hitter Oracle (H2O), a novel KV cache eviction policy that significantly reduces memory usage by dynamically retaining a balance of recent tokens and "Heavy Hitter" (H2) tokens, the tokens that contribute most to the attention scores. We will discuss how the H2O approach, based on submodular optimization, outperforms existing inference systems. We will then address the compound effects of combining sparsification and quantization techniques on the KV cache, proposing the Q-Hitter framework. Q-Hitter enhances the H2O approach by identifying tokens that are not only pivotal according to accumulated attention scores but also more suitable for low-bit quantization. This dual consideration of attention importance and quantization friendliness leads to substantial memory savings and throughput improvements. We will conclude the talk by reviewing more recent efforts on efficient LLM inference and future opportunities.
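To make the eviction idea concrete, below is a minimal, illustrative sketch of an H2O-style KV cache eviction step, not the speaker's implementation: each cached token carries an accumulated attention score, and only the highest-scoring "heavy hitter" tokens plus a window of the most recent tokens are kept. The function name, tensor shapes, and budget parameters (num_heavy, num_recent) are assumptions chosen for illustration.

```python
import torch

def h2o_evict(attn_scores, keys, values, num_heavy, num_recent):
    """Illustrative H2O-style eviction (assumed interface, not the official H2O code).

    attn_scores: (seq_len,) accumulated attention mass each cached token has received
    keys, values: (seq_len, head_dim) cached key/value tensors for one head
    Keeps the `num_recent` most recent tokens plus the `num_heavy` older tokens
    with the largest accumulated attention ("heavy hitters"); everything else is evicted.
    """
    seq_len = attn_scores.shape[0]
    boundary = max(seq_len - num_recent, 0)
    # Always keep the recent window.
    recent_idx = torch.arange(boundary, seq_len)
    # Among older tokens, keep those with the highest accumulated attention.
    older_scores = attn_scores[:boundary]
    heavy_idx = torch.topk(older_scores, min(num_heavy, older_scores.shape[0])).indices
    keep = torch.cat([heavy_idx, recent_idx]).sort().values
    return keys[keep], values[keep], attn_scores[keep]
```

In a real decoding loop, a step like this would run once per generated token, with attn_scores updated by adding the new query's attention weights over the cached tokens, so the cache size stays bounded by num_heavy + num_recent.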