The paper introduces PagedAttention, a method for reducing memory usage when serving Large Language Models (LLMs), inspired by virtual memory and paging techniques in operating systems. It targets the significant memory waste that existing serving systems incur through inefficient management of key-value (KV) cache memory, which is crucial to LLM performance. …
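
To make the paging analogy concrete, here is a minimal Python sketch of how a KV cache could be managed in fixed-size blocks with a per-sequence block table, much as an operating system maps virtual pages to physical frames. It is an illustration under stated assumptions, not the paper's implementation: `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are hypothetical names, the block size of 4 tokens is arbitrary, and the sketch tracks only where each token's KV entries would live, not the tensors themselves.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per block; illustrative value, not taken from the paper


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool (hypothetical helper)."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV cache blocks")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping (its block table)."""

    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator: BlockAllocator) -> tuple[int, int]:
        """Reserve a KV slot for one new token; return (physical block, offset)."""
        # A new physical block is requested only when the last one is full,
        # so memory grows on demand instead of being reserved up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        slot = (self.block_table[-1], self.num_tokens % BLOCK_SIZE)
        self.num_tokens += 1
        return slot

    def release(self, allocator: BlockAllocator) -> None:
        """Return every block to the pool once the request finishes."""
        for block_id in self.block_table:
            allocator.free(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq = Sequence()
slots = [seq.append_token(allocator) for _ in range(6)]
print(seq.block_table, slots)
```

In this sketch the physical blocks need not be contiguous, and the only unused space per sequence is the unfilled tail of its last block (at most `BLOCK_SIZE - 1` token slots). A scheme that reserved one contiguous region sized for the longest possible output would hold that memory idle for as long as the request ran.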