BayJarvis: Blogs on inference

paper LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models - 2024-03-26

Large Language Models (LLMs) like ChatGPT have transformed numerous fields through their reasoning and generalization capabilities. However, as techniques like chain-of-thought (CoT) prompting and in-context learning (ICL) become more prevalent, prompts grow longer and the computational cost of inference rises with them. This paper introduces LLMLingua, a prompt compression method designed to mitigate this cost. By compressing prompts into a more compact form without significant loss of semantic integrity, LLMLingua enables faster inference and lower computational cost, reporting up to 20x compression with minimal performance degradation. …
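To make the idea of prompt compression concrete, here is a toy sketch that keeps only the most informative tokens of a prompt. It is not LLMLingua's actual pipeline: the function name `compress`, the `ratio` parameter, and the per-token `surprisal` scores are all assumptions for illustration. In practice the scores would come from a small language model; here they are passed in by hand.

```python
def compress(tokens, surprisal, ratio):
    """Keep roughly len(tokens) / ratio tokens, preferring high-surprisal ones.

    tokens:    list of token strings
    surprisal: hypothetical per-token information scores (higher = more informative)
    ratio:     target compression ratio, e.g. 2 keeps about half the tokens
    """
    keep = max(1, int(len(tokens) / ratio))
    # Rank token positions by score and take the top `keep` of them.
    top = sorted(range(len(tokens)), key=lambda i: surprisal[i], reverse=True)[:keep]
    # Restore the original order so the compressed prompt stays readable.
    return [tokens[i] for i in sorted(top)]
```

Called with the tokens of "the cat sat on the mat" and low scores on the function words, a ratio of 2 would keep only the three content words in their original order.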

paper Efficient Memory Management for Large Language Model Serving with PagedAttention - 2024-03-25

The paper introduces PagedAttention, an approach to managing memory when serving Large Language Models (LLMs), inspired by virtual memory and paging techniques in operating systems. It addresses the significant memory waste in existing systems caused by inefficient handling of the key-value (KV) cache, which is crucial to LLM serving performance. …
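The paging analogy can be sketched as a per-sequence block table that maps logical token positions to fixed-size physical blocks, allocated only when needed. This is a minimal illustration of the bookkeeping, not the paper's implementation: the class name `BlockAllocator` and its methods are hypothetical, and no attention computation or GPU memory is involved.

```python
class BlockAllocator:
    """Toy KV-cache allocator: each sequence gets a table of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a KV slot for one more token, allocating a block on demand."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def locate(self, seq_id, pos):
        """Translate a logical token position to (physical block id, offset)."""
        return self.tables[seq_id][pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are handed out one at a time, a sequence wastes at most one partially filled block, rather than a contiguous region reserved up front for its maximum possible length.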

paper From Draft to Target: Optimizing Language Model Decoding with Speculative Sampling - 2023-09-04

Large language models have transformed what machine learning can do, but decoding from them efficiently remains a challenge. Enter Speculative Sampling, a technique in which a small draft model proposes tokens that the large target model then verifies, promising a substantially faster decoding process. …
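The excerpt does not spell out the mechanics, so the following is a sketch of the accept/reject rule as speculative sampling is commonly described, for a single token position. The function name `speculative_step` and the use of plain probability lists are assumptions for illustration; a real system would obtain `p` and `q` from the target and draft models and check several drafted tokens per target forward pass.

```python
import random

def speculative_step(p, q, rng=random):
    """One accept/reject step of speculative sampling.

    p: target-model probabilities over the vocabulary (list of floats)
    q: draft-model probabilities over the same vocabulary
    Returns (token, accepted).
    """
    # The draft model proposes a token by sampling from q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept the proposal with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    # Rejection implies p[x] < q[x], so the residual has positive mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False
```

The point of the residual resampling is that the returned token is distributed according to `p`, so output quality matches the target model while most of the sampling work is done by the cheaper draft model.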