LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Abstract

Large Language Models (LLMs) like ChatGPT have transformed numerous fields by leveraging their extensive reasoning and generalization capabilities. However, as prompts grow more complex, with techniques like chain-of-thought (CoT) and in-context learning (ICL) becoming more prevalent, the computational cost of inference rises sharply. This paper introduces LLMLingua, a prompt compression method designed to mitigate these challenges. By compressing prompts into a more compact form without significant loss of semantic content, LLMLingua enables faster inference at lower cost, achieving up to 20x compression with minimal performance degradation.

Introduction

The evolution of LLMs, evidenced by the widespread adoption of models like ChatGPT, has driven the use of longer and more intricate prompts to elicit domain-specific knowledge effectively. This, in turn, has escalated computational demands, creating a tension between the desire for detailed prompts and the need for computational efficiency. Existing efficiency techniques primarily modify model parameters, which is not viable for LLMs accessible only via APIs. LLMLingua addresses these challenges with a prompt compression mechanism that retains the essential information while significantly reducing prompt length.

LLMLingua: Core Components and Methodology

LLMLingua comprises several innovative components that work in tandem to achieve high-fidelity prompt compression:

1. Budget Controller

This component dynamically allocates different compression ratios to the different parts of the prompt (instruction, demonstrations, question) while preserving semantic integrity, even at high compression rates. It performs coarse-grained, demonstration-level compression to eliminate redundant examples, as sketched below.
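
The sketch below illustrates one way a demonstration-level budget could work, using a small causal LM (GPT-2 here) as the compressor: demonstrations are ranked by perplexity and retained, most informative first, until an allocated token budget is exhausted. The function names and the selection heuristic are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch of demonstration-level budget control (not the official
# implementation). Assumes a small causal LM (GPT-2) as the compressor model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the small compressor LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def select_demonstrations(demos: list[str], token_budget: int) -> list[str]:
    """Keep the highest-perplexity (most informative) demonstrations
    until the allocated token budget is exhausted."""
    ranked = sorted(demos, key=perplexity, reverse=True)
    selected, used = [], 0
    for demo in ranked:
        n_tokens = len(tokenizer(demo).input_ids)
        if used + n_tokens > token_budget:
            break
        selected.append(demo)
        used += n_tokens
    return selected
```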

2. Iterative Token-level Compression

Building on the concept that removing tokens with lower perplexity impacts the LLM's comprehension minimally, this algorithm iteratively compresses the prompt at a granular level, considering the interdependence between tokens to retain critical information.
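
A simplified sketch of this idea follows, reusing the `tokenizer` and `model` from the previous snippet: the prompt is processed segment by segment, per-token perplexity is computed conditioned on the already-compressed prefix, and only tokens above a quantile threshold are kept. The paper's iterative algorithm is more refined; this only conveys the core mechanism under those simplifying assumptions.

```python
# Simplified sketch of iterative token-level compression. Drops low-perplexity
# tokens segment by segment, conditioning each segment on the compressed prefix.
import torch

def token_perplexities(prefix_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
    """Per-token perplexity of `segment_ids` given the already-compressed prefix."""
    ids = torch.cat([prefix_ids, segment_ids], dim=-1).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids).logits[0]
    # Logits at position i predict token i+1; keep only the segment's targets.
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    targets = ids[0, 1:]
    nll = -log_probs[torch.arange(targets.size(0)), targets]
    return torch.exp(nll[-segment_ids.size(0):])

def compress_tokens(text: str, keep_ratio: float = 0.5, segment_len: int = 64) -> str:
    """Keep roughly `keep_ratio` of tokens, preferring high-perplexity ones."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    kept = ids[:1]  # always keep the first token as context
    for start in range(1, ids.size(0), segment_len):
        segment = ids[start:start + segment_len]
        ppl = token_perplexities(kept, segment)
        threshold = torch.quantile(ppl, 1.0 - keep_ratio)
        kept = torch.cat([kept, segment[ppl >= threshold]])
    return tokenizer.decode(kept)
```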

3. Distribution Alignment

To address potential distribution discrepancies between the small language model used for compression and the target LLM, LLMLingua employs an instruction tuning-based alignment method. This keeps the compressed prompts consistent with the target LLM's distribution, further enhancing performance.
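
A minimal sketch of what such alignment could look like, assuming a standard instruction-tuning setup with Hugging Face Transformers (not the paper's exact recipe): the small compressor model is fine-tuned on instruction–response pairs whose responses were generated by the target LLM, so that its token distribution better tracks the target's. The data and hyperparameters shown are placeholders.

```python
# Hypothetical distribution-alignment step: instruction-tune the small
# compressor LM on outputs produced by the target LLM.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder alignment data: instructions paired with target-LLM responses.
pairs = [{"text": "Instruction: Summarize the passage.\nResponse: <target-LLM output>"}]

dataset = Dataset.from_list(pairs).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aligned-compressor", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```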

Evaluation and Results

Extensive experiments conducted across diverse datasets, including GSM8K, BBH, ShareGPT, and Arxiv-March23, demonstrate LLMLingua's efficacy. Notably, it achieves state-of-the-art performance, delivering up to 20x compression with negligible impact on the LLM's output quality. This not only showcases the potential for computational savings but also opens up new possibilities for handling longer and more complex prompts.

Conclusion and Future Directions

LLMLingua marks a significant advancement in the realm of LLM efficiency, offering a viable solution to the challenges posed by increasingly lengthy prompts. By ensuring rapid inference without compromising semantic richness, it paves the way for more sustainable and versatile applications of LLMs. Future work might explore further optimizations and the potential integration of LLMLingua's compression mechanism with other efficiency-enhancing techniques.
