FrugalGPT: Making Large Language Models Affordable and Efficient

Large Language Models (LLMs) like GPT-4, ChatGPT, and J1-Jumbo have revolutionized natural language processing, enabling unprecedented performance on a wide range of tasks. However, the high cost of querying these LLM APIs is a major barrier to their widespread adoption, especially for high-throughput applications.

A recent paper titled "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance" by Lingjiao Chen, Matei Zaharia, and James Zou from Stanford University addresses this challenge. They propose strategies to substantially reduce the inference cost of using LLMs while maintaining or even improving accuracy.

The High Cost of LLMs

The authors start by analyzing the pricing structures of popular LLM APIs and find that costs can differ by over two orders of magnitude. For example, processing 1M input tokens costs $30 with GPT-4 but only about $0.20 with GPT-J. This heterogeneity in pricing, combined with high absolute costs, makes it expensive to use LLMs at scale.

Strategies for Cost Reduction

To tackle this, the paper outlines three key strategies:

  1. Prompt Adaptation: Reducing prompt sizes to lower the per-query cost. Techniques include prompt selection (using a subset of in-context examples) and query concatenation (processing multiple queries with one prompt).

  2. LLM Approximation: Approximating expensive LLMs with cheaper models or infrastructure. For example, caching an LLM's outputs and reusing them for similar queries, or fine-tuning a smaller model on an expensive LLM's outputs (a caching sketch follows this list).

  3. LLM Cascade: Adaptively selecting which LLMs to query based on the input. A scoring function assesses if an LLM's output is reliable; if not, the query is passed to the next LLM in the cascade. This leverages the strengths of different LLMs.
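
To make the caching idea in strategy 2 concrete, below is a minimal sketch of a completion cache. The `query_llm` wrapper is a hypothetical stand-in for an expensive API call, and exact-match keying is the simplest possible choice; catching merely similar queries, as the paper envisions, would require something like embedding-based lookup.

```python
import hashlib

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for an expensive LLM API call."""
    ...

class CompletionCache:
    """Store LLM outputs and reuse them when the same query recurs."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def _key(self, prompt: str) -> str:
        # Exact-match keying; a production system might instead use
        # embedding similarity to also catch paraphrased queries.
        return hashlib.sha256(prompt.encode()).hexdigest()

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        if key not in self._store:      # cache miss: pay for one API call
            self._store[key] = query_llm(prompt)
        return self._store[key]         # cache hit: zero marginal cost
```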

Training the Scoring Function

The scoring function is a critical component of the LLM cascade strategy. It is denoted by g(·, ·) : Q × A → [0, 1], and it generates a reliability score given a query q and an answer a produced by an LLM API.

The scoring function is obtained by training a simple regression model that learns whether a generation is correct based on the query and the generated answer. The training process involves the following steps:

  1. Collect a dataset of queries, generated answers from various LLMs, and their corresponding correctness labels.

  2. Train a regression model (e.g., DistilBERT) on this dataset, using the query and generated answer as input features and the correctness label as the target variable.

  3. The trained model can then be used as the scoring function g(·, ·) to assess the reliability of an LLM's response for a given query (a sketch of this setup follows the list).
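
As an illustration of steps 2 and 3, here is a minimal sketch of such a scorer built with Hugging Face's transformers library. This is one plausible setup, not the authors' exact code; the fine-tuning loop on the labeled (query, answer, correctness) dataset (e.g., via the Trainer API) is omitted for brevity.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # labels: incorrect / correct
)
# ... fine-tune `model` on (query, answer) pairs with correctness labels ...

def score(query: str, answer: str) -> float:
    """Reliability score g(query, answer) in [0, 1]."""
    inputs = tokenizer(query, answer, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability assigned to the "correct" class.
    return logits.softmax(dim=-1)[0, 1].item()
```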

Learning the optimal list of LLMs to include in the cascade (denoted by L) and their corresponding threshold values (denoted by τ) is modeled as a constrained optimization problem. The objective is to maximize the expected reward (i.e., the quality of the final answer) while keeping the average cost below a budget b.
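
In symbols, writing a(q) for the cascade's final answer to query q, c(q) for the total cost incurred, and r(q, a) for a reward measuring answer quality, the problem is (lightly simplifying the paper's notation):

  maximize over (L, τ):  E[ r(q, a(q)) ]   subject to   E[ c(q) ] ≤ b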

This optimization problem is inherently a mixed-integer program and thus computationally expensive to solve exactly. To address this, the authors develop a specialized optimizer that prunes the search space of L (ignoring any list of LLMs whose answers rarely disagree, since such combinations add little) and approximates the objective by interpolating it from a small number of samples. This yields an efficient implementation with satisfactory performance.
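
Once L and τ are learned, inference itself is a short loop: query the cheapest LLM first, accept its answer if the score clears that model's threshold, and otherwise escalate. The sketch below assumes the `score` function from the previous snippet; the API wrappers and threshold values are made-up placeholders, since in practice the optimizer chooses both the model list and the thresholds.

```python
def call_gpt_j(q: str) -> str: ...    # hypothetical API wrappers,
def call_chatgpt(q: str) -> str: ...  # ordered from cheapest to
def call_gpt_4(q: str) -> str: ...    # most expensive

# (model name, API wrapper, acceptance threshold τ)
CASCADE = [
    ("gpt-j",   call_gpt_j,   0.9),
    ("chatgpt", call_chatgpt, 0.8),
    ("gpt-4",   call_gpt_4,   0.0),   # final model: always accept its answer
]

def cascade_answer(query: str) -> str:
    """Try LLMs in order, stopping at the first sufficiently reliable answer."""
    for _model, call_fn, threshold in CASCADE:
        answer = call_fn(query)
        if score(query, answer) >= threshold:
            return answer             # reliable enough: skip pricier models
    return answer                     # unreachable with a 0.0 final threshold
```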

FrugalGPT: Promising Results

As a proof of concept, the authors implement FrugalGPT, a system that uses the LLM cascade strategy to intelligently combine GPT-4, ChatGPT, GPT-3, and other LLMs. On several datasets spanning different domains, FrugalGPT matches the accuracy of the best individual LLM while reducing cost by up to 98%, or improves accuracy by up to 4% at the same cost.

These impressive results stem from the fact that even cheap LLMs can sometimes correctly answer queries that stump more powerful models. By learning which LLM to use for each query, FrugalGPT harnesses this diversity to boost efficiency.

The Road Ahead

While FrugalGPT demonstrates the potential of the proposed techniques, the authors emphasize that this is just the tip of the iceberg. Avenues for future work include combining strategies (e.g., joint prompt and LLM selection), considering additional factors like latency and fairness, and collaborations between LLM users and providers to mitigate environmental impact.

As LLMs continue to advance, it's crucial that we develop methods to use them sustainably and efficiently. FrugalGPT is an exciting step in this direction, paving the way for more cost-effective and environmentally friendly adoption of these powerful tools.
