Large language models (LLMs) have demonstrated impressive capabilities across a wide range of applications. However, no single model can optimally address all tasks, especially when considering the trade-off between performance and cost. This has led to the development of LLM routing systems that leverage the strengths of various models.
Despite the growing interest in LLM routing, progress has been hindered by the lack of a standardized benchmark for evaluating router performance. The ROUTERBENCH paper addresses this gap by introducing a comprehensive evaluation framework and dataset.
Key contributions of ROUTERBENCH:
A theoretical framework for assessing router efficiency in terms of cost (in dollars) and performance
Evaluation of various routing strategies, demonstrating the potential for cost savings without sacrificing performance
Various strategies have been proposed to optimize the cost and performance of current LLMs. Here, we provide an overview with a focus on routing-related approaches.
These single-LLM enhancements are usually model- and scenario-specific, and do not automatically benefit from the growing number of available LLMs.
Beyond single-LLM approaches, LLM synthesis uses an ensemble of multiple LLMs, integrating their outputs into an enhanced final result (Jiang et al., 2023b). Related work shows that strategically combining smaller models can match or even outperform larger models (Lu et al., 2024). However, these methods require at least two steps, text generation and synthesis, which increases cost and latency and creates challenges for production use.
Unlike LLM synthesis, routing can select a suitable model for each input without performing inference on every candidate model. Routing can be classified into two categories:
Layered inference framework: Re-routes complex queries to an advanced model for improved results (Wang et al., 2023).
Predictive routing: Selects the optimal LLM without evaluating the output. Approaches include:
Predictive routers can bring substantial cost and performance improvements without sacrificing latency, and several early works are dedicated to this direction.
While many routers exist, a systematic benchmark for their evaluation has been lacking. ROUTERBENCH addresses this gap by introducing a standardized benchmark for router evaluation.
The key challenge in evaluating routing systems is balancing the conflicting goals of maximizing performance and minimizing cost. The paper introduces a mathematical framework to capture this multi-faceted trade-off:
A router R is defined as a function that takes an input x and parameters θ and selects the most suitable LLM from the set L to complete the prompt.
By experimenting with various router parameters θ_1 to θ_k, we obtain a series of data points (c_{Rθ_1}, q_{Rθ_1}), ..., (c_{Rθ_k}, q_{Rθ_k}) that can be plotted in the cost-quality (c-q) plane for comparison with individual LLMs.
Two key operations are introduced:
Linear Interpolation - Computes a weighted average between any two routers' points in the c-q plane. The interpolated router R_int(R_θ1, R_θ2, t) can achieve any cost-performance trade-off between the two routers, controlled by the mixing parameter t.
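As a concrete illustration, the interpolation can be sketched as random dispatch between two routers. The function names and the example (cost, quality) points below are hypothetical, not from the paper:

```python
import random

def route_interpolated(router1, router2, t):
    """R_int(R1, R2, t): dispatch each query to router1 with probability t,
    otherwise to router2.  In expectation, the resulting cost and quality
    are the convex combinations t*c1 + (1-t)*c2 and t*q1 + (1-t)*q2."""
    return router1 if random.random() < t else router2

def expected_point(p1, p2, t):
    """Expected (cost, quality) point of the interpolated router."""
    (c1, q1), (c2, q2) = p1, p2
    return (t * c1 + (1 - t) * c2, t * q1 + (1 - t) * q2)

# Hypothetical example: a cheap weak router and an expensive strong one.
cheap, strong = (0.2, 0.55), (1.0, 0.80)
print(expected_point(cheap, strong, 0.5))  # midpoint of the two points
```

Varying t from 0 to 1 traces the straight segment between the two routers' points in the c-q plane, which is what makes the convex-hull construction below meaningful.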
Extrapolation - Extends a router to cover the full cost domain [0, ∞). Cost can trivially be increased without affecting performance (e.g., run the LLM multiple times and use the last output). To reduce cost, interpolate with the "null router" of zero cost and zero performance. This allows fair comparison between routers on a shared cost domain.
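A minimal sketch of this extrapolation on a router summarized by a single (cost, quality) point; the function name is illustrative:

```python
def extrapolate(point, target_cost):
    """Extend a router's (cost, quality) point to an arbitrary target cost.
    Above the router's own cost, extra budget can be burned without changing
    quality; below it, interpolate with the 'null router' at (0, 0), which
    scales expected quality down proportionally."""
    cost, quality = point
    if target_cost >= cost:
        return (target_cost, quality)
    t = target_cost / cost          # probability of using the real router
    return (target_cost, t * quality)
```

For example, halving the budget of a router at (1.0, 0.8) by mixing it 50/50 with the null router yields an expected point of (0.5, 0.4).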
On the c-q plane with multiple routers, a non-decreasing convex hull can be constructed. This represents the optimal routing strategy - for any target cost, performance is maximized by interpolating the two routers at the vertices of the hull segment intersecting that cost.
The Zero Router is defined as one that selects LLMs based on this non-decreasing convex hull. It provides a simple mathematical baseline to assess if more complex routers provide any benefit.
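A minimal sketch of constructing the non-decreasing convex hull from sampled (cost, quality) points; this is one plausible implementation, not the paper's code, and the model points are hypothetical:

```python
def nd_convex_hull(points):
    """Non-decreasing convex hull of (cost, quality) points: the concave,
    monotone upper envelope reachable by interpolating the given routers."""
    pts = sorted(points)
    # Keep only Pareto-optimal points (strictly higher quality at higher cost).
    pareto = []
    for c, q in pts:
        if not pareto or q > pareto[-1][1]:
            pareto.append((c, q))
    # Monotone-chain sweep: pop the last vertex while it lies on or below
    # the segment joining its neighbours (a non-concave turn).
    hull = []
    for c, q in pareto:
        while len(hull) >= 2:
            (c1, q1), (c2, q2) = hull[-2], hull[-1]
            if (c2 - c1) * (q - q1) >= (q2 - q1) * (c - c1):
                hull.pop()
            else:
                break
        hull.append((c, q))
    return hull

# Hypothetical LLM points: (0.5, 0.4) and (0.8, 0.5) fall below the
# envelope, so the Zero Router never selects them.
models = [(0.1, 0.3), (0.5, 0.4), (0.8, 0.5), (0.6, 0.7), (1.2, 0.75)]
print(nd_convex_hull(models))  # [(0.1, 0.3), (0.6, 0.7), (1.2, 0.75)]
```

The hull's vertices are exactly the models the Zero Router mixes between: for a target cost inside a segment, it interpolates the segment's two endpoint models.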
To compare two routing systems Rθ and Rλ:
Sample their parameters to generate a set of points on the c-q plane
Construct a non-decreasing convex hull for each set, R̂_θ and R̂_λ, on the shared cost domain [c_min, c_max]
The AIQ (Average Improvement in Quality) metric, computed as the area under each hull divided by the width of the cost window, provides a single value to easily compare the overall performance of different routing systems. Outperforming the Zero Router baseline indicates a routing system provides real value.
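Under the assumption that AIQ is the area under a router's hull normalized by the cost window, it can be approximated numerically; the function names here are illustrative:

```python
def quality_at(hull, c):
    """Piecewise-linear quality of a hull (list of (cost, quality)
    vertices sorted by cost, assumed to span c) evaluated at cost c."""
    for (c1, q1), (c2, q2) in zip(hull, hull[1:]):
        if c1 <= c <= c2:
            return q1 + (q2 - q1) * (c - c1) / (c2 - c1)
    raise ValueError("cost outside the hull's domain")

def aiq(hull, c_min, c_max, n=1000):
    """Average Improvement in Quality: area under the hull over
    [c_min, c_max] divided by the window width (trapezoidal rule)."""
    h = (c_max - c_min) / n
    total = 0.5 * (quality_at(hull, c_min) + quality_at(hull, c_max))
    total += sum(quality_at(hull, c_min + i * h) for i in range(1, n))
    return total * h / (c_max - c_min)
```

For instance, a hull rising linearly from (0, 0) to (1, 1) has an AIQ of 0.5 over the window [0, 1].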
In summary, this mathematical framework enables systematic evaluation of LLM routing approaches by unifying their cost-quality tradeoffs through interpolation and AIQ. The Zero Router provides a crucial baseline to assess if more sophisticated techniques yield meaningful improvements.
ROUTERBENCH consists of a broad spectrum of tasks relevant to LLM applications:
The dataset was constructed by performing inference with 14 LLMs (11 for non-RAG tasks) on 8 existing datasets widely used for evaluating state-of-the-art models. This ensures ROUTERBENCH is representative of the challenges faced in real-world LLM deployment.
Experiments were conducted to evaluate predictive routers (KNN and MLP-based) and non-predictive routers (cascading) on ROUTERBENCH.
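To illustrate the flavor of a predictive router, here is a hedged KNN-style sketch: the feature vectors, function names, and quality-minus-cost scoring rule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def knn_route(query_vec, train_vecs, train_quality, model_costs, k=5, lam=0.1):
    """Hypothetical KNN router: predict each model's quality on the query
    as the mean quality it achieved on the k most similar training queries,
    then pick the model with the best quality-minus-cost score.

    train_vecs:    (n, d) query feature vectors (e.g. embeddings)
    train_quality: (n, m) observed quality of each of m models per query
    model_costs:   (m,)   per-query cost of each model
    lam:           cost-sensitivity knob trading quality against cost
    """
    dists = np.linalg.norm(train_vecs - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    pred_quality = train_quality[nearest].mean(axis=0)     # (m,)
    scores = pred_quality - lam * np.asarray(model_costs)  # trade-off
    return int(np.argmax(scores))                          # index of chosen model
```

Sweeping lam traces out the router's cost-quality curve: higher values push routing toward cheaper models, producing the family of (c, q) points used in the evaluation.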
Findings:
ROUTERBENCH establishes a robust benchmark for evaluating LLM routing systems, enabling systematic comparison of different approaches. The results highlight the potential for cost savings through effective routing without compromising performance.
Future directions:
ROUTERBENCH marks an important step towards standardizing the evaluation of LLM routing systems, setting the stage for further advancements in cost-effective and high-performing LLM deployment.
Created 2024-04-04T17:13:55-07:00