ROUTERBENCH: A Benchmark for Multi-LLM Routing System

Large language models (LLMs) have demonstrated impressive capabilities across a wide range of applications. However, no single model can optimally address all tasks, especially when considering the trade-off between performance and cost. This has led to the development of LLM routing systems that leverage the strengths of various models.

The Need for a Standardized Benchmark

Despite the growing interest in LLM routing, progress has been hindered by the lack of a standardized benchmark for evaluating router performance. The ROUTERBENCH paper addresses this gap by introducing a comprehensive evaluation framework and dataset.

Key contributions of ROUTERBENCH:

A diverse benchmark dataset covering major LLM tasks, including both open-source and proprietary models
A theoretical framework for assessing router efficiency in terms of cost (in dollars) and performance
Evaluation of various routing strategies, demonstrating the potential for cost savings without sacrificing performance

Related Work

Various strategies have been proposed to optimize the cost and performance of current LLMs. Here, we provide an overview with a focus on routing-related approaches.

Single LLM Enhancement

Fine-tuning: Used to improve models for specific tasks, requiring additional training and domain-specific data (Rafailov et al., 2023).
Prompting mechanisms: Chain-of-Thought (CoT) (Wei et al., 2022; Zhou et al., 2023; Wang et al., 2022) and Tree of Thoughts (ToT) (Yao et al., 2023) can bolster LLM performance without additional fine-tuning.
Mixture-of-Experts (MoE): Explores routing within the model to enhance performance efficiently, using specialized "experts" and routing input to the best expert (Eigen et al., 2014; Shazeer et al., 2017; Fedus et al., 2022; Du et al., 2022; Shen et al., 2023; Si et al., 2023).

These single-LLM enhancements are usually model and scenario-specific, and may not benefit from the growing number of LLMs.

LLM Synthesis

Beyond single LLM approaches, LLM synthesis utilizes an ensemble of multiple LLMs, integrating their outputs into an enhanced final result (Jiang et al., 2023b). Another approach shows that strategically combining smaller models can match or even outperform larger models (Lu et al., 2024). However, these methods require at least two steps: text generation and synthesis, increasing costs and latency, which creates challenges for production use.

Routing

Unlike LLM synthesis, routing can select the suitable model for specific input without performing inference on every candidate model. Routing can be classified into two categories:

Non-predictive routing: Retrieves outputs from LLMs and directly picks one without a model-assisted synthesis step. Examples include:
FrugalGPT (Chen et al., 2023): Employs a generation judger to assess response quality from various LLMs, invoking them sequentially until a predefined quality threshold is met.
Systems integrating small language models with LLMs (Madaan et al., 2023; Yue et al., 2023; Lee et al., 2023).
Layered inference framework: Re-routes complex queries to an advanced model for improved results (Wang et al., 2023).
Predictive routing: Selects the optimal LLM without evaluating the output. Approaches include:
Routers utilizing supervised learning algorithms (Shnitzer et al., 2023).
Reward model-based techniques (Hari & Thomson, 2023; Lu et al., 2023).
Meta-model trained on inputs and model-specific tokens to predict performance scores (Sakota et al., 2023).

Predictive routers can bring substantial cost and performance improvements without sacrificing latency, with several early works dedicated to this field.

While many routers exist, a systematic benchmark for their evaluation has been lacking. ROUTERBENCH aims to address this issue and introduce a benchmark for router evaluation.

Mathematical Formulation for Router Evaluation

The paper introduces a mathematical framework to capture the multi-faceted nature of router evaluation. Key components:

Formulation of cost and performance metrics for individual LLMs
Linear interpolation to achieve cost-performance trade-offs between routers
Non-decreasing convex hull to identify optimal routing strategies
AIQ (Average Improvement in Quality) metric for comparing routing systems

The key challenge in evaluating routing systems is balancing the conflicting goals of maximizing performance and minimizing cost. The paper introduces a mathematical framework to address this:

Consider a set of LLMs L = {LLM1, ..., LLMm} and a dataset D = {x1, ..., x|D|}
For each LLMj and input xi, generate output oji = LLMj(xi)
Each output oji has an associated cost c(oji) and quality/performance q(oji)
This establishes an expected cost cm and quality qm for each LLMm over dataset D

A router R is defined as a function that takes input x and parameters θ and selects the most suitable LLM from set L to complete the prompt.

By experimenting with various router parameters θ1 to θk, we get a series of datapoints (cRθ1, qRθ1), ..., (cRθk, qRθk) that can be plotted in the cost-quality (c-q) plane for comparison to individual LLMs.

Two key operations are introduced:

Linear Interpolation - Computes a weighted average between any two points (routers) in the c-q plane. This interpolation Rint(Rθ1, Rθ2, t) can achieve any cost-performance tradeoff between the two routers based on parameter t.
Extrapolation - Extends a router to cover the full cost domain [0,∞]. Trivially, cost can be added without affecting performance (e.g. run LLM multiple times, use last output). To reduce cost, interpolate with the "null router" of 0 cost and 0 performance. This allows fair comparison between routers.

On the c-q plane with multiple routers, a non-decreasing convex hull can be constructed. This represents the optimal routing strategy - for any target cost, performance is maximized by interpolating the two routers at the vertices of the hull segment intersecting that cost.

The Zero Router is defined as one that selects LLMs based on this non-decreasing convex hull. It provides a simple mathematical baseline to assess if more complex routers provide any benefit.

To compare two routing systems Rθ and Rλ:

Sample their parameters to generate a set of points on the c-q plane
Construct a non-decreasing convex hull for each set, Rfθ and Rfλ on the shared cost domain [cmin, cmax]
Calculate the AIQ (Average Improvement in Quality) metric for each by integrating the convex hull over the cost range and normalizing

This AIQ metric provides a single value to easily compare the overall performance of different routing systems. Outperforming the Zero Router baseline indicates a routing system provides real value.

In summary, this mathematical framework enables systematic evaluation of LLM routing approaches by unifying their cost-quality tradeoffs through interpolation and AIQ. The Zero Router provides a crucial baseline to assess if more sophisticated techniques yield meaningful improvements.

Benchmark Construction

ROUTERBENCH consists of a broad spectrum of tasks relevant to LLM applications:

Commonsense reasoning
Knowledge-based language understanding
Conversation
Math
Coding
Retrieval-augmented generation (RAG)

The dataset was constructed by performing inference with 14 LLMs (11 for non-RAG tasks) on 8 existing datasets widely used for evaluating state-of-the-art models. This ensures ROUTERBENCH is representative of the challenges faced in real-world LLM deployment.

Experimental Results

Experiments were conducted to evaluate predictive routers (KNN and MLP-based) and non-predictive routers (cascading) on ROUTERBENCH.

Findings:

Simple routing algorithms can achieve performance comparable to the best individual LLMs at lower costs
Cascading routers with low error rates significantly outperform both individual models and the Zero router baseline
Routers effectively identify task-specific features (e.g. time-sensitivity) to select appropriate models
Substantial room for improvement remains compared to the Oracle router

Implications and Future Work

ROUTERBENCH establishes a robust benchmark for evaluating LLM routing systems, enabling systematic comparison of different approaches. The results highlight the potential for cost savings through effective routing without compromising performance.

Future directions:

Incorporating additional evaluation criteria (latency, throughput, etc.)
Expanding dataset coverage to include more tasks and newer LLMs
Exploring advanced router designs and prompt optimization techniques
Addressing challenges in retrieval-augmented generation (RAG) contexts

ROUTERBENCH marks an important step towards standardizing the evaluation of LLM routing systems, setting the stage for further advancements in cost-effective and high-performing LLM deployment.

References

Created 2024-04-04T17:13:55-07:00