Multi-label classification problems with thousands of possible classes are extremely challenging, especially when using in-context learning with large language models (LLMs). Demonstrating every possible class in the prompt is infeasible, and LLMs may lack the knowledge to precisely assign the correct labels.
To tackle this problem, researchers from Ghent University and Stanford University have proposed a novel approach called Infer-Retrieve-Rank (IReRa). This multi-step program leverages interactions between LLMs and retrievers to efficiently handle extreme multi-label classification (XMC) tasks.
IReRa works in three key steps: an Infer step that proposes queries, a Retrieve step that maps them to candidate labels, and a Rank step that orders the candidates. Each step is described in detail below.
Crucially, the retriever and LLMs are kept frozen. Rather than fine-tuning, the LMs learn in-context how to formulate relevant queries for the retriever and how to interpret its results, which makes the frozen retriever far more flexible in practice.
The underlying models, retriever, and prompts are treated as hyperparameters that can be automatically tuned or easily configured. With only 10 unlabeled training examples and around 50 labeled validation examples, a zero-shot teacher LM with a minimal prompt can bootstrap a few-shot prompt for the two LM components.
Let's consider the task of classifying job postings based on the required skills using the ESCO (European Skills, Competences, and Occupations) taxonomy. We'll go through the three steps described in the paper.
Step 1: Infer-Retrieve-Rank Program
Infer: An LLM (e.g., Llama-2) processes the input job posting and predicts relevant skill-related query terms, such as "Python," "machine learning," and "agile development."
Retrieve: The all-mpnet-base-v2 retriever maps these query terms to the most similar ESCO skills. For example, "Python" might map to "Python (programming language)," "machine learning" to "Machine learning," and "agile development" to "Agile software development."
Rank: Another LLM (e.g., GPT-4) takes the retrieved ESCO skills and the original job posting and ranks the skills based on their prominence in the job requirements.
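To make the pipeline concrete, here is a minimal sketch of the three steps as a DSPy module. The signature docstrings, field names, and the use of sentence-transformers for the retrieval step are illustrative assumptions, not the authors' exact implementation.

```python
import dspy
from sentence_transformers import SentenceTransformer, util

class InferSkills(dspy.Signature):
    """Given a job posting, list the skills it requires."""
    posting = dspy.InputField(desc="text of the job posting")
    queries = dspy.OutputField(desc="comma-separated skill query terms")

class RankSkills(dspy.Signature):
    """Order the candidate ESCO skills by how central they are to the posting."""
    posting = dspy.InputField(desc="text of the job posting")
    candidates = dspy.InputField(desc="retrieved ESCO skill labels")
    ranking = dspy.OutputField(desc="skills ordered from most to least relevant")

class IReRa(dspy.Module):
    def __init__(self, esco_labels, k=10):
        super().__init__()
        # In the paper's setup, Infer uses a smaller LM (Llama-2) and Rank a stronger
        # one (GPT-4); each predictor can be bound to its own LM in DSPy.
        self.infer = dspy.ChainOfThought(InferSkills)
        self.rank = dspy.ChainOfThought(RankSkills)
        # Frozen off-the-shelf retriever over the ESCO label space
        # (embedding all labels up front; fine for a sketch).
        self.encoder = SentenceTransformer("all-mpnet-base-v2")
        self.esco_labels = esco_labels
        self.label_emb = self.encoder.encode(esco_labels, convert_to_tensor=True)
        self.k = k

    def forward(self, posting):
        # 1) Infer: predict free-form skill queries from the posting.
        queries = self.infer(posting=posting).queries.split(",")
        # 2) Retrieve: map each query to its nearest ESCO skills.
        q_emb = self.encoder.encode([q.strip() for q in queries], convert_to_tensor=True)
        hits = util.semantic_search(q_emb, self.label_emb, top_k=self.k)
        candidates = {self.esco_labels[h["corpus_id"]] for per_q in hits for h in per_q}
        # 3) Rank: let the second LM order the retrieved candidates.
        return self.rank(posting=posting, candidates="; ".join(sorted(candidates)))
```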
Step 2: Prompt Bootstrapping
The authors use a small number of unlabeled job postings (e.g., 10) as the training set and a slightly larger set of labeled job postings (e.g., 50) as the validation set.
They define a minimal seed prompt for each component (Infer and Rank) using the DSPy Signature abstraction. This prompt describes the input and output fields and provides a basic task description.
A zero-shot teacher LM (e.g., GPT-3.5) is used with the seed prompt to generate outputs for the unlabeled training examples. These outputs serve as pseudo-labels for bootstrapping.
The generated outputs are used to create a few-shot prompt for the student LMs (Llama-2 for Infer and GPT-4 for Rank). The few-shot prompt includes the input-output pairs from the training set.
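One way to express this bootstrapping step, assuming the program sketched above, is with DSPy's BootstrapFewShot teleprompter: the teacher LM runs the zero-shot seed prompts over the unlabeled postings, and the resulting input-output traces are attached as few-shot demonstrations to the student predictors. The model identifiers and settings below are illustrative assumptions, not the paper's exact configuration.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Illustrative model choices (not necessarily the paper's exact checkpoints).
llama2 = dspy.HFModel(model="meta-llama/Llama-2-7b-chat-hf")   # student LM for Infer
gpt35_teacher = dspy.OpenAI(model="gpt-3.5-turbo")             # zero-shot teacher
# Default LM for the program; the Rank predictor would be bound to GPT-4 in the same way.
dspy.settings.configure(lm=llama2)

# A handful (~10) of unlabeled job postings as the training set.
unlabeled_postings = [
    "We are hiring a backend developer with Python, Django, and Kubernetes experience...",
    "Looking for an HR generalist familiar with recruitment and agile ways of working...",
]
trainset = [dspy.Example(posting=p).with_inputs("posting") for p in unlabeled_postings]

# The teacher executes the seed prompts on the training inputs; its traces become
# the few-shot demonstrations for the student predictors.
# esco_labels: the list of ESCO skill labels, as in the earlier IReRa sketch.
bootstrapper = BootstrapFewShot(teacher_settings=dict(lm=gpt35_teacher),
                                max_bootstrapped_demos=4)
bootstrapped_irera = bootstrapper.compile(IReRa(esco_labels), trainset=trainset)
```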
Step 3: Prompt Optimization
The authors use the DSPy programming model to define the optimization procedure for the few-shot prompts.
They instantiate the IReRa program with the student LMs and the all-mpnet-base-v2 retriever, using the few-shot prompts from Step 2.
The program is executed on the labeled validation set, and the performance is measured using the Reciprocal Rank (RR) metric.
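A reciprocal-rank metric in DSPy's metric convention (a function of an example, a prediction, and an optional trace) could look like the sketch below; the `ranking` and `gold_skills` field names follow the illustrative program above and are assumptions.

```python
def reciprocal_rank(example, pred, trace=None):
    """Return 1/r, where r is the rank of the first gold skill in the predicted ranking."""
    predicted = [skill.strip() for skill in pred.ranking.split(";")]
    gold = set(example.gold_skills)
    for position, skill in enumerate(predicted, start=1):
        if skill in gold:
            return 1.0 / position
    return 0.0
```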
The DSPy optimizer tunes the prompts by selecting the most informative examples from the training set based on the validation performance. This process is repeated for a fixed number of iterations or until convergence.
The optimized prompts are used to instantiate the final IReRa program, which can then be applied to new, unseen job postings.
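Putting the pieces together, the optimization and final inference might look roughly like this. `BootstrapFewShotWithRandomSearch` is one DSPy optimizer that re-runs the bootstrapping internally and uses the validation metric to choose among candidate demonstration sets, so it effectively combines Steps 2 and 3 in a single call; the settings and data variables below are illustrative, not the paper's exact recipe.

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# ~50 labeled validation postings: (posting text, gold ESCO skills) pairs (assumed variable).
valset = [dspy.Example(posting=p, gold_skills=skills).with_inputs("posting")
          for p, skills in labeled_validation]

optimizer = BootstrapFewShotWithRandomSearch(
    metric=reciprocal_rank,        # validation metric from the sketch above
    max_bootstrapped_demos=4,      # demonstrations attached to each predictor
    num_candidate_programs=8,      # how many candidate demo sets to evaluate
)
optimized_irera = optimizer.compile(IReRa(esco_labels), trainset=trainset, valset=valset)

# Apply the optimized program to a new, unseen job posting.
prediction = optimized_irera(posting="Senior data engineer with Python, Spark, and Kubernetes...")
print(prediction.ranking)
```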
By following these steps, the IReRa approach learns to effectively use the frozen LLMs and retriever for the job posting classification task. The prompt bootstrapping and optimization process helps in adapting the approach to the specific dataset and taxonomy, without the need for fine-tuning the underlying models.
The same process can be applied to other XMC datasets, such as the biomedical literature example, by defining appropriate seed prompts and using domain-specific retrievers like BioLORD.
The key innovation of the IReRa approach lies in the combination of in-context learning, retrievers, and prompt optimization to tackle extreme multi-label classification problems in a data-efficient and computationally lightweight manner.
The Infer-Retrieve-Rank approach utilizes two types of language models: a teacher model and two student models.
The teacher model is a large, pre-trained language model (e.g., GPT-3.5) used only in the prompt bootstrapping step (Step 2). Its role is to generate initial outputs for unlabeled examples, which are then used to create few-shot prompts for the student models. The teacher model is not part of the final IReRa program and is only used to bootstrap the prompts.
The student models are the actual language models used in the IReRa program. There are two student models: one that handles the Infer step (e.g., Llama-2) and one that handles the Rank step (e.g., GPT-4).
The student models handle all the inference in the deployed program, so they can be chosen with cost and deployment constraints in mind; for example, the high-volume Infer step can use a smaller open model such as Llama-2.
In Step 2 (Prompt Bootstrapping), the teacher model generates initial outputs for unlabeled examples. These outputs are used to create few-shot prompts for the respective student models (Infer and Rank). The few-shot prompts contain input-output pairs from the unlabeled examples, which serve as demonstrations for the student models.
In Step 3 (Prompt Optimization), the student models are used with the few-shot prompts created in Step 2. The prompts are optimized to improve the performance of the student models on the specific task and dataset. This optimization is done sequentially:
The sequential optimization allows the Rank student model to benefit from the improved query terms generated by the optimized Infer student model.
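Conceptually, this sequential schedule amounts to two optimizer passes. The sketch below is a rough illustration building on the earlier snippets: `InferRetrieve` is a hypothetical sub-module wrapping only the Infer and Retrieve steps, `retrieval_recall` is a hypothetical metric, and the paper's actual optimization procedure may differ in detail.

```python
from dspy.teleprompt import BootstrapFewShot

# Pass 1: optimize the Infer step alone, judged by whether its queries retrieve the gold skills.
def retrieval_recall(example, pred, trace=None):
    retrieved = set(label.strip() for label in pred.candidates.split(";"))
    gold = set(example.gold_skills)
    return len(retrieved & gold) / max(len(gold), 1)

infer_retrieve = BootstrapFewShot(metric=retrieval_recall).compile(
    InferRetrieve(esco_labels), trainset=trainset)

# Pass 2: reuse the optimized Infer predictor inside the full pipeline, then optimize
# the Rank step against the final-ranking metric (reciprocal_rank from the earlier sketch).
irera = IReRa(esco_labels)
irera.infer = infer_retrieve.infer
optimized_irera = BootstrapFewShot(metric=reciprocal_rank).compile(irera, trainset=trainset)
```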
By using the teacher model for prompt bootstrapping and the student models for the actual IReRa program, the approach achieves a balance between effectiveness and efficiency. The teacher model helps create informative few-shot prompts, while the student models provide a more lightweight and deployable solution for extreme multi-label classification tasks.
The researchers evaluated IReRa on four XMC datasets spanning biomedical and human resources domains. Using a Llama-2 model to infer, an off-the-shelf retriever, and GPT-4 to rank, IReRa achieved state-of-the-art results on three datasets without any finetuning and using orders of magnitude less data compared to specialized systems.
Adapting IReRa to a new dataset can be as simple as writing a minimal zero-shot prompt, configuring which LMs to use, and running the optimization procedure. The optimization happens automatically and can be done in just tens of minutes.
This work demonstrates that efficient in-context learning programs can achieve impressive performance on challenging XMC tasks. By declaratively specifying the program logic and optimization procedure, IReRa can be seamlessly applied to new datasets with minimal prompt engineering.
As LLMs continue to grow in capability, approaches like IReRa that flexibly combine their strengths with retrievers offer an exciting path forward. We may be moving towards a future where prompt and pipeline engineering become less brittle, and optimized modular programs can serve as highly effective general-purpose solutions.