Multi-label classification problems with thousands of possible classes are extremely challenging, especially when using in-context learning with large language models (LLMs). Demonstrating every possible class in the prompt is infeasible, and LLMs may lack the knowledge to precisely assign the correct labels.
To tackle this problem, researchers from Ghent University and Stanford University have proposed a novel approach called Infer-Retrieve-Rank (IReRa). This multi-step program leverages interactions between LLMs and retrievers to efficiently handle extreme multi-label classification (XMC) tasks.
IReRa works in three key steps: an Infer step that proposes queries, a Retrieve step that maps them to candidate labels, and a Rank step that orders the candidates. Each step is described in detail below.
Crucially, the retriever and LLMs are kept frozen. Rather than fine-tuning, the LMs learn in-context how to formulate relevant queries for the retriever and how to interpret its results, which makes the frozen retriever far more flexible in practice.
The underlying models, retriever, and prompts are treated as hyperparameters that can be automatically tuned or easily configured. With only 10 unlabeled training examples and around 50 labeled validation examples, a zero-shot teacher LM with a minimal prompt can bootstrap a few-shot prompt for the two LM components.
Let's consider the task of classifying job postings based on the required skills using the ESCO (European Skills, Competences, and Occupations) taxonomy. We'll go through the three steps described in the paper.
Step 1: Infer-Retrieve-Rank Program
Infer: An LLM (e.g., Llama-2) processes the input job posting and predicts relevant skill-related query terms, such as "Python," "machine learning," and "agile development."
Retrieve: The all-mpnet-base-v2 retriever maps these query terms to the most similar ESCO skills. For example, "Python" might map to "Python (programming language)," "machine learning" to "Machine learning," and "agile development" to "Agile software development."
Rank: Another LLM (e.g., GPT-4) takes the retrieved ESCO skills and the original job posting and ranks the skills based on their prominence in the job requirements.
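To make the pipeline concrete, here is a minimal sketch of the three steps as a DSPy module. The signature docstrings, field names, and the use of sentence-transformers for the retrieval step are illustrative assumptions, not the authors' exact implementation.

```python
import dspy
from sentence_transformers import SentenceTransformer, util

class InferSkills(dspy.Signature):
    """Given a job posting, list the skills it requires."""
    posting = dspy.InputField(desc="text of the job posting")
    queries = dspy.OutputField(desc="comma-separated skill query terms")

class RankSkills(dspy.Signature):
    """Order the candidate ESCO skills by how central they are to the posting."""
    posting = dspy.InputField(desc="text of the job posting")
    candidates = dspy.InputField(desc="retrieved ESCO skill labels")
    ranking = dspy.OutputField(desc="skills ordered from most to least relevant")

class IReRa(dspy.Module):
    def __init__(self, esco_labels, k=10):
        super().__init__()
        # In the paper's setup, Infer uses a smaller LM (Llama-2) and Rank a stronger
        # one (GPT-4); each predictor can be bound to its own LM in DSPy.
        self.infer = dspy.ChainOfThought(InferSkills)
        self.rank = dspy.ChainOfThought(RankSkills)
        # Frozen off-the-shelf retriever over the ESCO label space
        # (embedding all labels up front; fine for a sketch).
        self.encoder = SentenceTransformer("all-mpnet-base-v2")
        self.esco_labels = esco_labels
        self.label_emb = self.encoder.encode(esco_labels, convert_to_tensor=True)
        self.k = k

    def forward(self, posting):
        # 1) Infer: predict free-form skill queries from the posting.
        queries = self.infer(posting=posting).queries.split(",")
        # 2) Retrieve: map each query to its nearest ESCO skills.
        q_emb = self.encoder.encode([q.strip() for q in queries], convert_to_tensor=True)
        hits = util.semantic_search(q_emb, self.label_emb, top_k=self.k)
        candidates = {self.esco_labels[h["corpus_id"]] for per_q in hits for h in per_q}
        # 3) Rank: let the second LM order the retrieved candidates.
        return self.rank(posting=posting, candidates="; ".join(sorted(candidates)))
```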
Step 2: Prompt Bootstrapping
The authors use a small number of unlabeled job postings (e.g., 10) as the training set and a slightly larger set of labeled job postings (e.g., 50) as the validation set.
They define a minimal seed prompt for each component (Infer and Rank) using the DSPy Signature abstraction. This prompt describes the input and output fields and provides a basic task description.
A zero-shot teacher LM (e.g., GPT-3.5) is used with the seed prompt to generate outputs for the unlabeled training examples. These outputs serve as pseudo-labels for bootstrapping.
The generated outputs are used to create a few-shot prompt for the student LMs (Llama-2 for Infer and GPT-4 for Rank). The few-shot prompt includes the input-output pairs from the training set.
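One way to express this bootstrapping step, assuming the program sketched above, is with DSPy's BootstrapFewShot teleprompter: the teacher LM runs the zero-shot seed prompts over the unlabeled postings, and the resulting input-output traces are attached as few-shot demonstrations to the student predictors. The model identifiers and settings below are illustrative assumptions, not the paper's exact configuration.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Illustrative model choices (not necessarily the paper's exact checkpoints).
llama2 = dspy.HFModel(model="meta-llama/Llama-2-7b-chat-hf")   # student LM for Infer
gpt35_teacher = dspy.OpenAI(model="gpt-3.5-turbo")             # zero-shot teacher
# Default LM for the program; the Rank predictor would be bound to GPT-4 in the same way.
dspy.settings.configure(lm=llama2)

# A handful (~10) of unlabeled job postings as the training set.
unlabeled_postings = [
    "We are hiring a backend developer with Python, Django, and Kubernetes experience...",
    "Looking for an HR generalist familiar with recruitment and agile ways of working...",
]
trainset = [dspy.Example(posting=p).with_inputs("posting") for p in unlabeled_postings]

# The teacher executes the seed prompts on the training inputs; its traces become
# the few-shot demonstrations for the student predictors.
# esco_labels: the list of ESCO skill labels, as in the earlier IReRa sketch.
bootstrapper = BootstrapFewShot(teacher_settings=dict(lm=gpt35_teacher),
                                max_bootstrapped_demos=4)
bootstrapped_irera = bootstrapper.compile(IReRa(esco_labels), trainset=trainset)
```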
Step 3: Prompt Optimization
The authors use the DSPy programming model to define the optimization procedure for the few-shot prompts.
They instantiate the IReRa program with the student LMs and the all-mpnet-base-v2 retriever, using the few-shot prompts from Step 2.
The program is executed on the labeled validation set, and the performance is measured using the Reciprocal Rank (RR) metric.
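A reciprocal-rank metric in DSPy's metric convention (a function of an example, a prediction, and an optional trace) could look like the sketch below; the `ranking` and `gold_skills` field names follow the illustrative program above and are assumptions.

```python
def reciprocal_rank(example, pred, trace=None):
    """Return 1/r, where r is the rank of the first gold skill in the predicted ranking."""
    predicted = [skill.strip() for skill in pred.ranking.split(";")]
    gold = set(example.gold_skills)
    for position, skill in enumerate(predicted, start=1):
        if skill in gold:
            return 1.0 / position
    return 0.0
```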
The DSPy optimizer tunes the prompts by selecting the most informative examples from the training set based on the validation performance. This process is repeated for a fixed number of iterations or until convergence.
The optimized prompts are used to instantiate the final IReRa program, which can then be applied to new, unseen job postings.
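Putting the pieces together, the optimization and final inference might look roughly like this. `BootstrapFewShotWithRandomSearch` is one DSPy optimizer that re-runs the bootstrapping internally and uses the validation metric to choose among candidate demonstration sets, so it effectively combines Steps 2 and 3 in a single call; the settings and data variables below are illustrative, not the paper's exact recipe.

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# ~50 labeled validation postings: (posting text, gold ESCO skills) pairs (assumed variable).
valset = [dspy.Example(posting=p, gold_skills=skills).with_inputs("posting")
          for p, skills in labeled_validation]

optimizer = BootstrapFewShotWithRandomSearch(
    metric=reciprocal_rank,        # validation metric from the sketch above
    max_bootstrapped_demos=4,      # demonstrations attached to each predictor
    num_candidate_programs=8,      # how many candidate demo sets to evaluate
)
optimized_irera = optimizer.compile(IReRa(esco_labels), trainset=trainset, valset=valset)

# Apply the optimized program to a new, unseen job posting.
prediction = optimized_irera(posting="Senior data engineer with Python, Spark, and Kubernetes...")
print(prediction.ranking)
```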
By following these steps, the IReRa approach learns to effectively use the frozen LLMs and retriever for the job posting classification task. The prompt bootstrapping and optimization process helps in adapting the approach to the specific dataset and taxonomy, without the need for fine-tuning the underlying models.
The same process can be applied to other XMC datasets, such as the biomedical literature example, by defining appropriate seed prompts and using domain-specific retrievers like BioLORD.
The key innovation of the IReRa approach lies in the combination of in-context learning, retrievers, and prompt optimization to tackle extreme multi-label classification problems in a data-efficient and computationally lightweight manner.
The Infer-Retrieve-Rank approach utilizes two types of language models: a teacher model and two student models.
The teacher model is a large, pre-trained language model (e.g., GPT-3.5) used only in the prompt bootstrapping step (Step 2). Its role is to generate initial outputs for unlabeled examples, which are then used to create few-shot prompts for the student models. The teacher model is not part of the final IReRa program and is only used to bootstrap the prompts.
The student models are the actual language models used in the IReRa program. There are two student models: one that handles the Infer step (e.g., Llama-2) and one that handles the Rank step (e.g., GPT-4).
The student models handle all the inference in the deployed program, so they can be chosen with cost and deployment constraints in mind; for example, the high-volume Infer step can use a smaller open model such as Llama-2.
In Step 2 (Prompt Bootstrapping), the teacher model generates initial outputs for unlabeled examples. These outputs are used to create few-shot prompts for the respective student models (Infer and Rank). The few-shot prompts contain input-output pairs from the unlabeled examples, which serve as demonstrations for the student models.
In Step 3 (Prompt Optimization), the student models are used with the few-shot prompts created in Step 2. The prompts are optimized to improve the performance of the student models on the specific task and dataset. This optimization is done sequentially:
The sequential optimization allows the Rank student model to benefit from the improved query terms generated by the optimized Infer student model.
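Conceptually, this sequential schedule amounts to two optimizer passes. The sketch below is a rough illustration building on the earlier snippets: `InferRetrieve` is a hypothetical sub-module wrapping only the Infer and Retrieve steps, `retrieval_recall` is a hypothetical metric, and the paper's actual optimization procedure may differ in detail.

```python
from dspy.teleprompt import BootstrapFewShot

# Pass 1: optimize the Infer step alone, judged by whether its queries retrieve the gold skills.
def retrieval_recall(example, pred, trace=None):
    retrieved = set(label.strip() for label in pred.candidates.split(";"))
    gold = set(example.gold_skills)
    return len(retrieved & gold) / max(len(gold), 1)

infer_retrieve = BootstrapFewShot(metric=retrieval_recall).compile(
    InferRetrieve(esco_labels), trainset=trainset)

# Pass 2: reuse the optimized Infer predictor inside the full pipeline, then optimize
# the Rank step against the final-ranking metric (reciprocal_rank from the earlier sketch).
irera = IReRa(esco_labels)
irera.infer = infer_retrieve.infer
optimized_irera = BootstrapFewShot(metric=reciprocal_rank).compile(irera, trainset=trainset)
```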
By using the teacher model for prompt bootstrapping and the student models for the actual IReRa program, the approach achieves a balance between effectiveness and efficiency. The teacher model helps create informative few-shot prompts, while the student models provide a more lightweight and deployable solution for extreme multi-label classification tasks.
The researchers evaluated IReRa on four XMC datasets spanning biomedical and human resources domains. Using a Llama-2 model to infer, an off-the-shelf retriever, and GPT-4 to rank, IReRa achieved state-of-the-art results on three datasets without any finetuning and using orders of magnitude less data compared to specialized systems.
Adapting IReRa to a new dataset can be as simple as writing a minimal zero-shot prompt, configuring which LMs to use, and running the optimization procedure. The optimization happens automatically and can be done in just tens of minutes.
This work demonstrates that efficient in-context learning programs can achieve impressive performance on challenging XMC tasks. By declaratively specifying the program logic and optimization procedure, IReRa can be seamlessly applied to new datasets with minimal prompt engineering.
As LLMs continue to grow in capability, approaches like IReRa that flexibly combine their strengths with retrievers offer an exciting path forward. We may be moving towards a future where prompt and pipeline engineering become less brittle, and optimized modular programs can serve as highly effective general-purpose solutions.