Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Introduction:

In today's post, we delve into a recent paper on aligning Large Language Models (LLMs) with human preferences. The paper examines the reinforcement-learning-based pipeline commonly used for this purpose and proposes a simpler alternative, Direct Preference Optimization (DPO).

Background:

Large language models, like ChatGPT, have transformed the NLP landscape. They're pre-trained on massive amounts of text and then fine-tuned for specific tasks. The paper discusses a pipeline, Reinforcement Learning from Human Feedback (RLHF), commonly used for fine-tuning these LLMs.

Preliminaries:

The RLHF pipeline, initially introduced by Ziegler et al., has three main phases:

1. Supervised Fine-Tuning (SFT)
2. Preference Sampling and Reward Learning
3. Reinforcement Learning Optimization

Supervised Fine-Tuning (SFT) Phase:

Training starts from a generic pre-trained model, which is then fine-tuned on a high-quality dataset for the desired task, such as dialogue or question answering.
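As a rough illustration, here is a minimal sketch of the SFT objective (standard next-token cross-entropy) in PyTorch. The `model` and `input_ids` names are assumptions for illustration, not the paper's code; `model` is any causal LM that returns logits of shape (batch, seq_len, vocab).

```python
import torch.nn.functional as F

def sft_loss(model, input_ids):
    # Assumed interface: model(input_ids) returns next-token logits of shape (B, T, V).
    logits = model(input_ids)
    # Shift so that position t predicts token t+1.
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    target = input_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```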

Reward Modeling Phase:

After SFT, the model is prompted to generate pairs of answers for given inputs. These pairs are presented to human evaluators who indicate their preference. The human feedback is used to train a reward model, which can estimate how "good" a particular response is.

Generating Pairwise Comparisons:

For a given prompt \( x \), the model generates two potential responses \( y_1 \) and \( y_2 \). Human evaluators then rank these responses, resulting in feedback of the form \( y_w \succ y_l \) where \( y_w \) is the preferred response and \( y_l \) is the less preferred one.

Modeling Preferences:

The choice of model for \( p^*(y_1 \succ y_2 | x) \) can vary. The paper primarily uses the Bradley-Terry model, though alternatives such as the Thurstone-Mosteller model capture human preferences in different ways. The choice matters because it shapes the subsequent optimization; much like a hyperparameter, it depends on the specific application and the desired properties of the training process.

Deep Dive into the Reward Modeling Phase

The reward modeling phase is crucial for understanding the preferences between different responses and for adjusting the model's behavior based on those preferences.

Inputs and Outputs of the Reward Model

In practice, the reward model takes in a context (prompt) and a single completion and outputs a scalar reward; preferences between two candidate completions for the same prompt are then modeled by comparing their rewards. Specifically:

The SFT model generates pairs of answers for a given prompt, and human evaluators select the one they prefer. These preferences are modeled using the Bradley-Terry model: \[ p^*(y_1 \succ y_2 | x) = \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))} \] Here, \( r^*(x, y) \) is the latent (ground-truth) reward function.
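As a toy numerical check (the numbers are made up), note that the Bradley-Terry probability reduces to a sigmoid of the reward difference:

```python
import math

def bradley_terry_prob(r1: float, r2: float) -> float:
    # p*(y1 > y2 | x) = exp(r1) / (exp(r1) + exp(r2)) = sigmoid(r1 - r2)
    return 1.0 / (1.0 + math.exp(-(r1 - r2)))

print(bradley_terry_prob(1.2, 0.4))  # ~0.69: y1 is preferred about 69% of the time
```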

The reward model \( r_\phi \) is then trained by minimizing the negative log-likelihood of the observed preferences: \[ \mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \right] \] where \( \sigma \) is the logistic (sigmoid) function and \( \mathcal{D} \) is the dataset of preference triples.
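A minimal PyTorch sketch of this loss, assuming a `reward_model` that maps batches of (prompt, completion) pairs to scalar scores \( r_\phi(x, y) \); the function and argument names are illustrative, not from the paper's code:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    r_w = reward_model(prompts, chosen)    # scores for preferred completions y_w
    r_l = reward_model(prompts, rejected)  # scores for dispreferred completions y_l
    # Negative log-likelihood under Bradley-Terry: -log sigma(r_w - r_l)
    return -F.logsigmoid(r_w - r_l).mean()
```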

RL Fine-Tuning Phase:

With a reward model in hand, the LLM is further fine-tuned using reinforcement learning. The aim is to make the model produce responses that maximize the expected reward. An important aspect here is the inclusion of a Kullback-Leibler (KL) divergence term in the optimization objective, which keeps the updated policy from deviating too drastically from the reference (SFT) model.
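Concretely, the KL-regularized objective optimized in this phase is typically written as: \[ \max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)} \big[ r_\phi(x, y) \big] - \beta\, D_{\text{KL}} \big[ \pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x) \big] \] where \( \beta \) controls the strength of the KL penalty and \( \pi_{\text{ref}} \) is the SFT policy.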

A Note on KL Divergence

In the context of reinforcement learning and policy optimization:

\( D_{\text{KL}}(\pi_{\text{old}} \,\|\, \pi_{\text{new}}) \) penalizes a new policy that assigns low probability to regions where the old policy has high probability. This encourages more conservative, mass-covering updates to the policy.

\( D_{\text{KL}}(\pi_{\text{new}} \,\|\, \pi_{\text{old}}) \) penalizes a new policy that places probability mass where the old policy has little. This mode-seeking behavior can allow more aggressive updates but may be riskier (see the toy example after this list).
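A toy illustration of the asymmetry between the two directions, for two discrete distributions over three outcomes (the numbers are arbitrary and purely illustrative):

```python
import math

def kl(p, q):
    # D_KL(p || q) for discrete distributions given as lists of probabilities.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

pi_old = [0.8, 0.1, 0.1]
pi_new = [0.4, 0.4, 0.2]

print(kl(pi_old, pi_new))  # D_KL(old || new) ~= 0.35
print(kl(pi_new, pi_old))  # D_KL(new || old) ~= 0.42 -- the two directions differ
```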

In Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), the objective typically constrains or penalizes \( D_{\text{KL}}(\pi_{\text{old}} \,\|\, \pi_{\text{new}}) \) so that policy updates stay within a trust region around the old policy. This is done to ensure stability in training.

Given this, if the goal were simply to mirror the TRPO-style constraint, one might expect a term of the form \( D_{\text{KL}}(\pi_{\text{ref}}(y|x) \,\|\, \pi_{\theta}(y|x)) \). The paper, like standard RLHF objectives, instead uses the reverse direction, \( D_{\text{KL}}(\pi_{\theta}(y|x) \,\|\, \pi_{\text{ref}}(y|x)) \), which penalizes the policy for placing probability mass on completions the reference model considers unlikely and, importantly, admits a closed-form optimal policy that the DPO derivation exploits. In practice, the exact choice can influence the behavior and stability of training, and it is often guided by empirical results and the desired properties of the optimization.

Direct Preference Optimization (DPO)

The Rationale behind DPO

Reinforcement learning, especially on large models, is computationally expensive and can be difficult to stabilize. DPO offers an alternative: it optimizes the model directly on preference data, bypassing both the explicit reward-modeling step and the RL loop.

The DPO Objective

Starting from the same RL objective, DPO leverages an analytical mapping from reward functions to optimal policies. The transformation allows the loss function to be expressed over policies instead of rewards. One key equation derived is: \[ r(x, y) = \beta \log \frac{\pi_r(y | x)}{\pi_{\text{ref}}(y | x)} + \beta \log Z(x) \] Here, \( Z(x) \) is a partition function, and \( \beta \) is a hyperparameter controlling deviation from the reference policy.
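This follows from the closed-form optimum of the KL-regularized objective above: the optimal policy is the reference policy reweighted by the exponentiated reward, \[ \pi_r(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \exp\!\left( \frac{1}{\beta} r(x, y) \right), \qquad Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\!\left( \frac{1}{\beta} r(x, y) \right) \] Taking logarithms and solving for \( r(x, y) \) yields the expression above.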

The paper then reformulates the human preference probability in terms of the optimal policy: \[ p^*(y_1 \succ y_2 | x) = \frac{1}{1 + \exp \left( \beta \log \frac{\pi^*(y_2|x)}{\pi_{\text{ref}}(y_2|x)} - \beta \log \frac{\pi^*(y_1|x)}{\pi_{\text{ref}}(y_1|x)} \right)} \]
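Plugging this probability into the same negative log-likelihood loss used for reward modeling gives the DPO loss. Below is a minimal PyTorch sketch, assuming each `*_logps` tensor holds the summed log-probability \( \log \pi(y|x) \) of the chosen (\( y_w \)) or rejected (\( y_l \)) completion under the trained policy or the frozen reference policy; the names are illustrative, not from the paper's code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) / pi_ref(y_l|x)
    # -log sigma(beta * (chosen log-ratio - rejected log-ratio))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```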

The DPO Update Mechanism

Analyzing the gradient of the DPO loss function provides insight into its behavior. The gradient adjusts the model's policy to increase the likelihood of preferred completions and decrease the likelihood of dispreferred ones. Crucially, each example is weighted by how badly the implicit reward model misorders the pair, so the largest updates come from comparisons the model currently gets wrong.
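For reference, the gradient derived in the paper has the form: \[ \nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big) \big[ \nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x) \big] \Big] \] where \( \hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)} \) is the implicit reward; the sigmoid weight is largest when the implicit reward ranks the dispreferred completion above the preferred one.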

Conclusion

The paper introduces a fresh perspective on fine-tuning large language models using human feedback. By forgoing traditional RL and introducing Direct Preference Optimization, the authors present a method that's not only simpler but also holds promise in terms of efficiency and results. As LLMs like ChatGPT continue to evolve, methods like DPO could play a pivotal role in shaping their future.

Reference

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290.

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.
