The paper "Let’s Verify Step by Step" from OpenAI presents an insightful exploration into the training of large language models (LLMs) for complex multi-step reasoning tasks. Focusing on mathematical problem-solving, the authors investigate the efficacy of process supervision versus outcome supervision in training more reliable models.
The study compares two forms of supervision for training reward models: outcome supervision, which produces outcome-supervised reward models (ORMs), and process supervision, which produces process-supervised reward models (PRMs), both built on GPT-4 models.
Q: What are Out-of-Distribution (OOD) problems? A: OOD problems occur when a model encounters data during testing that is different from its training data. The model's ability to handle OOD scenarios indicates its robustness and generalizability.
Q: How are outcome-supervised reward models, process-supervised reward models, and majority voting used? A (a reranking sketch follows this list):
- ORMs are trained on the final answer of each solution and slightly outperform majority voting.
- PRMs are trained on step-by-step feedback and significantly outperform both ORMs and majority voting.
- Majority voting simply selects the most common answer among multiple sampled solutions.
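As a rough illustration of how these reward models are used at evaluation time, the sketch below reranks N sampled solutions with either an ORM-style whole-solution score or a PRM-style per-step score. The scoring functions `orm_score` and `prm_step_scores` are hypothetical placeholders, and reducing per-step scores by taking their product is one plausible choice, not necessarily the paper's exact recipe.

```python
# Hedged sketch: best-of-N reranking with an outcome- vs process-supervised
# reward model. The scoring callables are stand-ins for trained reward models.
from math import prod
from typing import Callable, List


def rerank_best_of_n(
    solutions: List[List[str]],                            # each solution = list of reasoning steps
    orm_score: Callable[[List[str]], float],               # ORM: one score for the full solution
    prm_step_scores: Callable[[List[str]], List[float]],   # PRM: per-step correctness probabilities
    use_prm: bool = True,
) -> List[str]:
    """Return the sampled solution the chosen reward model ranks highest."""
    def score(sol: List[str]) -> float:
        if use_prm:
            # Assumption: collapse per-step probabilities into a solution score
            # via their product; other reductions (e.g. the minimum) also work.
            return prod(prm_step_scores(sol))
        return orm_score(sol)

    return max(solutions, key=score)
```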
Q: What does majority voting mean? A: Majority voting samples multiple solutions from the model and chooses the most frequently produced final answer.
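A minimal sketch of that aggregation step, assuming the final answers have already been extracted from each sampled solution:

```python
# Majority voting: sample N candidate answers and return the most frequent one.
from collections import Counter
from typing import List


def majority_vote(final_answers: List[str]) -> str:
    """Pick the answer produced most often among the sampled solutions."""
    answer, _count = Counter(final_answers).most_common(1)[0]
    return answer


# Example: 5 sampled solutions, 3 of which agree on "42".
print(majority_vote(["42", "41", "42", "42", "7"]))  # -> "42"
```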
Q: How is active learning involved in process supervision? A: Active learning in process supervision directs human labeling toward the most informative samples, in particular solutions the current reward model rates highly but that reach a wrong final answer, creating a feedback loop between the model's performance and data labeling.
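The sketch below illustrates that selection idea, prioritizing "convincing wrong answers" for labeling. The data layout, the `prm_score` callable, and the availability of a reference answer are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch of active-learning selection: label the wrong-answer solutions
# the current reward model finds most convincing, since they are most informative.
from typing import Callable, List, Tuple


def select_for_labeling(
    candidates: List[Tuple[List[str], str]],     # (reasoning steps, final answer)
    prm_score: Callable[[List[str]], float],     # current reward model's solution score
    reference_answer: str,
    budget: int,
) -> List[Tuple[List[str], str]]:
    """Return up to `budget` wrong-answer solutions ranked by how convincing they look."""
    wrong = [c for c in candidates if c[1] != reference_answer]
    wrong.sort(key=lambda c: prm_score(c[0]), reverse=True)
    return wrong[:budget]
```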
Q: What's the difference between large-scale and small-scale supervision? A:
- Large-scale supervision: trains on substantial data with the aim of advancing the state of the art.
- Small-scale supervision: focuses on controlled experiments and direct, like-for-like comparisons.
The paper concludes that process supervision is the more effective method for training reliable reward models on multi-step reasoning tasks such as mathematics.
Created 2023-11-26T17:34:36-08:00, updated 2023-12-15T19:15:19-08:00