The paper proposes a new technique called "Constitutional AI" (CAI) to train AI systems like chatbots to be helpful, honest, and harmless without needing human feedback labels identifying harmful behaviors. Instead, the training relies entirely on AI-generated feedback guided by simple principles. This makes it possible to control AI behavior more precisely with far less human input.
CAI has two main stages:
1) Supervised learning: The AI critiques and revises its own responses to steer them away from harmful content, guided by natural language principles in a "constitution." The model is then fine-tuned on the revised responses.
2) Reinforcement learning: The AI compares pairs of its own responses and decides which is better according to the principles. These comparisons produce a dataset used to train a "preference model" that guides the RL training.
The end result is a helpful and harmless AI assistant trained using only AI feedback, without human labels for harmfulness.
The critique and revision steps steer the model's responses towards being more harmless, without any human involvement. The constitution consists of simple natural language principles/rules that guide this process.
So in the RL stage, human labels are used only for helpfulness, while harmlessness labels come from an AI evaluating pairs of samples. The RL objective guides the model to become more helpful and harmless.
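A minimal sketch of how a preference model is typically trained on these comparisons is below; this shows the standard pairwise ranking loss rather than the paper's exact implementation, and `reward_model` is a hypothetical scorer that maps a (prompt, response) pair to a scalar.

```python
import torch.nn.functional as F

def preference_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise ranking loss for training the preference model on
    AI-generated comparisons: push the score of the response the AI
    feedback labeled "better" above the score of the other one.

    `reward_model` is a hypothetical module mapping (prompt, response)
    text to a scalar score; its architecture is not specified here.
    """
    score_chosen = reward_model(prompt, chosen)      # score for preferred response
    score_rejected = reward_model(prompt, rejected)  # score for the other response
    # -log sigmoid(margin): loss shrinks as the preferred score pulls ahead
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

The trained preference model then provides the reward signal for the RL stage.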
Harmful Prompt: How can I get revenge on someone who wronged me?
Model Response: You could slash their tires, spread rumors about them, or post embarrassing photos online. But I'd strongly advise against revenge - it usually just makes the situation worse.
Critique (according to principle): The response recommends illegal acts like property damage and privacy violations. It does not adequately discourage revenge or point out ethical concerns.
Revised Response: Seeking revenge often leads to more harm. I'd advise speaking to a counselor or trusted friend about these feelings instead of acting on them. Revenge usually just perpetuates the cycle of harm.
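A minimal sketch of this critique-and-revision loop is below, assuming a `generate` callable that stands in for sampling from the language model and a placeholder two-principle constitution; the prompt wording is illustrative, not the paper's exact templates.

```python
import random
from typing import Callable

# Placeholder principles; the paper samples from a larger constitution.
CONSTITUTION = [
    "The response should not help with illegal, unethical, or harmful activities.",
    "The response should be as harmless and ethical as possible.",
]

def critique_and_revise(
    generate: Callable[[str], str],  # stand-in for sampling from the language model
    prompt: str,
    response: str,
    num_rounds: int = 2,
) -> str:
    """Repeatedly critique a response against a sampled principle, then revise it.

    The final revision (paired with the prompt) becomes a supervised
    fine-tuning example.
    """
    for _ in range(num_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Human: {prompt}\n\nAssistant: {response}\n\n"
            f"Critique the assistant's response according to this principle: {principle}"
        )
        response = generate(
            f"Human: {prompt}\n\nAssistant: {response}\n\nCritique: {critique}\n\n"
            "Rewrite the assistant's response to address the critique and better follow the principle."
        )
    return response  # fine-tune the model on (prompt, final revision) pairs
```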
Harmful Prompt: What's the easiest way to shoplift without getting caught?
Response A: I cannot recommend ways to steal or break the law. Perhaps we could explore why you feel the urge to shoplift? Many stores have anonymous tip lines to help prevent theft too.
Response B: Try concealing smaller items in large pockets or a bag. Also look for blind spots not covered by cameras. But I strongly advise against shoplifting - it's unethical and illegal.
Multiple Choice Question (for AI model): Based on the principle of "Choose the less harmful response", which assistant response is better?
The AI model then chooses Response A, and that preference comparison is added to the dataset for training the preference model.
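A sketch of how such a comparison might be collected programmatically is below; the prompt wording and the `choose_option` helper (which would read off whether the feedback model prefers "A" or "B") are illustrative assumptions, not the paper's exact setup.

```python
from typing import Callable

def label_comparison(
    choose_option: Callable[[str], str],  # stand-in: asks the feedback model, returns "A" or "B"
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str = "Choose the less harmful response.",
) -> dict:
    """Ask the AI feedback model which response better follows the principle,
    then store the result as a (chosen, rejected) pair for the preference model."""
    question = (
        f"Human: {prompt}\n\n"
        f"Principle: {principle}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better follows the principle, A or B?"
    )
    choice = choose_option(question)
    chosen, rejected = (response_a, response_b) if choice == "A" else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```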
So the principles are simple instructions that steer the model's self-critique and revision to make it less harmful over multiple stages of training.
Key Findings:
- As language models get bigger, their ability to identify harmful behaviors improves significantly, approaching human-level performance.
- Chain-of-thought reasoning, where the AI writes out its reasoning step by step before choosing, further improves the quality of the AI's evaluations (see the prompt sketch after this list).
- Repeated self-critique and revision successfully reduces harmfulness, and the resulting models engage with sensitive queries rather than being evasive, an issue seen in prior work.
- The final RL-trained CAI models match or exceed performance of models trained with human feedback labels.
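For the chain-of-thought variant mentioned above, the same comparison can be posed with an explicit instruction to reason before answering; the wording below is an illustrative assumption, not the paper's exact template.

```python
# Illustrative chain-of-thought evaluation prompt (not the paper's exact wording):
# the feedback model writes out its reasoning before committing to "A" or "B".
COT_COMPARISON_TEMPLATE = (
    "Human: {prompt}\n\n"
    "Principle: {principle}\n\n"
    "Response A: {response_a}\n\n"
    "Response B: {response_b}\n\n"
    "Let's think step by step about which response better follows the principle, "
    "then answer with 'A' or 'B'."
)
```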
The CAI technique offers a promising path towards controlling complex AI behaviors with minimal human input. The results suggest that as AI capabilities improve, leveraging AI feedback could become an increasingly useful technique for AI safety and alignment.
Constitutional AI: Harmlessness from AI Feedback. Link to paper
Created 2023-11-04T22:13:37-07:00, updated 2024-02-06T17:39:13-08:00