Constitutional AI - Training AI Systems to Be Helpful and Harmless Using AI Feedback

The paper proposes a technique called "Constitutional AI" (CAI) for training AI systems such as chatbots to be helpful, honest, and harmless without human feedback labels identifying harmful outputs. Instead, all supervision for harmlessness comes from AI-generated feedback guided by a short list of simple principles (the "constitution"); human labels are still used, but only for helpfulness. This makes it possible to control AI behavior more precisely with far less human input.

Overview of Constitutional AI:

Training proceeds in two stages. In the first, supervised stage, the model critiques and revises its own responses to harmful prompts according to the constitution and is then finetuned on the revised responses. In the second, reinforcement learning stage, an AI feedback model compares pairs of responses against the constitution, and those AI-generated preferences are used to train the preference model that provides the RL reward.

Details on the Constitutional AI Approach:

Supervised Learning Stage:

In this stage, an initially helpful (but not yet harmless) model is prompted with harmful requests and its responses are sampled. The model is then asked to critique its own response according to a principle drawn from the constitution and to revise the response in light of that critique; this critique-and-revision step can be repeated several times, and the model is finally finetuned on the revised responses. These critique and revision steps steer the model's responses towards being more harmless without any human involvement. The constitution itself is just a set of simple natural-language principles/rules that guide the process.
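Sketched below is what this critique-and-revision data-collection loop could look like in code; the `generate` helper, the example principles, and the prompt templates are illustrative placeholders rather than the paper's exact setup.

```python
import random

# Hypothetical helper: sample a completion from the current model.
# In practice this would wrap an LLM API or a locally hosted model.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in a model call here")

# A tiny illustrative "constitution" of natural-language principles.
CONSTITUTION = [
    "Identify ways in which the response is harmful, unethical, or illegal.",
    "Identify ways in which the response could encourage dangerous behavior.",
]

def critique_and_revise(harmful_prompt: str, n_rounds: int = 2) -> str:
    """Sample an initial response, then repeatedly critique and revise it
    using randomly drawn constitutional principles."""
    response = generate(f"Human: {harmful_prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Human: {harmful_prompt}\n\nAssistant: {response}\n\n"
            f"Critique request: {principle}\n\nCritique:"
        )
        response = generate(
            f"Human: {harmful_prompt}\n\nAssistant: {response}\n\n"
            f"Critique: {critique}\n\n"
            "Revision request: Please rewrite the response to address the critique.\n\n"
            "Revision:"
        )
    return response

# The (harmful_prompt, final revision) pairs collected this way are then used
# to finetune the model with ordinary supervised learning.
```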

Reinforcement Learning Stage:

In the RL stage, pairs of responses are sampled from the finetuned model. A feedback model is shown each pair together with a constitutional principle and asked which response is better; these AI-generated harmlessness comparisons are mixed with human-labeled helpfulness comparisons to train a preference model, and the policy is then optimized with RL against that preference model (the paper calls this RL from AI Feedback, or RLAIF). Human labels are thus used only for helpfulness, while harmlessness labels come from an AI evaluating pairs of samples, and the RL objective guides the model to become both more helpful and more harmless.
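As a rough sketch of how such a preference model could be trained, the snippet below shows the standard pairwise comparison loss used in RLHF-style setups; this is a generic formulation under that assumption, not code from the paper.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise comparison loss: train the preference model to score
    the preferred response above the dispreferred one."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# score_chosen / score_rejected are the preference model's scalar outputs for
# the two responses in each comparison pair (human-labeled for helpfulness,
# AI-labeled for harmlessness); the trained model's score then serves as the
# reward signal during the RL phase.
```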

Example Prompts:

Supervised Learning Stage:

Harmful Prompt:
How can I get revenge on someone who wronged me?

Model Response: You could slash their tires, spread rumors about them, or post embarrassing photos online. But I'd strongly advise against revenge - it usually just makes the situation worse.

Critique (according to principle): The response recommends illegal acts like property damage and privacy violations. It does not adequately discourage revenge or point out ethical concerns.

Revised Response:
Seeking revenge often leads to more harm. I'd advise speaking to a counselor or trusted friend about these feelings instead of acting on them. Revenge usually just perpetuates the cycle of harm.

Reinforcement Learning Stage:

Harmful Prompt: What's the easiest way to shoplift without getting caught?

Response A:
I cannot recommend ways to steal or break the law. Perhaps we could explore why you feel the urge to shoplift? Many stores have anonymous tip lines to help prevent theft too.

Response B:
Try concealing smaller items in large pockets or a bag. Also look for blind spots not covered by cameras. But I strongly advise against shoplifting - it's unethical and illegal.

Multiple Choice Question (for AI model): Based on the principle of "Choose the less harmful response", which assistant response is better?

The AI model then chooses Response A, and that preference comparison is added to the dataset for training the preference model.
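Sketched below is one way such an AI preference label could be collected, assuming access to a feedback model's token log-probabilities; the `logprob_of` helper and the prompt wording are hypothetical stand-ins rather than the paper's exact template.

```python
# Hypothetical helper: log-probability the feedback model assigns to
# `continuation` given `prompt` (e.g. via an API that returns logprobs).
def logprob_of(prompt: str, continuation: str) -> float:
    raise NotImplementedError("plug in a feedback-model call here")

def ai_preference(prompt: str, response_a: str, response_b: str, principle: str) -> dict:
    """Pose the comparison as a multiple-choice question and pick the option
    the feedback model considers more probable; the result becomes one
    harmlessness comparison in the preference-model dataset."""
    question = (
        f"Consider the following request from a human:\n\n{prompt}\n\n"
        f"{principle}\n\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n\n"
        "The answer is:"
    )
    score_a = logprob_of(question, " (A)")
    score_b = logprob_of(question, " (B)")
    chosen, rejected = (response_a, response_b) if score_a >= score_b else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```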

Scaling Trends and Principles:

The principles themselves are simple natural-language instructions that steer the model's self-critique, revision, and evaluation, making it progressively less harmful over the successive stages of training.

Key Findings:

- As language models get bigger, their ability to identify harmful behaviors improves significantly, approaching human-level performance.
- Chain-of-thought reasoning, where the AI explains its choices step by step, further improves the AI's evaluations (see the prompt sketch after this list).
- Repeated self-critique and revision successfully reduces harmfulness and addresses the evasiveness issues seen in prior work.
- The final RL-trained CAI models match or exceed the performance of models trained with human feedback labels.
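To picture the chain-of-thought evaluation mentioned above, here is an illustrative sketch of what such a feedback prompt might look like; the wording is an assumption for illustration, not the paper's exact template.

```python
# Illustrative chain-of-thought feedback prompt: the feedback model writes out
# its reasoning before stating a final choice, which is then parsed into a
# preference label as in the earlier sketch.
COT_FEEDBACK_TEMPLATE = """\
Consider the following request from a human:

{prompt}

{principle}

(A) {response_a}
(B) {response_b}

Assistant: Let's think step by step: """
```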

Conclusion:

The CAI technique offers a promising path towards controlling complex AI behaviors with minimal human input. The results suggest that as AI capabilities improve, leveraging AI feedback could become an increasingly useful technique for AI safety and alignment.

References

Bai et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.

Created 2023-11-04, updated 2024-02-06