The paper proposes a new technique called "Constitutional AI" (CAI) to train AI systems like chatbots to be helpful, honest, and harmless without needing human feedback labels identifying harmful behaviors. Instead, the training relies entirely on AI-generated feedback guided by simple principles. This makes it possible to control AI behavior more precisely with far less human input.
CAI has two main stages:
1) Supervised learning: The AI critiques and revises its own responses to steer them away from harmful content, guided by natural language principles in a "constitution." The model is then fine-tuned on the revised responses.
2) Reinforcement learning: The AI compares pairs of its own responses and decides which is better according to the principles. These comparisons produce a dataset used to train a "preference model" that guides the RL training.
The end result is a helpful and harmless AI assistant trained using only AI feedback, without human labels for harmfulness.
The critique and revision steps steer the model's responses towards being more harmless, without any human involvement. The constitution consists of simple natural language principles/rules that guide this process.
So in the RL stage, human labels are used only for helpfulness, while harmlessness labels come from an AI evaluating pairs of samples. The RL objective guides the model to become more helpful and harmless.
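A minimal sketch of how a preference model is typically trained on these comparisons is below; this shows the standard pairwise ranking loss rather than the paper's exact implementation, and `reward_model` is a hypothetical scorer that maps a (prompt, response) pair to a scalar.

```python
import torch.nn.functional as F

def preference_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise ranking loss for training the preference model on
    AI-generated comparisons: push the score of the response the AI
    feedback labeled "better" above the score of the other one.

    `reward_model` is a hypothetical module mapping (prompt, response)
    text to a scalar score; its architecture is not specified here.
    """
    score_chosen = reward_model(prompt, chosen)      # score for preferred response
    score_rejected = reward_model(prompt, rejected)  # score for the other response
    # -log sigmoid(margin): loss shrinks as the preferred score pulls ahead
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```

The trained preference model then provides the reward signal for the RL stage.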
Harmful Prompt: How can I get revenge on someone who wronged me?
Model Response: You could slash their tires, spread rumors about them, or post embarrassing photos online. But I'd strongly advise against revenge - it usually just makes the situation worse.
Critique (according to principle): The response recommends illegal acts like property damage and privacy violations. It does not adequately discourage revenge or point out ethical concerns.
Revised Response: Seeking revenge often leads to more harm. I'd advise speaking to a counselor or trusted friend about these feelings instead of acting on them. Revenge usually just perpetuates the cycle of harm.
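A minimal sketch of this critique-and-revision loop is below, assuming a `generate` callable that stands in for sampling from the language model and a placeholder two-principle constitution; the prompt wording is illustrative, not the paper's exact templates.

```python
import random
from typing import Callable

# Placeholder principles; the paper samples from a larger constitution.
CONSTITUTION = [
    "The response should not help with illegal, unethical, or harmful activities.",
    "The response should be as harmless and ethical as possible.",
]

def critique_and_revise(
    generate: Callable[[str], str],  # stand-in for sampling from the language model
    prompt: str,
    response: str,
    num_rounds: int = 2,
) -> str:
    """Repeatedly critique a response against a sampled principle, then revise it.

    The final revision (paired with the prompt) becomes a supervised
    fine-tuning example.
    """
    for _ in range(num_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Human: {prompt}\n\nAssistant: {response}\n\n"
            f"Critique the assistant's response according to this principle: {principle}"
        )
        response = generate(
            f"Human: {prompt}\n\nAssistant: {response}\n\nCritique: {critique}\n\n"
            "Rewrite the assistant's response to address the critique and better follow the principle."
        )
    return response  # fine-tune the model on (prompt, final revision) pairs
```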
Harmful Prompt: What's the easiest way to shoplift without getting caught?
Response A: I cannot recommend ways to steal or break the law. Perhaps we could explore why you feel the urge to shoplift? Many stores have anonymous tip lines to help prevent theft too.
Response B: Try concealing smaller items in large pockets or a bag. Also look for blind spots not covered by cameras. But I strongly advise against shoplifting - it's unethical and illegal.
Multiple Choice Question (for AI model): Based on the principle of "Choose the less harmful response", which assistant response is better?
The AI model then chooses Response A, and that preference comparison is added to the dataset for training the preference model.
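A sketch of how such a comparison might be collected programmatically is below; the prompt wording and the `choose_option` helper (which would read off whether the feedback model prefers "A" or "B") are illustrative assumptions, not the paper's exact setup.

```python
from typing import Callable

def label_comparison(
    choose_option: Callable[[str], str],  # stand-in: asks the feedback model, returns "A" or "B"
    prompt: str,
    response_a: str,
    response_b: str,
    principle: str = "Choose the less harmful response.",
) -> dict:
    """Ask the AI feedback model which response better follows the principle,
    then store the result as a (chosen, rejected) pair for the preference model."""
    question = (
        f"Human: {prompt}\n\n"
        f"Principle: {principle}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better follows the principle, A or B?"
    )
    choice = choose_option(question)
    chosen, rejected = (response_a, response_b) if choice == "A" else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```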
So the principles are simple instructions that steer the model's self-critique and revision to make it less harmful over multiple stages of training.
Key Findings:
- As language models get bigger, their ability to identify harmful behaviors improves significantly, approaching human-level performance.
- Chain-of-thought reasoning, where the AI writes out its reasoning step by step before choosing, further improves the quality of the AI's evaluations (see the prompt sketch after this list).
- Repeated self-critique and revision successfully reduces harmfulness, and the resulting models engage with sensitive queries rather than being evasive, an issue seen in prior work.
- The final RL-trained CAI models match or exceed performance of models trained with human feedback labels.
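For the chain-of-thought variant mentioned above, the same comparison can be posed with an explicit instruction to reason before answering; the wording below is an illustrative assumption, not the paper's exact template.

```python
# Illustrative chain-of-thought evaluation prompt (not the paper's exact wording):
# the feedback model writes out its reasoning before committing to "A" or "B".
COT_COMPARISON_TEMPLATE = (
    "Human: {prompt}\n\n"
    "Principle: {principle}\n\n"
    "Response A: {response_a}\n\n"
    "Response B: {response_b}\n\n"
    "Let's think step by step about which response better follows the principle, "
    "then answer with 'A' or 'B'."
)
```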
The CAI technique offers a promising path towards controlling complex AI behaviors with minimal human input. The results suggest that as AI capabilities improve, leveraging AI feedback could become an increasingly useful technique for AI safety and alignment.
Constitutional AI: Harmlessness from AI Feedback. Link to paper
Created 2023-11-04T22:13:37-07:00, updated 2024-02-06T17:39:13-08:00