BayJarvis: Blogs on safety

paper Characterizing Large Language Models Geometry for Toxicity Detection and Generation - 2024-03-18

Abstract: Large Language Models (LLMs) drive significant advancements in AI, yet understanding their internal workings remains a challenge. This paper introduces a novel geometric perspective to characterize LLMs, offering practical insights into their functionality. By analyzing the intrinsic dimension of Multi-Head Attention (MHA) embeddings and the affine mappings within layer feed-forward networks, we unlock new ways to manipulate and interpret LLMs. Our findings enable bypassing restrictions like RLHF in models such as Llama2, and we introduce seven interpretable spline features extracted from any LLM layer. These features, tested on models like Mistral-7B and Llama2, prove highly effective in toxicity detection, domain inference, and addressing the Jigsaw challenge, showcasing the practical utility of our geometric characterization. …
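The abstract's recipe — summarize the geometry of a layer's hidden states, then feed those summaries to a simple classifier — can be sketched in a few lines. The following Python is a hypothetical illustration only: the paper's seven spline features are not reproduced, a PCA variance threshold stands in as a crude proxy for intrinsic dimension, and the model choice, layer index, and two-example dataset are placeholders.

```python
# Sketch: per-layer hidden-state features for a linear toxicity probe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "mistralai/Mistral-7B-v0.1"  # placeholder; assumes a GPU is available
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def layer_features(text: str, layer: int = 16) -> torch.Tensor:
    """Mean-pooled hidden state of one layer plus an intrinsic-dimension proxy."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[layer][0]
    pooled = hs.mean(dim=0).float().cpu()
    # Crude intrinsic-dimension proxy: number of principal components needed
    # to explain 99% of the variance of this layer's token embeddings.
    centered = (hs - hs.mean(dim=0)).float().cpu()
    var = torch.linalg.svdvals(centered) ** 2
    var = var / var.sum()
    idim = int((var.cumsum(0) < 0.99).sum()) + 1
    return torch.cat([pooled, torch.tensor([float(idim)])])

# Hypothetical labels (1 = toxic); any toxicity corpus (e.g. Jigsaw) would do.
texts, labels = ["an innocuous example prompt", "a toxic example prompt"], [0, 1]
X = torch.stack([layer_features(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
```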

paper Constitutional AI - Training AI Systems to Be Helpful and Harmless Using AI Feedback - 2023-11-04

The paper proposes a technique called "Constitutional AI" (CAI) for training AI systems such as chatbots to be helpful, honest, and harmless without human-labeled examples of harmful outputs. Instead, training relies on AI-generated feedback guided by a short list of simple principles (a "constitution"). This makes it possible to steer AI behavior more precisely with far less human input. …
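The supervised stage is easy to sketch: the model drafts a response, critiques its own draft against a principle, then revises it, and the revised responses become fine-tuning data. Below is a minimal sketch of that loop, assuming a generic `llm(prompt) -> str` callable stands in for the model; the prompt and principle wording are illustrative, not the paper's exact templates.

```python
from typing import Callable

def constitutional_revision(llm: Callable[[str], str],
                            user_prompt: str,
                            principle: str) -> str:
    """One critique-and-revision pass from CAI's supervised stage (sketch)."""
    draft = llm(user_prompt)          # initial, possibly harmful draft
    critique = llm(                   # model critiques its own draft
        f"Request: {user_prompt}\nResponse: {draft}\n\n"
        f"Critique the response according to this principle: {principle}"
    )
    revision = llm(                   # model rewrites per the critique
        f"Request: {user_prompt}\nResponse: {draft}\nCritique: {critique}\n\n"
        "Rewrite the response to address the critique while remaining helpful."
    )
    return revision  # revised pairs form the supervised fine-tuning set
```

In the paper, a second stage (RLAIF) then replaces human preference labels with AI-generated comparisons to train the reward model used for reinforcement learning.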

paper Representation Engineering: Unraveling the Top-Down Approach to AI Transparency - 2023-11-02

As AI models become increasingly intricate and powerful, understanding their inner workings is no longer just a scientific pursuit but a practical necessity. Enter Representation Engineering (RepE), a fresh, top-down perspective on AI transparency: rather than dissecting individual neurons or circuits, it reads and steers the high-level representations that emerge inside a model. …
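The reading side of this approach can be sketched as a difference-of-means direction over contrastive prompts. The snippet below is an illustration under stated assumptions, not the paper's pipeline: GPT-2 stands in for a larger model, and the prompts and layer index are invented for brevity.

```python
# Sketch: derive a concept direction from contrastive prompts ("reading"),
# then score new text by projecting its activation onto that direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def last_token_hidden(text: str, layer: int = 6) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        return model(**ids, output_hidden_states=True).hidden_states[layer][0, -1]

honest = ["Pretend you are an honest person and describe your day."]
dishonest = ["Pretend you are a dishonest person and describe your day."]

# Concept direction: gap between mean activations under the two behaviors.
direction = (torch.stack([last_token_hidden(t) for t in honest]).mean(0)
             - torch.stack([last_token_hidden(t) for t in dishonest]).mean(0))
direction = direction / direction.norm()

# Higher projection = activation sits further toward the "honest" pole.
score = last_token_hidden("I promise I did not break the vase.") @ direction
```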