BayJarvis: Blogs on interpretability

paper Toy Models of Superposition - 2024-04-03

Neural networks often exhibit "polysemanticity": a single neuron responds to many unrelated concepts, which makes the network hard to interpret. This paper uses toy models to explain polysemanticity as a consequence of "superposition", in which a model stores more sparse features than it has dimensions. Key findings include: …
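
To make superposition concrete, here is a minimal sketch of the kind of toy model the paper studies, x' = ReLU(WᵀW x + b), written in plain Python. The specific weights are an illustrative assumption rather than values from the post: two features share one hidden dimension by pointing in opposite directions, and the helper name `toy_model` is hypothetical.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def toy_model(x, W, b):
    """Compute x' = ReLU(W^T W x + b).

    W is an m x n matrix (list of m rows) that compresses n features
    into m hidden dimensions; b is a length-n bias.
    """
    m, n = len(W), len(W[0])
    # Compress: h = W x
    h = [sum(W[i][j] * x[j] for j in range(n)) for i in range(m)]
    # Reconstruct: W^T h + b, then ReLU
    out = [sum(W[i][j] * h[i] for i in range(m)) + b[j] for j in range(n)]
    return relu(out)


# Illustrative assumption: two features stored in ONE hidden dimension
# as an antipodal pair (directions +1 and -1).
W = [[1.0, -1.0]]
b = [0.0, 0.0]

only_first = toy_model([1.0, 0.0], W, b)   # feature 0 active alone
only_second = toy_model([0.0, 1.0], W, b)  # feature 1 active alone
both = toy_model([1.0, 1.0], W, b)         # both active: they cancel

print(only_first, only_second, both)
```

When only one feature is active, the ReLU removes the negative interference and the input is recovered exactly. When both are active they cancel in the shared dimension, which is why this packing only pays off when features are sparse, and why the single hidden neuron looks polysemantic.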

paper Characterizing Large Language Models Geometry for Toxicity Detection and Generation - 2024-03-18

Abstract: Large Language Models (LLMs) drive significant advancements in AI, yet understanding their internal workings remains a challenge. This paper introduces a novel geometric perspective to characterize LLMs, offering practical insights into their functionality. By analyzing the intrinsic dimension of Multi-Head Attention (MHA) embeddings and the affine mappings within layer feed-forward networks, we unlock new ways to manipulate and interpret LLMs. Our findings enable bypassing restrictions like RLHF in models such as Llama2, and we introduce seven interpretable spline features extracted from any LLM layer. These features, tested on models like Mistral-7B and Llama2, prove highly effective in toxicity detection, domain inference, and addressing the Jigsaw challenge, showcasing the practical utility of our geometric characterization. …

paper Representation Engineering: Unraveling the Top-Down Approach to AI Transparency - 2023-11-02

As AI models grow more intricate and powerful, understanding their inner workings is both a scientific pursuit and a practical necessity. Representation Engineering offers a top-down approach to improving AI transparency. …