Neural networks often exhibit a puzzling phenomenon called "polysemanticity" where many unrelated concepts are packed into a single neuron, making interpretability challenging. This paper provides toy models to understand polysemanticity as a result of models storing additional sparse features in "superposition". Key findings include:
Superposition allows models to store more features than they have dimensions, at the cost of "interference" that requires nonlinear filtering. Models can noisily simulate larger, highly sparse networks.
There is a phase change between models storing features directly and storing them in superposition. Whether a feature is stored in superposition is governed by its sparsity and relative importance.
Features in superposition organize into specific geometric structures such as digons, triangles, pentagons, tetrahedrons, etc. This connects to the geometry of uniform polytopes.
Models can perform some types of computation, like computing absolute value, while features are in superposition. This suggests models may be simulating larger sparse networks.
Superposition exhibits discrete "energy level" jumps during training as features rearrange between geometric configurations. Learning dynamics involve a sequence of simple geometric transformations.
Superposition makes models more vulnerable to adversarial examples by allowing attacks on important features via interference terms. Adversarial training reduces superposition.
Privileged bases, induced by nonlinearities, cause features to align with neurons, producing a mix of monosemantic and polysemantic neurons, as seen in real networks.
Superposition deeply impacts interpretability - enumerating features is key for strong interpretability and making safety claims. Potential solutions include building models without superposition, finding overcomplete bases describing feature geometry, or hybrid approaches.
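For concreteness, a minimal sketch of the kind of toy model the paper studies is shown below: an importance-weighted reconstruction task with a tied weight matrix and a ReLU on the output. The PyTorch implementation, layer sizes, and helper names here are my own illustrative choices, not the paper's exact code.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Toy ReLU-output model: compress n_features into n_hidden dimensions."""
    def __init__(self, n_features=20, n_hidden=5):
        super().__init__()
        # One weight matrix is used both to project features into the hidden
        # space and (transposed) to reconstruct them.
        self.W = nn.Parameter(torch.randn(n_features, n_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W                            # (batch, n_hidden)
        return torch.relu(h @ self.W.T + self.b)  # (batch, n_features)

def sparse_batch(batch_size, n_features, sparsity):
    # Each feature is zero with probability `sparsity`, otherwise uniform in [0, 1].
    x = torch.rand(batch_size, n_features)
    mask = torch.rand(batch_size, n_features) >= sparsity
    return x * mask

def importance_weighted_mse(x, x_hat, importance):
    # Reconstruction loss weighted per feature by its importance I_i.
    return (importance * (x - x_hat) ** 2).mean()
```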
To quantify the degree of superposition for each feature, the paper introduces the concept of "feature dimensionality". Feature dimensionality measures the fraction of a hidden dimension that is dedicated to representing a particular feature.
The formula for feature dimensionality is:
$D_i = \frac{||W_i||^2}{\sum_j (\hat{W}_i \cdot W_j)^2}$
where $W_i$ is the weight vector corresponding to the i-th feature, and $\hat{W}_i = W_i / ||W_i||$ is that vector normalized to unit length.
Intuitively, the numerator measures how strongly the i-th feature is represented (the squared norm of its weight vector), while each term $(\hat{W}_i \cdot W_j)^2$ in the denominator measures how much feature j's weight vector projects onto feature i's direction, i.e. how much that direction is shared with (and subject to interference from) other features.
If a feature is represented by a direction orthogonal to all other features, it will have a dimensionality of 1 (it gets a full dimension to itself). If a feature is not represented at all, it will have a dimensionality of 0. And if a feature shares dimensions with other features in superposition, it will have a fractional dimensionality; for example, each feature in an antipodal pair occupying a single direction gets a dimensionality of 1/2.
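A short NumPy sketch of this computation is given below; W's rows are the per-feature weight vectors, and the function name and the small `eps` guard against division by zero are my own additions.

```python
import numpy as np

def feature_dimensionality(W, eps=1e-12):
    """Per-feature dimensionality D_i for W of shape (n_features, n_hidden)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)  # ||W_i||, shape (n_features, 1)
    W_hat = W / np.maximum(norms, eps)                # unit vectors W_i / ||W_i||
    # Denominator: squared projections of every feature vector onto direction i.
    denom = ((W_hat @ W.T) ** 2).sum(axis=1)          # sum_j (W_hat_i . W_j)^2
    return norms[:, 0] ** 2 / np.maximum(denom, eps)
```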
Feature dimensionality is a key tool for understanding how superposition evolves during training. By tracking the dimensionality of each feature over time, we can see how the model adjusts its internal representations, and how features enter and leave superposition.
It's also a way to quantify the trade-off between the number of features a model can represent and the interference between those features. A model with high feature dimensionality (close to 1 for most features) will have little superposition but may not be able to represent many features. A model with low feature dimensionality will have a lot of superposition, allowing it to represent many features but at the cost of interference.
The paper also explores how superposition evolves during training in the toy models. Two striking phenomena are observed: loss curves show discrete "energy level" jumps as features move between dimensionalities, and learning often proceeds through a sequence of simple geometric transformations of the feature arrangement.
The relationship between these jumps and superposition is complex. The jumps represent sudden transitions between different superposition configurations. During a jump, the model rearranges its internal representations, which can involve features entering or leaving superposition, or changing how they are superimposed. However, a single jump does not necessarily mean the complete elimination or introduction of superposition. It's more like a reorganization of the superposition structure, where superposition might decrease for some features and increase for others.
The key point is that these changes happen discretely, not continuously, and they have a significant impact on the model's performance (as seen in the sudden drops in loss). This suggests that the learning process in neural networks, even though it's driven by continuous optimization algorithms like gradient descent, can have discrete, quantum-like behaviors when it comes to the internal representations.
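One hedged way to observe these jumps, reusing the toy-model and dimensionality sketches above (the batch size, optimizer, and learning rate are arbitrary choices): train the model and log the loss alongside each feature's dimensionality, then look for loss drops that coincide with jumps in $D_i$.

```python
import torch

def train_and_track(model, importance, sparsity, steps=10_000, log_every=100):
    # Logs (step, loss, per-feature dimensionality) so sudden loss drops can be
    # matched against discrete changes in D_i.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    history = []
    for step in range(steps):
        x = sparse_batch(1024, importance.shape[0], sparsity)
        loss = importance_weighted_mse(x, model(x), importance)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % log_every == 0:
            D = feature_dimensionality(model.W.detach().numpy())
            history.append((step, loss.item(), D))
    return history
```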
These findings relate to previous work showing that early in training, neural networks learn linear approximations before moving to better nonlinear solutions, and that certain networks "split" feature embeddings in a hierarchical manner. Understanding the learning dynamics of superposition provides insight into how these geometric structures emerge during training.
Whether we should eliminate superposition depends on our priorities. Superposition has both benefits and drawbacks:
Benefits: superposition lets a model represent many more (sparse) features than it has dimensions, making more efficient use of limited capacity and effectively letting it simulate a larger sparse network.
Drawbacks: it introduces interference between features, produces polysemantic neurons that are hard to interpret, and increases vulnerability to adversarial examples.
If interpretability is a high priority, eliminating superposition could be beneficial. If efficiency is more important, superposition could be a useful tool. The paper suggests ways to control superposition, such as L1 regularization or adversarial training, which could help find a balance.
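As a rough illustration of the L1 idea (a sketch of how a sparsity penalty on hidden activations could be wired into the toy model above, not the paper's exact method; `l1_coeff` is an invented hyperparameter):

```python
import torch

def loss_with_l1(model, x, importance, l1_coeff=1e-3):
    # Importance-weighted reconstruction loss plus an L1 penalty on the hidden
    # activations, nudging the model toward sparser, less superposed representations.
    h = x @ model.W
    x_hat = torch.relu(h @ model.W.T + model.b)
    recon = (importance * (x - x_hat) ** 2).mean()
    return recon + l1_coeff * h.abs().mean()
```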
This work demonstrates superposition in simple models and examines implications for interpretability. Open questions remain around prevalence in large real-world models and scaling properties. Developing techniques to control and decode superposition is an important direction for future interpretability research.
Created 2024-04-03T22:24:07-07:00