Neural networks often exhibit a puzzling phenomenon called "polysemanticity" where many unrelated concepts are packed into a single neuron, making interpretability challenging. This paper provides toy models to understand polysemanticity as a result of models storing additional sparse features in "superposition". Key findings include:
Superposition allows models to store more features than they have dimensions, at the cost of "interference" that requires nonlinear filtering. Models can noisily simulate larger, highly sparse networks.
There is a phase change between models storing features directly and storing them in superposition. Whether a feature is stored in superposition is governed by its sparsity and relative importance.
Features in superposition organize into specific geometric structures such as digons, triangles, pentagons, tetrahedrons, etc. This connects to the geometry of uniform polytopes.
Models can perform some types of computation, like computing absolute value, while features are in superposition. This suggests models may be simulating larger sparse networks.
Superposition exhibits discrete "energy level" jumps during training as features rearrange between geometric configurations. Learning dynamics involve a sequence of simple geometric transformations.
Superposition makes models more vulnerable to adversarial examples by allowing attacks on important features via interference terms. Adversarial training reduces superposition.
Privileged bases, induced by nonlinearities, cause features to align with neurons, producing a mix of monosemantic and polysemantic neurons, as seen in real networks.
Superposition deeply impacts interpretability - enumerating features is key for strong interpretability and making safety claims. Potential solutions include building models without superposition, finding overcomplete bases describing feature geometry, or hybrid approaches.
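For concreteness, a minimal sketch of the kind of toy model the paper studies is shown below: an importance-weighted reconstruction task with a tied weight matrix and a ReLU on the output. The PyTorch implementation, layer sizes, and helper names here are my own illustrative choices, not the paper's exact code.

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Toy ReLU-output model: compress n_features into n_hidden dimensions."""
    def __init__(self, n_features=20, n_hidden=5):
        super().__init__()
        # One weight matrix is used both to project features into the hidden
        # space and (transposed) to reconstruct them.
        self.W = nn.Parameter(torch.randn(n_features, n_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):
        h = x @ self.W                            # (batch, n_hidden)
        return torch.relu(h @ self.W.T + self.b)  # (batch, n_features)

def sparse_batch(batch_size, n_features, sparsity):
    # Each feature is zero with probability `sparsity`, otherwise uniform in [0, 1].
    x = torch.rand(batch_size, n_features)
    mask = torch.rand(batch_size, n_features) >= sparsity
    return x * mask

def importance_weighted_mse(x, x_hat, importance):
    # Reconstruction loss weighted per feature by its importance I_i.
    return (importance * (x - x_hat) ** 2).mean()
```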
To quantify the degree of superposition for each feature, the paper introduces the concept of "feature dimensionality". Feature dimensionality measures the fraction of a hidden dimension that is dedicated to representing a particular feature.
The formula for feature dimensionality is:
$D_i = \frac{||W_i||^2}{\sum_j (\hat{W}_i \cdot W_j)^2}$
where $W_i$ is the weight vector corresponding to the i-th feature, and $\hat{W}_i = W_i / ||W_i||$ is that vector normalized to unit length.
Intuitively, the numerator measures how strongly the i-th feature is represented (the squared norm of its weight vector), while each term $(\hat{W}_i \cdot W_j)^2$ in the denominator measures how much feature j's weight vector projects onto feature i's direction, i.e. how much that direction is shared with (and subject to interference from) other features.
If a feature is represented by a direction orthogonal to all other features, it will have a dimensionality of 1 (it gets a full dimension to itself). If a feature is not represented at all, it will have a dimensionality of 0. And if a feature shares dimensions with other features in superposition, it will have a fractional dimensionality; for example, each feature in an antipodal pair occupying a single direction gets a dimensionality of 1/2.
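A short NumPy sketch of this computation is given below; W's rows are the per-feature weight vectors, and the function name and the small `eps` guard against division by zero are my own additions.

```python
import numpy as np

def feature_dimensionality(W, eps=1e-12):
    """Per-feature dimensionality D_i for W of shape (n_features, n_hidden)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)  # ||W_i||, shape (n_features, 1)
    W_hat = W / np.maximum(norms, eps)                # unit vectors W_i / ||W_i||
    # Denominator: squared projections of every feature vector onto direction i.
    denom = ((W_hat @ W.T) ** 2).sum(axis=1)          # sum_j (W_hat_i . W_j)^2
    return norms[:, 0] ** 2 / np.maximum(denom, eps)
```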
Feature dimensionality is a key tool for understanding how superposition evolves during training. By tracking the dimensionality of each feature over time, we can see how the model adjusts its internal representations, and how features enter and leave superposition.
It's also a way to quantify the trade-off between the number of features a model can represent and the interference between those features. A model with high feature dimensionality (close to 1 for most features) will have little superposition but may not be able to represent many features. A model with low feature dimensionality will have a lot of superposition, allowing it to represent many features but at the cost of interference.
The paper also explores how superposition evolves during training in the toy models. Two striking phenomena are observed: loss curves show discrete "energy level" jumps as features move between dimensionalities, and learning often proceeds through a sequence of simple geometric transformations of the feature arrangement.
The relationship between these jumps and superposition is complex. The jumps represent sudden transitions between different superposition configurations. During a jump, the model rearranges its internal representations, which can involve features entering or leaving superposition, or changing how they are superimposed. However, a single jump does not necessarily mean the complete elimination or introduction of superposition. It's more like a reorganization of the superposition structure, where superposition might decrease for some features and increase for others.
The key point is that these changes happen discretely, not continuously, and they have a significant impact on the model's performance (as seen in the sudden drops in loss). This suggests that the learning process in neural networks, even though it's driven by continuous optimization algorithms like gradient descent, can have discrete, quantum-like behaviors when it comes to the internal representations.
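One hedged way to observe these jumps, reusing the toy-model and dimensionality sketches above (the batch size, optimizer, and learning rate are arbitrary choices): train the model and log the loss alongside each feature's dimensionality, then look for loss drops that coincide with jumps in $D_i$.

```python
import torch

def train_and_track(model, importance, sparsity, steps=10_000, log_every=100):
    # Logs (step, loss, per-feature dimensionality) so sudden loss drops can be
    # matched against discrete changes in D_i.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    history = []
    for step in range(steps):
        x = sparse_batch(1024, importance.shape[0], sparsity)
        loss = importance_weighted_mse(x, model(x), importance)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % log_every == 0:
            D = feature_dimensionality(model.W.detach().numpy())
            history.append((step, loss.item(), D))
    return history
```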
These findings relate to previous work showing that early in training, neural networks learn linear approximations before moving to better nonlinear solutions, and that certain networks "split" feature embeddings in a hierarchical manner. Understanding the learning dynamics of superposition provides insight into how these geometric structures emerge during training.
Whether we should eliminate superposition depends on our priorities. Superposition has both benefits and drawbacks:
Benefits: superposition lets a model represent many more (sparse) features than it has dimensions, making more efficient use of limited capacity and effectively letting it simulate a larger sparse network.
Drawbacks: it introduces interference between features, produces polysemantic neurons that are hard to interpret, and increases vulnerability to adversarial examples.
If interpretability is a high priority, eliminating superposition could be beneficial. If efficiency is more important, superposition could be a useful tool. The paper suggests ways to control superposition, such as L1 regularization or adversarial training, which could help find a balance.
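As a rough illustration of the L1 idea (a sketch of how a sparsity penalty on hidden activations could be wired into the toy model above, not the paper's exact method; `l1_coeff` is an invented hyperparameter):

```python
import torch

def loss_with_l1(model, x, importance, l1_coeff=1e-3):
    # Importance-weighted reconstruction loss plus an L1 penalty on the hidden
    # activations, nudging the model toward sparser, less superposed representations.
    h = x @ model.W
    x_hat = torch.relu(h @ model.W.T + model.b)
    recon = (importance * (x - x_hat) ** 2).mean()
    return recon + l1_coeff * h.abs().mean()
```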
This work demonstrates superposition in simple models and examines implications for interpretability. Open questions remain around prevalence in large real-world models and scaling properties. Developing techniques to control and decode superposition is an important direction for future interpretability research.
Created 2024-04-03T22:24:07-07:00