Toy Models of Superposition

Neural networks often exhibit a puzzling phenomenon called "polysemanticity", where many unrelated concepts are packed into a single neuron, making interpretability challenging. This paper uses toy models to explain polysemanticity as a result of models storing additional sparse features in "superposition", i.e. representing more features than they have dimensions. Key findings include:

Feature Dimensionality

To quantify the degree of superposition for each feature, the paper introduces the concept of "feature dimensionality". Feature dimensionality measures the fraction of a hidden dimension that is dedicated to representing a particular feature.

The formula for feature dimensionality is:

$D_i = \frac{||W_i||^2}{\sum_j (\hat{W}_i \cdot W_j)^2}$

where $W_i$ is the weight (embedding) vector corresponding to the i-th feature, and $\hat{W}_i = W_i / ||W_i||$ is the unit-normalized version of that vector.

Intuitively, the numerator measures how strongly the feature is represented, while the denominator measures how much the i-th feature's direction overlaps with all of the feature embeddings (including its own). Any overlap beyond the feature's own $||W_i||^2$ is interference with other features and pushes the dimensionality below 1.

If a feature is given a dedicated direction orthogonal to all other features' embeddings, it will have a dimensionality of 1 (it gets a full dimension to itself). If a feature is not represented at all, it will have a dimensionality of 0. And if a feature is represented in superposition with other features, it will have a fractional dimensionality; for example, two features embedded as an antipodal pair along the same direction each get a dimensionality of 1/2.
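As a concrete check of the formula, here is a minimal NumPy sketch (not code from the paper) that computes $D_i$ for every row of a feature-embedding matrix and verifies the three cases above: an orthogonal basis, an antipodal pair sharing one dimension, and an unrepresented feature.

```python
# A minimal NumPy sketch (not code from the paper) of the dimensionality formula above.
# W has shape (n_features, m_hidden): row i is the embedding W_i of feature i.
import numpy as np

def feature_dimensionality(W: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """D_i = ||W_i||^2 / sum_j (W_hat_i . W_j)^2, with D_i ~ 0 for unrepresented features."""
    norms_sq = (W ** 2).sum(axis=1)                 # ||W_i||^2
    W_hat = W / np.sqrt(norms_sq + eps)[:, None]    # unit-normalized rows
    interference = (W_hat @ W.T) ** 2               # (W_hat_i . W_j)^2 for all pairs
    return norms_sq / (interference.sum(axis=1) + eps)

print(feature_dimensionality(np.eye(3)))                           # orthogonal: ~[1, 1, 1]
print(feature_dimensionality(np.array([[1.0], [-1.0]])))           # antipodal pair: ~[0.5, 0.5]
print(feature_dimensionality(np.array([[1.0, 0.0], [0.0, 0.0]])))  # second feature unused: ~[1, 0]
```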

Feature dimensionality is a key tool for understanding how superposition evolves during training. By tracking the dimensionality of each feature over time, we can see how the model adjusts its internal representations, and how features enter and leave superposition.

It's also a way to quantify the trade-off between the number of features a model can represent and the interference between those features. A model with high feature dimensionality (close to 1 for most features) will have little superposition but may not be able to represent many features. A model with low feature dimensionality will have a lot of superposition, allowing it to represent many features but at the cost of interference.

Learning Dynamics of Superposition

The paper also explores how superposition evolves during training in the toy models. Two striking phenomena are observed:

  1. Discrete "Energy Level" Jumps: In models with many features, learning dynamics are dominated by features jumping between different "dimensionalities" (the fraction of a dimension dedicated to a feature). These jumps correspond to sudden drops in the loss curve, suggesting that the seemingly smooth learning curves of larger models may actually be composed of many small discrete jumps between feature configurations (the training sketch at the end of this section shows one way to surface these jumps).

The relationship between these jumps and superposition is complex. The jumps represent sudden transitions between different superposition configurations. During a jump, the model rearranges its internal representations, which can involve features entering or leaving superposition, or changing how they are superimposed. However, a single jump does not necessarily mean the complete elimination or introduction of superposition. It's more like a reorganization of the superposition structure, where superposition might decrease for some features and increase for others.

The key point is that these changes happen discretely, not continuously, and they have a significant impact on the model's performance (as seen in the sudden drops in loss). This suggests that the learning process in neural networks, even though it's driven by continuous optimization algorithms like gradient descent, can have discrete, quantum-like behaviors when it comes to the internal representations.

  2. Learning as Geometric Transformations: In some cases, the learning dynamics leading to the geometric structures of superposition can be understood as a sequence of simple, independent geometric transformations. For example, with correlated features, learning proceeds through distinct regimes visible in the loss curve, each corresponding to a specific geometric transformation of the feature embeddings.

These findings relate to previous work showing that early in training, neural networks learn linear approximations before moving to better nonlinear solutions, and that certain networks "split" feature embeddings in a hierarchical manner. Understanding the learning dynamics of superposition provides insight into how these geometric structures emerge during training.
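To make the "energy level" picture concrete, here is an illustrative PyTorch sketch of the kind of experiment described above: train a toy ReLU-output model $x' = \text{ReLU}(W^T W x + b)$ on sparse features and record each feature's dimensionality during training, so that discrete jumps appear as step changes in the recorded curves. The hyperparameters (20 features, 5 hidden dimensions, 5% feature probability, decaying importances) are illustrative choices, not the paper's exact setup.

```python
# Illustrative sketch (PyTorch, not the paper's code): train the toy ReLU-output model
# x' = ReLU(W^T W x + b) on sparse features and record each feature's dimensionality
# during training; discrete "energy level" jumps show up as step changes in these curves.
import torch

n_features, m_hidden = 20, 5        # illustrative sizes, not the paper's exact setup
feature_prob = 0.05                 # probability a given feature is active in a sample
importance = 0.9 ** torch.arange(n_features).float()   # decaying feature importances

W = torch.nn.Parameter(torch.randn(n_features, m_hidden) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

def feature_dimensionality(W: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """D_i = ||W_i||^2 / sum_j (W_hat_i . W_j)^2 for each row W_i of W."""
    norms_sq = (W ** 2).sum(dim=1)
    W_hat = W / (norms_sq.sqrt() + eps).unsqueeze(1)
    return norms_sq / (((W_hat @ W.T) ** 2).sum(dim=1) + eps)

history = []                        # snapshots of all D_i over training
for step in range(10_000):
    # Sparse inputs: each feature is active with prob feature_prob, uniform in [0, 1].
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) < feature_prob)
    x_hat = torch.relu(x @ W @ W.T + b)          # h = x W, then ReLU(h W^T + b)
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        history.append(feature_dimensionality(W.detach()))

dims_over_time = torch.stack(history)   # (n_snapshots, n_features); plot each column vs. step
```

Plotting each column of `dims_over_time` against training step is one way to look for the plateaus and discrete jumps described above; if the dynamics mirror the paper's, sudden drops in the loss should line up with those jumps.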

Should We Eliminate Superposition?

Whether we should eliminate superposition depends on our priorities. Superposition has both benefits and drawbacks:

Benefits: Superposition lets a model represent many more sparse features than it has dimensions, making more efficient use of its parameters and capacity.

Drawbacks: Features interfere with one another and neurons become polysemantic, which adds noise to the model's computations and makes its internals much harder to interpret.

If interpretability is a high priority, eliminating superposition could be worthwhile; if efficiency matters more, it may be better to keep superposition and work around it. The paper suggests ways to control superposition, such as L1 regularization or adversarial training, which could help strike a balance.
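As one hedged illustration of the L1 route, the toy model's training loss could be augmented with an L1 penalty on the hidden activations. The function below is a sketch under that assumption, reusing the toy model from the previous section; `l1_coeff` is an illustrative hyperparameter, not a value from the paper, and the effect on superposition should be checked by re-measuring feature dimensionalities.

```python
# Illustrative sketch: augment the toy model's loss with an L1 penalty on the hidden
# activations. The coefficient l1_coeff and the choice to penalize the hidden layer are
# assumptions for illustration; verify the effect by re-measuring feature dimensionalities.
import torch

def toy_loss_with_l1(W: torch.Tensor, b: torch.Tensor, x: torch.Tensor,
                     importance: torch.Tensor, l1_coeff: float = 1e-3) -> torch.Tensor:
    h = x @ W                                    # hidden activations, shape (batch, m_hidden)
    x_hat = torch.relu(h @ W.T + b)              # reconstruction
    recon = (importance * (x - x_hat) ** 2).mean()
    return recon + l1_coeff * h.abs().mean()     # reconstruction error + sparsity penalty
```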

This work demonstrates superposition in simple models and examines implications for interpretability. Open questions remain around prevalence in large real-world models and scaling properties. Developing techniques to control and decode superposition is an important direction for future interpretability research.

Created 2024-04-03T22:24:07-07:00