OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
A team of researchers has released OpenMoE, a series of open-source Mixture-of-Experts (MoE) based large language models ranging from 650M to 34B parameters. Their work provides valuable insights into training MoE models and analyzing their behavior. Here are some key takeaways:
Cost-Effective Scaling with MoE
- OpenMoE models demonstrate that MoE-based LLMs can provide a more favorable cost-effectiveness tradeoff compared to dense LLMs.
- Using MoE layers in an interleaved manner (e.g. every 4-6 layers) rather than in every layer achieves a better efficiency-performance balance (see the sketch below).
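As a rough illustration of that interleaved layout, here is a minimal PyTorch sketch that places an MoE feed-forward block in every 4th layer and keeps dense FFNs elsewhere. The top-2 router, module names, and sizes are illustrative assumptions rather than OpenMoE's actual architecture, and attention sublayers are omitted.

```python
import torch
import torch.nn as nn

class DenseFFN(nn.Module):
    """Standard dense feed-forward block (also reused as a single expert)."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class MoEFFN(nn.Module):
    """Token-level top-2 mixture-of-experts feed-forward block (simplified)."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([DenseFFN(d_model, d_ff) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):                                 # x: [batch, seq, d_model]
        gates = self.router(x).softmax(dim=-1)            # [batch, seq, num_experts]
        weights, idx = gates.topk(self.top_k, dim=-1)     # top-2 experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                 # tokens routed to expert e at rank k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

def build_ffn_stack(num_layers: int = 12, d_model: int = 512, d_ff: int = 2048,
                    moe_every: int = 4) -> nn.ModuleList:
    """Dense FFN in most layers, an MoE FFN in every `moe_every`-th layer."""
    layers = nn.ModuleList()
    for i in range(num_layers):
        if (i + 1) % moe_every == 0:
            layers.append(MoEFFN(d_model, d_ff))
        else:
            layers.append(DenseFFN(d_model, d_ff))
    return layers
```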
Exploring Advanced Training Strategies
- The researchers used a higher proportion of code data (up to 52%) in pre-training compared to typical LLMs. Code may help with reasoning abilities.
- They also investigated the UL2 pre-training objective alongside standard causal language modeling, since UL2's denoising objectives align well with code data (see the sketch below).
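To make the objective mixing concrete, the sketch below chooses per example between plain next-token prediction and a UL2-style span-corruption target. The corruption parameters, sentinel ids, and the `make_example` helper are assumptions for illustration, not the paper's exact mixture-of-denoisers configuration.

```python
import random
from typing import List, Tuple

SENTINEL_BASE = 32000  # hypothetical id of the first sentinel token

def span_corrupt(tokens: List[int], rng: random.Random,
                 corruption_rate: float = 0.15,
                 mean_span: int = 3) -> Tuple[List[int], List[int]]:
    """UL2-style span corruption: mask random spans with sentinel tokens;
    the target reconstructs each span after its sentinel."""
    n = len(tokens)
    num_to_mask = max(1, int(n * corruption_rate))
    masked = [False] * n
    covered = 0
    while covered < num_to_mask:
        span_len = max(1, int(rng.expovariate(1.0 / mean_span)))
        start = rng.randrange(n)
        for i in range(start, min(n, start + span_len)):
            if not masked[i]:
                masked[i] = True
                covered += 1
    inputs, targets = [], []
    sentinel = SENTINEL_BASE
    i = 0
    while i < n:
        if masked[i]:
            inputs.append(sentinel)
            targets.append(sentinel)
            while i < n and masked[i]:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def make_example(tokens: List[int], rng: random.Random,
                 p_ul2: float = 0.5) -> Tuple[List[int], List[int]]:
    """Mix a UL2-style denoising objective with plain causal LM, chosen per example."""
    if rng.random() < p_ul2:
        return span_corrupt(tokens, rng)
    return tokens[:-1], tokens[1:]  # standard next-token prediction
```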
In-Depth MoE Routing Analysis
The paper includes a detailed study of the routing mechanism in MoE models, uncovering several novel findings:
- Context-Independent Specialization: The router tends to assign a given token to the same experts regardless of context, based mainly on the token ID rather than on semantics.
- Early Routing Learning: Token-to-expert assignments are established early in pre-training and remain largely fixed, so a token is processed by the same experts throughout training.
- Drop-towards-the-End Issue: Because each expert has a fixed capacity, tokens appearing later in a sequence are at higher risk of being dropped once their assigned expert is full. The issue is more severe on instruction-tuning data, whose distribution differs from the routing patterns learned during pre-training (see the sketch after this list).
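To see why later tokens are dropped more often, here is a small, illustrative top-1 routing loop with a fixed per-expert capacity; the function name and toy sizes are assumptions, not OpenMoE's implementation. A token can only be dropped after `capacity` earlier tokens have already filled its expert, so drops necessarily skew towards the end of the sequence.

```python
import torch

def route_with_capacity(logits: torch.Tensor, capacity: int) -> torch.Tensor:
    """Illustrative top-1 routing with a fixed per-expert capacity.

    logits: [seq_len, num_experts] router scores for one sequence, in order.
    Returns, per token, the chosen expert id, or -1 if the token is dropped
    because its expert was already full.
    """
    seq_len, num_experts = logits.shape
    chosen = logits.argmax(dim=-1)                      # top-1 expert per token
    load = torch.zeros(num_experts, dtype=torch.long)
    assignment = torch.full((seq_len,), -1, dtype=torch.long)
    for pos in range(seq_len):                          # tokens handled in sequence order
        e = chosen[pos].item()
        if load[e] < capacity:
            assignment[pos] = e
            load[e] += 1
        # else: expert buffer is full -> the token is dropped
    return assignment

# Toy demo: 16 tokens, 4 experts, capacity 3 per expert.
torch.manual_seed(0)
logits = torch.randn(16, 4)
assignment = route_with_capacity(logits, capacity=3)
dropped_positions = (assignment == -1).nonzero().flatten().tolist()
print(dropped_positions)  # dropped tokens cluster towards later positions
```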
Retrospective and Future Directions
- The authors discuss some suboptimal design choices in hindsight, such as overly aggressive code data mixing. Sharing these insights is highly valuable.
- Based on their findings, they propose future strategies like:
- Removing the trainable router after routing has been learned (see the sketch after this list)
- Parallel computation of MoE and attention
- Mixing instruction data during pre-training to improve routing
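The first of these ideas follows directly from the context-independent specialization finding: if routing depends mostly on token ID, it can be distilled into a static lookup table. The sketch below is a hypothetical illustration of that idea; the class name, shapes, and toy assignment are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FixedTokenIDRouter(nn.Module):
    """Illustrative 'router-free' routing: since token-to-expert assignments are
    largely context-independent and settle early in pre-training, they can be
    frozen into a static token-id -> expert-id lookup table."""

    def __init__(self, vocab_size: int, num_experts: int, learned_assignment: torch.Tensor):
        super().__init__()
        # learned_assignment: [vocab_size] expert id observed for each token id
        # after early pre-training with a trainable router.
        assert learned_assignment.shape == (vocab_size,)
        self.register_buffer("table", learned_assignment.long())
        self.num_experts = num_experts

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: [batch, seq_len] -> expert ids of the same shape,
        # with no router parameters and no routing computation at inference.
        return self.table[token_ids]

# Toy usage: a 100-token vocabulary mapped onto 4 experts.
vocab_size, num_experts = 100, 4
assignment = torch.randint(0, num_experts, (vocab_size,))
router = FixedTokenIDRouter(vocab_size, num_experts, assignment)
print(router(torch.tensor([[1, 5, 42], [7, 7, 99]])))
```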
In summary, OpenMoE represents an important step in open-sourcing MoE-LLMs and provides an unprecedented analysis of their inner workings. The insights gained pave the way for more efficient and effective MoE language model development going forward.