This study advocates integrating the sparse Mixture-of-Experts (MoE) architecture with instruction tuning and demonstrates its superiority over traditional dense models.
Extensive experiments across a variety of setups show that MoE models benefit markedly more from instruction tuning than comparable dense models.
The FLAN-MOE32B model highlights the efficiency and scalability of the FLAN-MOE approach, surpassing the larger FLAN-PALM62B while using roughly a third of the compute.
The success of FLAN-MOE32B prompts a reevaluation of the design principles for large-scale, high-performance language models.
The paper presents a detailed comparison of two routing strategies in MoE models, token-choice and expert-choice, which play a pivotal role in determining the model's effectiveness and efficiency.
Token-Choice Routing (FLAN-Switch and FLAN-GS): In this strategy each token selects its top-K experts, which has been shown to enhance performance across several benchmarks. For instance, MMLU-Direct accuracy improved from 38.0% to 39.9% for BASE/LARGE-sized models when more experts were activated.
Expert-Choice Routing (FLAN-EC): In this strategy each expert selects its top-K tokens. It consistently outperformed the token-choice approach across scales and tasks, underscoring the effectiveness of this routing strategy in improving model performance (both schemes are sketched below).
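To make the contrast concrete, here is a minimal numpy sketch of the two routing schemes. The function names, shapes, and use of a plain argsort are illustrative assumptions for exposition, not the FLAN-MOE implementation, which additionally handles expert capacity limits, auxiliary losses, and distributed batching.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_choice_route(tokens, router_w, k=2):
    """Token-choice: every token picks its top-k experts by router probability."""
    probs = softmax(tokens @ router_w)                    # [num_tokens, num_experts]
    expert_ids = np.argsort(-probs, axis=-1)[:, :k]       # top-k experts per token
    gates = np.take_along_axis(probs, expert_ids, axis=-1)
    gates /= gates.sum(axis=-1, keepdims=True)            # renormalize over the chosen experts
    return expert_ids, gates                              # both [num_tokens, k]

def expert_choice_route(tokens, router_w, capacity):
    """Expert-choice: every expert picks its top-`capacity` tokens by router score."""
    probs = softmax(tokens @ router_w)                    # [num_tokens, num_experts]
    scores = probs.T                                      # [num_experts, num_tokens]
    token_ids = np.argsort(-scores, axis=-1)[:, :capacity]
    gates = np.take_along_axis(scores, token_ids, axis=-1)
    return token_ids, gates                               # both [num_experts, capacity]

# Toy example: 8 tokens, model dimension 16, 4 experts.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))
router_w = rng.standard_normal((16, 4))
print(token_choice_route(tokens, router_w, k=2)[0].shape)          # (8, 2)
print(expert_choice_route(tokens, router_w, capacity=4)[0].shape)  # (4, 4)
```

The trade-off is visible in the output shapes: token-choice guarantees each token exactly k experts but can overload popular experts, whereas expert-choice guarantees each expert a fixed, balanced workload at the cost that some tokens may be selected by zero or several experts.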
The study also revealed that instruction tuning amplifies the performance of MoE models far more than that of dense models of equivalent capacity. Notably, the instruction-tuned ST32B model showed a dramatic performance increase of 45.2%, far outstripping the modest 6.6% improvement observed for FLAN-PALM62B. Furthermore, the expert-choice strategy (FLAN-EC) proved more effective than the token-choice approach (FLAN-GS), particularly when advanced auxiliary losses and pre-training strategies were integrated, as in the ST-MOE models.
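The "advanced auxiliary losses" referenced here typically mean a load-balancing term plus a router z-loss of the kind introduced with Switch Transformer and ST-MoE. The sketch below reconstructs those two standard terms for illustration; it is an assumption about the general recipe, not the exact FLAN-MOE training objective.

```python
import numpy as np

def router_aux_losses(router_logits, expert_ids, num_experts):
    """Two common MoE auxiliary losses (Switch Transformer / ST-MoE style).

    router_logits: [num_tokens, num_experts] pre-softmax router scores
    expert_ids:    [num_tokens] index of the (top-1) expert each token was sent to
    """
    # Load-balancing loss: num_experts * sum_i(f_i * P_i), where
    #   f_i = fraction of tokens dispatched to expert i
    #   P_i = mean router probability assigned to expert i
    # It is minimized when both distributions are uniform across experts.
    m = router_logits.max(axis=-1, keepdims=True)
    probs = np.exp(router_logits - m)
    probs /= probs.sum(axis=-1, keepdims=True)
    f = np.bincount(expert_ids, minlength=num_experts) / len(expert_ids)
    p = probs.mean(axis=0)
    balance_loss = num_experts * np.sum(f * p)

    # Router z-loss: mean squared log-sum-exp of the logits; it discourages
    # very large router logits and keeps routing numerically stable.
    z = m.squeeze(-1) + np.log(np.exp(router_logits - m).sum(axis=-1))
    z_loss = np.mean(z ** 2)
    return balance_loss, z_loss
```

In published MoE recipes these terms are added to the language-modeling loss with small coefficients (Switch Transformer reports roughly 1e-2 for the balancing term, ST-MoE roughly 1e-3 for the z-loss).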
The paper "Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models" marks a pivotal moment in NLP, presenting an efficient approach to language model development and setting a new benchmark for future research.
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models. Link to paper
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. Link to paper
Towards Understanding Mixture of Experts in Deep Learning. Link to paper
Fast Feedforward Networks. Link to paper
Designing Effective Sparse Expert Models. Link to paper
OpenMoE. Link to github
Implementation of mixture-of-experts. Link to github
Mixtral of Experts. Link to blog
Created 2023-12-19T10:51:31-08:00