MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identified several crucial design lessons, which are detailed in the Lessons Learned section below.

By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are state-of-the-art in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks.

Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting. We hope that these insights will remain relevant even as specific modeling components and data sources evolve.

Dataset Construction and Details

To build a rich and diverse dataset, 500M interleaved image-text documents, containing 1B images and 500B text tokens, were constructed from 3B HTML files. The collection covers a wide variety of content, including both natural images and text-rich images such as documents and charts. To ensure quality, image filtering and de-duplication were applied.
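The paper does not spell out the filtering pipeline; as a minimal sketch of the de-duplication step, assuming exact-duplicate removal by content hash (function names and design choices here are illustrative, not from the paper):

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def dedup_images(records: Iterable[Tuple[str, bytes]]) -> Iterator[Tuple[str, bytes]]:
    """Yield (doc_id, image_bytes) pairs, dropping exact-duplicate images.

    Uses a SHA-256 digest of the raw bytes; a production pipeline would
    likely add perceptual hashing and quality filters to catch
    near-duplicates and low-quality images as well.
    """
    seen = set()
    for doc_id, image_bytes in records:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an image we already kept
        seen.add(digest)
        yield doc_id, image_bytes
```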

Training Insights

During pre-training, a batch size of 512 and a maximum decoder sequence length of 4096 were used for all models. The models were trained on a mixture of interleaved image-text documents, packed image-text pairs, and text-only data. A critical finding was that allowing up to 16 images per input sequence, at 144 tokens each, yields a balanced representation of text and image tokens, which is crucial for multimodal understanding.
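The packing logic itself is not given in the paper; the sketch below shows one way the reported budgets (4096-token sequences, at most 16 images at 144 tokens each) could be enforced when greedily packing documents into training sequences. All function and variable names are hypothetical.

```python
MAX_SEQ_LEN = 4096        # maximum decoder sequence length during pre-training
MAX_IMAGES = 16           # at most 16 images per input sequence
TOKENS_PER_IMAGE = 144    # each image contributes 144 visual tokens


def pack_documents(docs):
    """Greedily pack (num_images, num_text_tokens) documents into sequences
    that respect both the token budget and the per-sequence image cap."""
    sequences, cur, cur_tokens, cur_images = [], [], 0, 0
    for num_images, num_text_tokens in docs:
        doc_tokens = num_text_tokens + num_images * TOKENS_PER_IMAGE
        if cur and (cur_tokens + doc_tokens > MAX_SEQ_LEN
                    or cur_images + num_images > MAX_IMAGES):
            sequences.append(cur)               # close the current sequence
            cur, cur_tokens, cur_images = [], 0, 0
        cur.append((num_images, num_text_tokens))
        cur_tokens += doc_tokens
        cur_images += num_images
    if cur:
        sequences.append(cur)
    return sequences
```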

Supervised Fine-Tuning (SFT) and Evaluation

The models underwent Supervised Fine-Tuning (SFT) for 10k steps using the AdaFactor optimizer. Both the image encoder and the LLM were kept unfrozen during SFT, which yielded better performance. Pre-training evaluation used few-shot prompting with greedy decoding, with task-specific stopping criteria. SFT evaluation employed a wide range of academic and recent benchmarks designed for Multimodal Large Language Models (MLLMs).
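The exact SFT training loop is not published; as a minimal sketch of the "nothing frozen" setup, assuming a PyTorch-style model with hypothetical image_encoder, connector, and llm submodules:

```python
def unfreeze_all_for_sft(model):
    """Keep the image encoder, VL connector, and LLM all trainable during SFT.

    `model` is assumed to be a torch.nn.Module exposing `image_encoder`,
    `connector`, and `llm` submodules; these attribute names are illustrative.
    """
    for module in (model.image_encoder, model.connector, model.llm):
        for p in module.parameters():
            p.requires_grad = True  # nothing is frozen during SFT
    # The trainable parameters are then handed to the optimizer as usual, e.g.:
    # optimizer = SomeOptimizer((p for p in model.parameters() if p.requires_grad), lr=...)
    return model
```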

Qualitative Examples and Model Performance

The paper includes qualitative examples showcasing the model's capabilities in tasks such as counting objects in images, reading scene text, and answering questions requiring reasoning over image content. These examples highlight the model's nuanced understanding and reasoning abilities over multimodal content.

Chapter 3: Recipe for Building MM1

Building performant Multimodal Large Language Models (MLLMs) is a highly empirical process. While the high-level architecture and training procedure are well understood, the specific implementation choices are critical. In our research, we explored design decisions across three primary axes:

  1. Architectural Decisions
  2. Data Considerations
  3. Training Methodology

Empirical Setup for Ablations

To determine optimal choices for each design axis, we used a simplified ablation setup built around a smaller base configuration: a ViT-L/14 image encoder, a C-Abstractor vision-language connector producing 144 image tokens, and a mixture of captioned, interleaved, and text-only data for pre-training.
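The C-Abstractor originates in prior work; below is a much-simplified sketch of a convolutional abstractor that pools a ViT patch grid down to 144 (12x12) visual tokens. Layer choices and dimensions are illustrative, not the actual module.

```python
import torch.nn as nn


class SimpleCAbstractor(nn.Module):
    """Toy convolutional abstractor: ViT patch grid -> 144 (12x12) visual tokens.

    A simplified stand-in for the C-Abstractor used in the ablations;
    the real module uses residual convolutional blocks rather than a single conv.
    """

    def __init__(self, vit_dim=1024, llm_dim=4096, out_grid=12):
        super().__init__()
        self.conv = nn.Conv2d(vit_dim, vit_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(out_grid)   # -> 12x12 spatial grid
        self.proj = nn.Linear(vit_dim, llm_dim)      # project into LLM embedding space

    def forward(self, patch_tokens):                 # (B, N, vit_dim), N = H*W patches
        b, n, d = patch_tokens.shape
        h = w = int(n ** 0.5)
        x = patch_tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.pool(self.conv(x))                  # (B, d, 12, 12)
        x = x.flatten(2).transpose(1, 2)             # (B, 144, d)
        return self.proj(x)                          # (B, 144, llm_dim)
```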

Lessons Learned

  1. Image Encoder Lesson: Image resolution was found to have the most significant impact on performance, followed by model size and the composition of the training data. Higher image resolution, larger encoders, and the inclusion of synthetic caption data from VeCap-300M all led to performance improvements.
  2. VL Connector Lesson: The number of visual tokens and the image resolution were critical, whereas the specific architecture of the vision-language connector had a negligible effect. This contrasts with prior literature, which had emphasized the architectural design of vision-language connectors.
  3. Data Lessons:
     - Interleaved Data Importance: Interleaved data was crucial for few-shot and text-only performance, whereas captioning data mattered most for zero-shot performance. The structure of interleaved data, with multiple images and accompanying text, is especially beneficial in few-shot scenarios.
     - Text-Only Data Utility: Incorporating text-only data improved few-shot and text-only performance by maintaining the model's language understanding capabilities.
     - Optimal Data Mixture: A balanced mixture of image and text data was key to strong multimodal and text-only performance; a 5:5:1 ratio of captioned/interleaved/text data provided a good balance (a sampling sketch follows this list).
     - Synthetic Data Benefit: High-quality synthetic caption data from VeCap significantly boosted few-shot performance, underscoring the value of synthetic training data.
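As a minimal illustration of the 5:5:1 weighting, the sampler below picks a data source per example with those odds; the real data loader is not described at this level of detail, and the source names are placeholders.

```python
import random

MIXTURE_WEIGHTS = {"caption": 5, "interleaved": 5, "text_only": 1}  # 5:5:1 mixture


def sample_source(rng=random):
    """Pick the data source for the next training example with 5:5:1 odds."""
    sources, weights = zip(*MIXTURE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]


# Over many draws this yields roughly 45% caption, 45% interleaved, ~9% text-only.
counts = {name: 0 for name in MIXTURE_WEIGHTS}
for _ in range(100_000):
    counts[sample_source()] += 1
```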

Through systematic ablations and analysis of the impact of different model components and data types, we distilled these lessons, which guided the construction of a performant MLLM, MM1. We hope these insights will aid the community in developing future models.

Chapter 5: Supervised Fine-Tuning (SFT)

This chapter discusses the SFT experiments conducted on the pre-trained MM1 models. SFT is a crucial step to refine and specialize the models on specific tasks.

SFT Data Mixture

For SFT, approximately 1M examples were gathered from a diverse set of instruction-tuning datasets.

These datasets were formatted for instruction following, mixed, and randomly sampled during training.

High-Resolution SFT

To enhance performance, we supported higher image resolutions using two methods:

  1. Positional Embedding Interpolation: Interpolating the vision transformer's positional embeddings adapts it to new input resolutions, supporting images up to 672x672 pixels (a sketch of this interpolation follows the list).
  2. Sub-Image Decomposition: For resolutions beyond 672x672, high-resolution images are decomposed into smaller crops to keep computation manageable, supporting resolutions up to 1792x1792 pixels.
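The interpolation in item 1 is commonly implemented by resizing the positional-embedding grid; the sketch below shows a generic bicubic version, assuming a square patch grid and no class token, and is not MM1's actual code.

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize ViT positional embeddings (1, old_grid**2, dim) to a new grid size.

    For example, going from 336x336 inputs with 14x14 patches (a 24x24 grid)
    to 672x672 inputs (a 48x48 grid). Assumes no class token in `pos_embed`.
    """
    _, n, dim = pos_embed.shape
    old_grid = int(n ** 0.5)
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
```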

Key Findings and Lessons Learned

Qualitative Analysis

Appendices provide qualitative examples demonstrating MM1's capabilities in interleaved image-text processing and few-shot reasoning, offering insights into the model's practical applications.

This chapter underscores the effectiveness of SFT in refining MM1's capabilities, highlighting the importance of data diversity, high-resolution support, and the benefits derived from comprehensive pre-training.
