MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons:
- For large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks, compared to other published pre-training results.
- The image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance.
By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are state-of-the-art in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks.
Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting. We hope that these presented insights will remain relevant, even as specific modeling components and data sources evolve.
Dataset Construction and Details
To build a diverse and rich corpus, a dataset of 500M interleaved image-text documents was constructed, containing 1B images and 500B text tokens. It was derived from 3B HTML files and covers a wide variety of content, including both natural images and text-rich images such as documents and charts. Image filtering and de-duplication were applied to ensure quality.
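The exact filtering and de-duplication pipeline is not spelled out. As a minimal sketch of one common approach, the snippet below removes near-duplicate images with perceptual hashing via the third-party imagehash package; the threshold and function names are illustrative assumptions, not details from MM1.

```python
# Hypothetical near-duplicate filter; MM1's actual de-duplication pipeline is not public.
from PIL import Image
import imagehash

def deduplicate(image_paths, max_hamming_distance=4):
    """Keep only images whose perceptual hash is not close to any hash already seen."""
    seen_hashes, kept = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))  # 64-bit perceptual hash
        if all(h - prev > max_hamming_distance for prev in seen_hashes):
            seen_hashes.append(h)
            kept.append(path)
    return kept
```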
Training Insights
During pre-training, a batch size of 512 with a maximum decoder sequence length of 4096 was maintained across all models. The models were trained on a mixture of interleaved documents, packed image-text pairs, and text-only data. A critical finding was that allowing up to 16 images per input sequence, corresponding to 144 tokens each, resulted in a balanced representation of text and image tokens, which is crucial for multimodal understanding.
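MM1's data loader is not public; the sketch below only illustrates how the stated constraints (4096-token sequences, at most 16 images, 144 tokens per image) could be enforced when packing interleaved documents. The sentinel id and function names are hypothetical.

```python
# Illustrative packing of interleaved documents into fixed-length training sequences.
MAX_SEQ_LEN = 4096
MAX_IMAGES_PER_SEQ = 16
TOKENS_PER_IMAGE = 144
IMAGE_PLACEHOLDER = -1  # hypothetical sentinel id marking one image-token slot

def pack_documents(docs):
    """docs: iterable of lists mixing text token ids (ints) and the string "image"."""
    sequences, current, num_images = [], [], 0
    for doc in docs:
        doc_tokens, doc_images = [], 0
        for item in doc:
            if item == "image":
                doc_tokens.extend([IMAGE_PLACEHOLDER] * TOKENS_PER_IMAGE)
                doc_images += 1
            else:
                doc_tokens.append(item)
        # Start a fresh sequence if adding this document would overflow either limit.
        if current and (len(current) + len(doc_tokens) > MAX_SEQ_LEN
                        or num_images + doc_images > MAX_IMAGES_PER_SEQ):
            sequences.append(current)
            current, num_images = [], 0
        # Overlong documents are simply truncated in this sketch.
        current.extend(doc_tokens[:MAX_SEQ_LEN])
        num_images += doc_images
    if current:
        sequences.append(current)
    return sequences
```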
Supervised Fine-Tuning (SFT) and Evaluation
The models underwent Supervised Fine-Tuning (SFT) for 10k steps using the AdaFactor optimizer. Both the image encoder and the LLM were kept unfrozen during SFT to achieve better performance. The evaluation of pre-training involved few-shot prompts and greedy decoding, with special attention to stopping criteria for different task types. SFT evaluation employed a wide range of academic and recent benchmarks specifically designed for Multimodal Large Language Models (MLLMs).
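As a rough illustration of this optimizer setup, the sketch below keeps every parameter trainable and builds an Adafactor optimizer using the Hugging Face transformers implementation; apart from the optimizer choice and the unfrozen encoder and LLM, the keyword arguments are placeholder assumptions rather than values from the paper.

```python
# Minimal sketch of the SFT optimizer setup, assuming a single PyTorch model that
# bundles the image encoder and the LLM.
from transformers.optimization import Adafactor

def build_sft_optimizer(model):
    # Both the image encoder and the LLM stay unfrozen during SFT.
    for param in model.parameters():
        param.requires_grad = True
    return Adafactor(
        model.parameters(),
        lr=None,                 # rely on Adafactor's relative-step schedule
        scale_parameter=True,
        relative_step=True,
        warmup_init=True,
    )
```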
Qualitative Examples and Model Performance
The paper includes qualitative examples showcasing the model's capabilities in tasks such as counting objects in images, reading scene text, and answering questions requiring reasoning over image content. These examples highlight the model's nuanced understanding and reasoning abilities over multimodal content.
Chapter 3: Recipe for Building MM1
Building performant Multimodal Large Language Models (MLLMs) is a highly empirical process. Although the high-level architectural design and training procedure are clear, their specific implementations matter greatly. In our research, we explored design decisions across three primary axes:
Architectural Decisions
- Image Encoders: We experimented with different pre-trained image encoders and investigated how they impact the model's performance. Both CLIP pre-trained encoders and vision-only self-supervised models like DINOv2 were considered.
- Vision-Language Connectors: The connection between the visual and textual components was another focus area. We explored different methods to integrate visual features into the language model space.
Data Considerations
- We evaluated the impact of different types of data and their mixture on model performance. Our dataset consisted of captioned images, interleaved image-text documents, and text-only data.
Training Methodology
- Our training process involved careful consideration of hyperparameters and the training stages of different parts of the model.
Empirical Setup for Ablations
To determine optimal choices for each design axis, we employed a simplified setup for ablations using a smaller base model configuration. This included a ViT-L/14 model for the image encoder, a C-Abstractor with 144 image tokens for the vision-language connector, and a mix of different data types for pre-training.
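For intuition, below is a simplified, hypothetical C-Abstractor-style connector: convolutional refinement of the ViT feature grid, adaptive average pooling down to a 12x12 grid of 144 tokens, and a linear projection into the LLM embedding space. The dimensions are placeholders, and the block approximates the idea rather than reproducing the exact architecture used in the ablations.

```python
# Simplified sketch of a C-Abstractor-style vision-language connector.
import torch
import torch.nn as nn

class CAbstractor(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=4096, num_tokens=144):
        super().__init__()
        self.grid = int(num_tokens ** 0.5)          # 12x12 output grid -> 144 tokens
        self.conv = nn.Sequential(
            nn.Conv2d(vit_dim, vit_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(vit_dim, vit_dim, kernel_size=3, padding=1),
        )
        self.pool = nn.AdaptiveAvgPool2d(self.grid)
        self.proj = nn.Linear(vit_dim, llm_dim)     # map into the LLM embedding space

    def forward(self, vit_features):
        # vit_features: (batch, num_patches, vit_dim); assumes the CLS token is removed
        # and the patch grid is square (e.g. 24x24 for ViT-L/14 at 336px).
        b, n, d = vit_features.shape
        h = int(n ** 0.5)
        x = vit_features.transpose(1, 2).reshape(b, d, h, h)
        x = self.pool(self.conv(x))                 # (b, d, 12, 12)
        x = x.flatten(2).transpose(1, 2)            # (b, 144, d)
        return self.proj(x)
```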
Lessons Learned
- Image Encoder Lesson: Image resolution was found to have the most significant impact on performance, followed by the model size and the composition of the training data. Higher image resolution, larger model sizes, and the inclusion of synthetic caption data from VeCap-300M led to performance improvements.
- VL Connector Lesson: The number of visual tokens and the image resolution were what mattered, whereas the specific architecture of the vision-language connector had a negligible effect. This finding runs contrary to prior literature, which had emphasized the importance of the connector's architectural design.
- Data Lessons:
- Interleaved Data Importance: Interleaved data was crucial for enhancing few-shot and text-only performance, whereas captioning data was more effective in improving zero-shot performance. The interleaved data's structure, containing multiple images and accompanying text, was beneficial for few-shot learning scenarios.
- Text-Only Data Utility: Incorporating text-only data helped improve few-shot and text-only performance by maintaining the language understanding capabilities of the model.
- Optimal Data Mixture: A balanced mixture of image and text data was key to achieving strong multimodal and text-only performance. A mix of roughly 5:5:1 for captioned/interleaved/text-only data provided a good balance (see the sampling sketch after this list).
- Synthetic Data Benefit: The inclusion of high-quality synthetic data from VeCap significantly boosted few-shot learning capabilities, underscoring the value of synthetic training data.
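As a toy illustration of that 5:5:1 mixture, the sampler below draws a data source for each training sequence with the corresponding weights; the source names are placeholders, and MM1's actual data loader is not public.

```python
# Illustrative source sampler for a ~5:5:1 caption / interleaved / text-only mixture.
import random

MIXTURE_WEIGHTS = {"caption": 5, "interleaved": 5, "text_only": 1}

def sample_source(rng=random):
    sources, weights = zip(*MIXTURE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Over many draws this yields roughly 45% caption, 45% interleaved, 9% text-only sequences.
```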
Through systematic ablations and analysis of the impact of various model components and data types, we distilled these lessons, which guided us in building a performant MLLM, MM1. We hope these insights will aid the community in developing future models.
Chapter 5: Supervised Fine-Tuning (SFT)
This chapter discusses the SFT experiments conducted on the pre-trained MM1 models. SFT is a crucial step to refine and specialize the models on specific tasks.
SFT Data Mixture
For SFT, approximately 1M examples were gathered from various sources to ensure diversity:
- Instruction-Response Pairs: Generated using GPT-4 and GPT-4V, focusing on conversations and complex reasoning (LLaVA-Conv, LLaVA-Complex) and detailed image descriptions (ShareGPT-4V).
- Academic Task-Oriented Datasets: This includes datasets for natural images (VQAv2, GQA, OKVQA, A-OKVQA, COCO Captions), text-rich images (OCRVQA, TextCaps), and document/chart understanding (DVQA, ChartQA, AI2D, DocVQA, InfoVQA, Synthdog-En).
- Text-Only Data: An internal dataset was used to maintain text-only instruction following capabilities, similar to ShareGPT.
These datasets were formatted for instruction following, mixed, and randomly sampled during training.
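As one illustration of the instruction-formatting step, the helper below converts a VQA-style example into a prompt/response record; the template wording, the `<image>` placeholder, and the field names are assumptions for illustration, since MM1's exact prompts are not published.

```python
# Hypothetical instruction formatting for an academic VQA example.
def format_vqa_example(image, question, answer):
    return {
        "images": [image],
        "prompt": f"<image>\n{question}\nAnswer the question using a single word or phrase.",
        "response": answer,
    }

# Example: format_vqa_example(img, "How many dogs are in the picture?", "2")
```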
High-Resolution SFT
To enhance performance, we supported higher image resolutions using two methods (sketched after the list below):
- Positional Embedding Interpolation: Adapted the vision transformer to new resolutions, allowing support for up to 672x672 pixels.
- Sub-Image Decomposition: For resolutions higher than 672x672, we decomposed high-resolution images into smaller crops to manage computational challenges, allowing support up to 1792x1792 pixels.
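Both methods can be sketched as follows, assuming a standard ViT positional-embedding layout with a leading CLS token and square inputs; this mirrors the common recipe for resolution adaptation rather than MM1's exact code, and the crop layout (a grid of 672x672 tiles plus a downsampled overview) is an assumption.

```python
# Sketches of positional-embedding interpolation and sub-image decomposition.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_grid):
    """pos_embed: (1, 1 + old_grid**2, dim) with a leading CLS token."""
    cls_token, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    dim = patch_pos.shape[-1]
    grid = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_token, grid], dim=1)

def decompose_into_crops(image, crop_size=672):
    """image: (channels, H, W) tensor with H and W divisible by crop_size."""
    c, h, w = image.shape
    crops = [image[:, i:i + crop_size, j:j + crop_size]
             for i in range(0, h, crop_size)
             for j in range(0, w, crop_size)]
    # A downsampled overview of the full image is prepended in this sketch.
    overview = F.interpolate(image.unsqueeze(0), size=(crop_size, crop_size),
                             mode="bilinear", align_corners=False).squeeze(0)
    return [overview] + crops

# Example: adapt a ViT trained at 336x336 (24x24 patches of 14px) to 672x672 (48x48),
# and split a 1344x1344 image into four 672x672 crops plus an overview.
# new_pos = interpolate_pos_embed(vit.pos_embed, new_grid=48)
# crops = decompose_into_crops(high_res_image, crop_size=672)
```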
Key Findings and Lessons Learned
- SOTA Performance: MM1-3B-Chat and MM1-7B-Chat outperformed comparable models, setting a new state of the art at their respective sizes. The MoE variants performed even better, indicating the potential of MoE for scaling.
- Impact of Image Resolution: Higher input resolutions significantly improved SFT performance, with a notable 15% relative increase at 1344x1344 resolution.
- Pre-Training Impact: The amount of pre-training data directly correlated with SFT performance, emphasizing the importance of extensive pre-training.
- Few-Shot Reasoning: MM1 exhibited robust few-shot reasoning capabilities, even in multi-image contexts, underscoring the effectiveness of interleaved data during pre-training.
- Transferability of Pre-Training Lessons: Lessons from pre-training, such as the importance of caption-only data and the minimal impact of VL connector architectures, held true during SFT as well.
Qualitative Analysis
Appendices provide qualitative examples demonstrating MM1's capabilities in interleaved image-text processing and few-shot reasoning, offering insights into the model's practical applications.
This chapter underscores the effectiveness of SFT in refining MM1's capabilities, highlighting the importance of data diversity, high-resolution support, and the benefits derived from comprehensive pre-training.