GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Training Large Language Models (LLMs) presents significant memory challenges predominantly due to the growing size of weights and optimizer states. While common memory-reduction approaches, such as Low-Rank Adaptation (LoRA), have been employed to mitigate these challenges, they typically underperform training with full-rank weights in both pre-training and fine-tuning stages. This limitation arises because these approaches restrict the parameter search to a low-rank subspace, altering training dynamics and potentially requiring a full-rank warm start.

In this work, we introduce Gradient Low-Rank Projection (GaLore), a training strategy that enables full-parameter learning while being more memory-efficient than conventional low-rank adaptation methods. GaLore reduces optimizer-state memory usage by up to 65.5% while maintaining both efficiency and performance, whether pre-training LLaMA 1B and 7B architectures on the C4 dataset or fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3% compared to a BF16 baseline.

Notably, GaLore makes it feasible to pre-train a 7B model on consumer GPUs with 24GB of memory (e.g., NVIDIA RTX 4090) without resorting to model parallelism, checkpointing, or offloading strategies. This is achieved by leveraging the slow-changing low-rank structure of the gradient matrix rather than approximating the weight matrix itself as low-rank. We demonstrate that GaLore works well in both LLM pre-training and fine-tuning, maintaining competitive performance while significantly reducing memory requirements.
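
To get an intuition for why a 24GB card becomes workable, the following back-of-envelope sketch compares optimizer-state memory under plain Adam and under a rank-r gradient projection. The concrete numbers (BF16 storage, hidden size 4096, rank 256) are assumptions chosen for illustration, not figures from the paper:

```python
# Back-of-envelope memory estimate for a 7B-parameter model.
# All numbers below are assumed for illustration (BF16 storage, hidden size
# 4096, GaLore rank 256); they are not the paper's measured figures.
GiB = 1024**3
n_params = 7e9
bytes_bf16 = 2

weights = n_params * bytes_bf16 / GiB               # ~13 GiB of BF16 weights
grads = n_params * bytes_bf16 / GiB                 # ~13 GiB of BF16 gradients

# Adam keeps two moment tensors per parameter, each the size of the weights.
adam_states_full = 2 * n_params * bytes_bf16 / GiB  # ~26 GiB

# GaLore keeps those moments only for the projected gradient: for a weight
# matrix of shape (m, n) the projected gradient is (r, n), so the states
# shrink by roughly r / m (the small m x r projectors are ignored here).
m, r = 4096, 256
adam_states_galore = adam_states_full * r / m       # ~1.6 GiB

print(f"weights:                   {weights:5.1f} GiB")
print(f"gradients:                 {grads:5.1f} GiB")
print(f"Adam states (full-rank):   {adam_states_full:5.1f} GiB")
print(f"Adam states (GaLore r={r}): {adam_states_galore:4.1f} GiB")
```

The dominant term GaLore attacks is the optimizer state; combined with the 8-bit optimizer states mentioned above, this is what brings the total footprint within reach of a 24GB card.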

1. Introduction

Large Language Models (LLMs) have revolutionized fields such as conversational AI and language translation through their impressive performance. However, pre-training and fine-tuning LLMs are not only computationally intensive but also memory-intensive, since they require storing billions of trainable parameters along with their gradients and optimizer states.

2. Related Works

Previous attempts to reduce memory usage during pre-training and fine-tuning include parameter-efficient fine-tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA). However, these methods have limitations: they often fail to match the performance of full-rank fine-tuning, or they require full-rank training as a warm start.

3. GaLore: Gradient Low-Rank Projection

GaLore takes a novel approach by exploiting the low-rank structure of the gradient matrix, allowing full-parameter learning with reduced memory usage. Both theoretical analysis and practical implementation demonstrate that GaLore's gradient low-rank projection reduces memory consumption without compromising training dynamics or performance.
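
Concretely, for a weight matrix of size m × n with gradient G_t at training step t, the one-sided form of the update can be sketched as below. Here P_t is assumed to be an m × r matrix of the top-r left singular vectors of G_t, refreshed only every T steps, and ρ_t denotes the entry-wise stateful optimizer update (e.g., Adam) applied in the compact space:

```latex
\begin{aligned}
R_t         &= P_t^{\top} G_t                 && \text{project: } (m \times n) \;\to\; (r \times n) \\
\tilde{R}_t &= \rho_t(R_t)                    && \text{optimizer update in the compact space} \\
W_{t+1}     &= W_t + \eta \, P_t \tilde{R}_t  && \text{project back and apply the full-parameter update}
\end{aligned}
```

Because P_t changes slowly, the cost of recomputing it is amortized over many steps, and the optimizer states attached to ρ_t only ever see r × n tensors.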

GaLore Algorithm

GaLore introduces an efficient training strategy through gradient low-rank projection. The algorithm can be outlined in PyTorch-like pseudocode as follows:

```python
for weight in model.parameters():
    grad = weight.grad
    # original space -> compact space
    lor_grad = project(grad)
    # update by Adam, Adafactor, etc.
    lor_update = update(lor_grad)
    # compact space -> original space
    update = project_back(lor_update)
    weight.data += update
```
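
To make the pseudocode concrete, here is a minimal, self-contained sketch of one way `project` and `project_back` could be realized, assuming the projector comes from a truncated SVD of the gradient and is refreshed every `update_proj_gap` steps. The names, the toy objective, and the plain gradient step standing in for `update` are illustrative, not the official GaLore implementation:

```python
import torch

def make_projector(grad: torch.Tensor, rank: int) -> torch.Tensor:
    """Return an (m, rank) matrix of the top-`rank` left singular vectors of `grad`."""
    U, _, _ = torch.linalg.svd(grad.float(), full_matrices=False)
    return U[:, :rank].to(grad.dtype)

def project(grad: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """Original (m, n) space -> compact (rank, n) space."""
    return P.T @ grad

def project_back(lor_update: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """Compact (rank, n) space -> original (m, n) space."""
    return P @ lor_update

# Toy usage on a single weight matrix, with plain gradient descent as the inner `update`.
torch.manual_seed(0)
m, n, rank, lr = 64, 128, 4, 1e-2
weight = torch.randn(m, n, requires_grad=True)
target = torch.randn(m, n)

P = None
update_proj_gap = 200  # refresh the projector every 200 steps (assumed value)
for step in range(1000):
    loss = ((weight - target) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        if step % update_proj_gap == 0:
            P = make_projector(weight.grad, rank)
        lor_grad = project(weight.grad, P)
        lor_update = -lr * lor_grad          # stand-in for Adam, Adafactor, etc.
        weight += project_back(lor_update, P)
        weight.grad = None
```

With Adam or Adafactor in place of the plain gradient step, the optimizer's moment tensors would be created at the (rank, n) shape of `lor_grad`, which is where the optimizer-state savings described above come from.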

Unlike parameter-efficient adaptation methods, this algorithm still updates all model parameters; only the optimizer states and the update computation live in the compact space. By leveraging low-rank projections of the gradients, GaLore reduces the memory footprint significantly, making it feasible to train large models on consumer-grade hardware.

4. Experimental Results

Our experiments show that GaLore achieves memory-efficient pre-training and fine-tuning on LLaMA architectures and GLUE tasks, outperforming existing low-rank methods while preserving competitive performance. GaLore's compatibility with various optimizers, including memory-efficient ones like 8-bit Adam, further demonstrates its versatility and effectiveness in reducing memory usage.
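
This optimizer compatibility follows from the structure of the pseudocode above: `update` is a black box that only ever sees the projected gradient, so its state tensors live at the compact (rank, n) shape. A hand-rolled Adam step is sketched below to make that explicit; it is an illustration, not the paper's 8-bit implementation (which additionally quantizes these states):

```python
import torch

def adam_update(lor_grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam step whose moment tensors live at the projected (rank, n) shape."""
    if not state:
        state["m"] = torch.zeros_like(lor_grad)  # first moment, shape (rank, n)
        state["v"] = torch.zeros_like(lor_grad)  # second moment, shape (rank, n)
        state["t"] = 0
    b1, b2 = betas
    state["t"] += 1
    state["m"].mul_(b1).add_(lor_grad, alpha=1 - b1)
    state["v"].mul_(b2).addcmul_(lor_grad, lor_grad, value=1 - b2)
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return -lr * m_hat / (v_hat.sqrt() + eps)

# Usage inside the GaLore loop, with one state dict kept per weight matrix:
state = {}
lor_grad = torch.randn(256, 4096)        # a (rank, n) projected gradient
lor_update = adam_update(lor_grad, state)
```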
