A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA

Low-Rank Adapters (LoRA) have emerged as a popular parameter-efficient fine-tuning method for large language models. By adding trainable low-rank "adapters" to selected layers, LoRA enables effective fine-tuning while dramatically reducing the number of parameters that need to be trained. However, the conventional LoRA method multiplies each adapter by a scaling factor that is inversely proportional to its rank. A new paper by researcher Damjan Kalajdzievski shows that this rank-dependent scaling actually slows down learning and limits the performance gains available from higher-rank adapters.
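
To make the setup concrete, here is a minimal PyTorch sketch of a LoRA-adapted linear layer with the conventional alpha / rank scaling. The class name, initialization scheme, and default hyperparameters are illustrative assumptions on my part, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer (conventional scaling)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        # Trainable low-rank factors: the effective weight update is scaling * B @ A
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init so training starts from the base model
        self.scaling = alpha / rank  # conventional LoRA: divide by the rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```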

Kalajdzievski shows mathematically that, for learning to remain stable as the rank grows, LoRA adapters should be scaled by the inverse square root of the rank rather than by the inverse of the rank itself. He proposes a modified method called rank-stabilized LoRA (rsLoRA) that uses this square-root scaling factor.
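
In code, the proposed change amounts to a one-line swap of the scaling factor. The helper below is only a sketch to highlight the difference; the function name and arguments are my own, not the paper's.

```python
import math

def lora_scaling(alpha: float, rank: int, rank_stabilized: bool = True) -> float:
    """Adapter scaling factor: conventional LoRA uses alpha / r,
    while rsLoRA uses alpha / sqrt(r) so the adapter's contribution
    (and its gradient) stays well-behaved as the rank grows."""
    if rank_stabilized:
        return alpha / math.sqrt(rank)  # rsLoRA
    return alpha / rank                 # conventional LoRA
```

If I remember correctly, recent versions of Hugging Face's PEFT library expose the same choice through a `use_rslora` flag on `LoraConfig`, so switching to the square-root scaling should not require custom code.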

Through experiments fine-tuning large language models, Kalajdzievski demonstrates that rsLoRA enables stable and effective learning even with very high adapter ranks (up to 2048 in the experiments). With conventional LoRA, by contrast, increasing the adapter rank provides little to no improvement, because the overly aggressive scaling causes the adapter gradients to collapse as the rank grows.

The rsLoRA method provides an easy way to trade extra computational cost during training for improved fine-tuning performance by using higher-rank adapters. Importantly, this incurs no additional cost during inference, since the adapters are merged into the original model weights after training. So one can simply use the highest rank that fits the available memory budget during fine-tuning to maximize performance.
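
To illustrate why inference cost is unchanged, here is a small sketch of folding a trained rsLoRA adapter back into the frozen weight matrix; the function and tensor names are hypothetical.

```python
import math
import torch

@torch.no_grad()
def merge_rslora_adapter(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                         alpha: float, rank: int) -> torch.Tensor:
    """Fold a trained adapter into the frozen weight: W' = W + (alpha / sqrt(r)) * B @ A.
    After merging, the layer is a plain dense layer with the original shape,
    so inference is exactly as fast as the base model."""
    return W + (alpha / math.sqrt(rank)) * (B @ A)
```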

This work shows that contrary to the belief that very low-rank adapters are sufficient for LoRA, using higher ranks with proper scaling can boost fine-tuning quality. It also motivates further research into how the dimensionality of the fine-tuning manifold relates theoretically and empirically to downstream performance.

Overall, rank-stabilized LoRA looks like a promising enhancement to the already highly successful LoRA method. It should allow more flexibility to improve language model fine-tuning, especially in scenarios where extra compute can be devoted to the fine-tuning process to eke out the best possible performance. I'm excited to see it further analyzed and put to use!

Created 2024-03-14T15:03:27-07:00, updated 2024-03-16T07:35:17-07:00