A guest post by Rachel Shalom

In this blog, I aim to unravel the intricacies of Low Rank Adaptation (LoRA) and offer you a clear insight into why this method is effective. Rest assured, I’ll equip you with all the mathematical knowledge necessary to grasp LoRA comprehensively, but first, let’s delve into the necessity of LoRA.

# The Challenges with Fine-tuning

Training LLMs is computationally intensive. Full fine-tuning requires memory not just to store the model, but also for various other quantities that are needed during the training process.

The model’s weights alone can grow to hundreds of gigabytes for very large models (like Falcon 180B and Llama 2 70B). Even if your computer can hold them in memory, you must also be able to allocate memory for optimizer states, gradients, forward activations and temporary buffers throughout the training process.

These additional components can easily exceed the model’s size several times over, quickly outgrowing the capacity of standard consumer hardware.
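To get a feel for these numbers, here is a rough back-of-the-envelope sketch. It assumes 32-bit values (4 bytes each), one gradient per weight, and an Adam-style optimizer that keeps two extra values per weight. These byte counts are illustrative assumptions, not measurements from a specific training setup, and activations and temporary buffers are left out entirely.

```python
def full_finetune_memory_gb(n_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Rough memory estimate in GB (1 GB = 1e9 bytes), ignoring activations."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    optimizer = n_params * bytes_per_value * optimizer_states_per_param
    total = weights + gradients + optimizer
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "optimizer": optimizer / 1e9,
        "total": total / 1e9,
    }

# A 70-billion-parameter model under the assumptions above.
estimate = full_finetune_memory_gb(70_000_000_000)
print(estimate)  # weights: 280.0 GB, total: 1120.0 GB
```

Under these assumptions, the extra training state is three times the size of the weights themselves.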

In contrast to full fine-tuning, where every model weight is updated during supervised learning, parameter-efficient fine-tuning methods like LoRA update only a small set of parameters.

As a result, the number of trained parameters/weights is much smaller than the original LLM’s weights. This makes memory requirements for training much more manageable.

In the next part of this blog, we’ll explore matrix properties like rank and factorization. If math isn’t your cup of tea (oh dear!), feel free to skip this section and focus on the more intuitive aspects of matrix rank.

# Why do we care about Matrix Rank? An intuition

Generally speaking, datasets can be represented as matrices. The rank of a matrix provides insights into the underlying relationships and dependencies within this data.

A higher rank indicates that the data has a more complex structure, with many linearly independent features or patterns. This means that the data contains a significant amount of unique information, and reducing its dimensionality without losing important insights can be challenging.

On the other hand, a lower rank suggests that the data is more compressible, meaning that there are dependencies and redundancies within the data. In practical terms, this can be leveraged for dimensionality reduction and data compression, allowing you to represent the data more efficiently without losing crucial information.
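Here is a minimal NumPy sketch of that idea. The dataset is made up for illustration: three columns, where the third is just the sum of the first two, so it adds no new information and the rank is lower than the number of columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: two independent features and one redundant feature.
math_score = rng.integers(0, 101, size=200)
reading_score = rng.integers(0, 101, size=200)
total_score = math_score + reading_score  # fully determined by the other two

data = np.column_stack([math_score, reading_score, total_score]).astype(float)

rank = int(np.linalg.matrix_rank(data))
print(data.shape, rank)  # 3 columns, but only 2 of them carry unique information
```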

The same holds for LLMs. They are basically a set of very large matrices multiplied by each other. The parameters (matrices) represent the learned knowledge that these models have about the world, and they can sometimes be redundant, meaning it’s possible to represent the same knowledge with fewer parameters. And this is what LoRA is all about.

# Refresh some Linear Algebra concepts

Before we delve into the intricacies of LoRA, let’s take a step back to revisit some fascinating concepts related to vectors and matrices (exciting, isn’t it?).

1. Linearly Dependent Vectors

A set of vectors is linearly dependent when at least one vector in the set can be expressed as a linear combination of the others. For two vectors, this means one is a scalar multiple of the other. For example, consider two vectors v1 and v2 where v2 = 2*v1.

These vectors are linearly dependent because you can express v2 as a scalar multiple of v1.

On the other hand, linearly independent vectors are vectors that “stand on their own” and are not redundant in terms of their linear relationship. They contribute unique information to the vector space and none can be represented as a combination of others.
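To make this concrete, here is a small NumPy sketch. The vectors are made up for illustration (they are not taken from the figures in this post), and `are_linearly_independent` is a hypothetical helper that stacks the vectors into a matrix and checks whether its rank equals the number of vectors.

```python
import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = 2 * v1                       # a scalar multiple of v1, so dependent on it
w = np.array([0.0, 1.0, 0.0])     # not a multiple of v1

def are_linearly_independent(*vectors):
    """True if no vector is a linear combination of the others."""
    stacked = np.vstack(vectors)
    return bool(np.linalg.matrix_rank(stacked) == len(vectors))

print(are_linearly_independent(v1, v2))  # False
print(are_linearly_independent(v1, w))   # True
```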

2. Matrix Rank

Matrix rank is the number of linearly independent columns (or rows) in a given matrix. It is also the dimension of the vector space spanned by the matrix columns.

It can be proven that the number of linearly independent rows in a matrix is equal to the number of linearly independent columns.

So if we have a matrix V with m rows and n columns, then:

$$\operatorname{rank}(V) \le \min(m, n)$$

Let’s look at the following example of a matrix V with 3 rows and 3 columns

Notice that the third row vector v3 can be obtained as a combination of v1 and v2, so vector v3 is linearly dependent on v1 and v2.
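A quick way to check this kind of dependency numerically is sketched below. The 3×3 matrix is an illustrative stand-in (not the matrix from the original figure): its third row is the sum of the first two. The sketch also confirms that counting independent rows or independent columns gives the same rank.

```python
import numpy as np

V = np.array([
    [1.0, 0.0, 2.0],   # v1
    [0.0, 1.0, 1.0],   # v2
    [1.0, 1.0, 3.0],   # v3 = v1 + v2
])

row_rank = int(np.linalg.matrix_rank(V))
col_rank = int(np.linalg.matrix_rank(V.T))  # rank of the transpose

print(np.allclose(V[2], V[0] + V[1]))  # True: v3 is a combination of v1 and v2
print(row_rank, col_rank)              # 2 2
```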

Matrices can be classified based on their Rank:

A. Full Rank Matrix is when all rows or all columns of a matrix are linearly independent. In mathematical notation:

$$\operatorname{rank}(V) = \min(m, n)$$

B. Rank Deficient Matrix is when one or more rows in the matrix are a combination of other rows. In mathematical notation:

$$\operatorname{rank}(V) < \min(m, n)$$

Here is a good example of a rank deficient matrix, V, based on our example before:

You can clearly see that v2 = 2*v1, and hence V is rank deficient.

C. The critical concept to grasp is that a Low Rank Matrix is characterized by having a significantly smaller number of linearly independent rows or columns compared to the total number of rows or columns. In mathematical notation:

$$\operatorname{rank}(V) \ll \min(m, n)$$
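The three cases can be told apart numerically, as in the sketch below. `describe_rank` is a hypothetical helper and the matrices are made up for illustration. Since "low rank" has no hard threshold, the helper only separates full rank from rank deficient, and the last example shows a matrix whose rank is far below its smallest dimension.

```python
import numpy as np

def describe_rank(M):
    """Return (rank, label) comparing the rank with min(rows, columns)."""
    m, n = M.shape
    r = int(np.linalg.matrix_rank(M))
    label = "full rank" if r == min(m, n) else "rank deficient"
    return r, label

full = np.eye(3)                               # all rows independent
deficient = np.array([[1.0, 2.0], [2.0, 4.0]]) # second row = 2 * first row

# A 200 x 100 matrix built from two thin factors, so its rank is at most 2.
rng = np.random.default_rng(0)
low = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 100))

print(describe_rank(full))       # (3, 'full rank')
print(describe_rank(deficient))  # (1, 'rank deficient')
print(describe_rank(low))        # rank 2, far below min(200, 100) = 100
```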

Now that we’ve established an understanding of what a low-rank matrix is, let’s explore two more key concepts:

3. Rank of matrix multiplication

If we have 2 matrices A and B with ranks rank(A) and rank(B), respectively, then the rank of their product is bounded by the minimum of their individual ranks. In mathematical notation:

$$\operatorname{rank}(AB) \le \min(\operatorname{rank}(A), \operatorname{rank}(B))$$
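This bound is easy to check numerically. In the sketch below, the matrices are random and their sizes are arbitrary choices for illustration: A is built to have rank 2 and B to have rank 3, so their product cannot have rank above 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build A (5 x 4) with rank 2 and B (4 x 6) with rank 3 from thin factors.
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))
B = rng.normal(size=(4, 3)) @ rng.normal(size=(3, 6))

rank_A = int(np.linalg.matrix_rank(A))
rank_B = int(np.linalg.matrix_rank(B))
rank_AB = int(np.linalg.matrix_rank(A @ B))

print(rank_A, rank_B, rank_AB)
print(rank_AB <= min(rank_A, rank_B))  # True
```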

4. Low Rank Matrix Decomposition

A matrix V with rank r can be represented as a product of 2 smaller matrices:

$$V = CF$$

• V is the original matrix of size m×n.
• C is a matrix of size m×r.
• F is a matrix of size r×n.

Then V can be factorized into the product of C and F, where r is the rank of V (it can be proven that every finite matrix has a rank decomposition, and techniques like SVD (singular value decomposition) can be used to compute one).
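As a minimal sketch of how SVD can produce such a factorization, the code below builds a random rank-2 matrix (the sizes are arbitrary), keeps only the r leading singular directions, and checks that the product of the two thin factors recovers V. Folding the singular values into C is one possible convention, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, true_rank = 6, 5, 2

# A 6 x 5 matrix of rank 2.
V = rng.normal(size=(m, true_rank)) @ rng.normal(size=(true_rank, n))

U, s, Vt = np.linalg.svd(V, full_matrices=False)
r = int(np.linalg.matrix_rank(V))

C = U[:, :r] * s[:r]   # m x r, singular values folded into the columns
F = Vt[:r, :]          # r x n

print(C.shape, F.shape)        # (6, 2) (2, 5)
print(np.allclose(C @ F, V))   # True
```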

With that, we have covered the most important concepts needed to understand LoRA. So let’s dive into LoRA and explore how it leverages these principles in the context of fine-tuning large AI models.

# Revisiting Fine-Tuning

Fine-tuning is the method through which we input data into our network, such as an LLM, and then adjust the weights using the weight updates obtained through backpropagation. Think of it as a training process, but we start with pre-trained weights as our initial point.

We begin by feeding data through the network. Afterwards, we compute weight updates (referred to as deltas) using the backpropagation process. These weight updates are then added to the existing network weights to yield new weights. We repeat this process iteratively until we achieve the desired outcome.
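The loop described above can be sketched on a toy problem. This is an illustrative stand-in for a real network: a single linear layer with a mean squared error loss, synthetic data, and a gradient written out by hand instead of a backpropagation library. The point is only the update rule, where each step computes a delta and adds it to the current weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for a single linear layer: Y = X @ W_true.T
X = rng.normal(size=(64, 4))
W_true = rng.normal(size=(3, 4))
Y = X @ W_true.T

W = rng.normal(size=(3, 4))  # stands in for the pre-trained weights

def loss_and_grad(W):
    err = X @ W.T - Y                         # forward pass and error
    loss = np.mean(np.sum(err ** 2, axis=1))  # mean squared error
    grad = 2.0 * err.T @ X / len(X)           # gradient of the loss w.r.t. W
    return loss, grad

initial_loss, _ = loss_and_grad(W)
learning_rate = 0.1
for _ in range(100):
    _, grad = loss_and_grad(W)
    delta_W = -learning_rate * grad  # the weight update (delta)
    W = W + delta_W                  # merge the delta into the weights
final_loss, _ = loss_and_grad(W)

print(initial_loss, final_loss)
```

Note that `delta_W` has exactly the same shape as `W`, so full fine-tuning has to compute and store an update for every single weight.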

# LoRA

So where does LoRA fit into this?

Numerous prior studies have demonstrated that pretrained models exhibit remarkably low intrinsic dimensionality. In simpler terms, they can be accurately represented using significantly fewer dimensions or parameters than they originally possess.

In the LoRA paper, the authors hypothesize that the change in weights during model adaptation (training) also has a low intrinsic rank/dimension. That means that these delta weights are low rank matrices, meaning that we don’t need all of the weight parameters to describe everything that’s going on.

So they use matrix decomposition to represent this very large update matrix as a product of potentially much smaller matrices.

These matrices can be much smaller than the original matrix while still representing (approximately) the same update.

Why does this make sense?
Large models are trained to capture the general representation of their domain (language for LLMs, audio + language for models like Whisper, and vision for image generation models). These models capture a variety of features which allow them to be used for diverse tasks with reasonable zero-shot accuracy. However, when adapting such a model to a specific task or dataset, only a few features need to be emphasized or re-learnt. This means that the update matrix (ΔW) can be a low-rank matrix.

# Methodology

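The technique uses decomposition to represent the change in weights ΔW as a product of 2 low rank matrices B and A that are much smaller than the original weight matrix: ΔW = BA. For a weight matrix of size d×k and a chosen rank r, B has size d×r and A has size r×k.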

Now, during fine-tuning, all of the original model parameters are frozen and only a pair of small rank decomposition matrices is trained.
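Here is a toy sketch of that training setup, under several assumptions: a single linear layer, synthetic data whose ideal weight change really is low rank, hand-written gradients, and plain gradient descent. The names `W0`, `A` and `B` and all sizes are illustrative. Starting B at zero and A at random values follows the initialization described in the LoRA paper, so the model begins exactly at the pre-trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, n = 16, 8, 2, 128          # toy sizes, chosen for illustration

W0 = rng.normal(size=(d, k))        # frozen "pre-trained" weights
W0_before = W0.copy()

# LoRA factors: A starts random, B starts at zero, so B @ A = 0 initially.
A = 0.5 * rng.normal(size=(r, k))
B = np.zeros((d, r))

# Synthetic task whose ideal weight change is a rank-r matrix.
delta_true = 0.1 * rng.normal(size=(d, r)) @ rng.normal(size=(r, k))
X = rng.normal(size=(n, k))
Y = X @ (W0 + delta_true).T

def loss_fn(A, B):
    err = X @ (W0 + B @ A).T - Y
    return np.mean(np.sum(err ** 2, axis=1)), err

initial_loss, _ = loss_fn(A, B)
learning_rate = 0.02
for _ in range(500):
    _, err = loss_fn(A, B)
    G = 2.0 * err.T @ X / n          # gradient w.r.t. the full d x k update
    grad_B = G @ A.T                 # chain rule through delta_W = B @ A
    grad_A = B.T @ G
    B = B - learning_rate * grad_B   # only A and B are updated;
    A = A - learning_rate * grad_A   # W0 is never touched
final_loss, _ = loss_fn(A, B)

print(initial_loss, final_loss)
print(np.array_equal(W0, W0_before))  # True: the base weights stayed frozen
```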

At inference, the two low rank matrices are multiplied together to create a matrix with the same dimensions as the original weights. You then add this to the original weights and replace them in the model with these updated values, and now you have a model with the updated weights!
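The merge step can be sketched as follows. The matrices are random placeholders (not trained weights) and the names `W0`, `B` and `A` are illustrative. The check at the end shows that merging the product into the base weights gives the same output as keeping the two low rank matrices separate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 64, 8

W0 = rng.normal(size=(d, k))   # original (frozen) weights
B = rng.normal(size=(d, r))    # placeholder for the trained d x r matrix
A = rng.normal(size=(r, k))    # placeholder for the trained r x k matrix
x = rng.normal(size=(k,))      # one input vector

# Option 1: keep the low rank path separate.
h_unmerged = W0 @ x + B @ (A @ x)

# Option 2: merge once, then use a single matrix like the original model.
W_merged = W0 + B @ A
h_merged = W_merged @ x

print(W_merged.shape == W0.shape)         # True
print(np.allclose(h_unmerged, h_merged))  # True
```

After merging, the model has exactly as many weights as before, so the forward pass does no extra work compared with the original model.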

# What are the main advantages of LoRA?

## Significant Reduction in Computational Complexity.

Let’s assume we have a weight matrix of the base transformer model (as presented in the Attention Is All You Need paper):

• The weight matrix has dimensions d×k = 512×64
• So each weight matrix has 512×64 = 32,768 trainable parameters
• If we use LoRA with rank r = 8, we instead train 2 smaller matrices A and B
• A has dimensions r×k = 8×64, which is 512 trainable parameters
• B has dimensions d×r = 512×8, which is 4,096 trainable parameters
• Together that is 512 + 4,096 = 4,608 trainable parameters instead of 32,768. And there you go: an 86% reduction in parameters to train!
• This means you can (often) fine-tune your model on a single(!) GPU and avoid the need for a distributed cluster of GPUs.
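The arithmetic above can be reproduced with a few lines of Python. `lora_param_counts` is a hypothetical helper written for this post, not part of any library.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters for a full d x k update vs. a rank-r LoRA update."""
    full = d * k            # every weight is trainable
    lora = d * r + r * k    # B (d x r) plus A (r x k)
    reduction = 1 - lora / full
    return full, lora, reduction

full, lora, reduction = lora_param_counts(d=512, k=64, r=8)
print(full, lora, f"{reduction:.1%}")  # 32768 4608 85.9%
```

Because the LoRA count grows with d + k rather than d×k, the saving gets larger as the weight matrices get bigger.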

LoRA allows for a pre-trained model to be shared and used to build many small modules for different tasks. This reduces the storage requirement and task-switching overhead significantly.

Consider a scenario in which you’ve trained a single foundational model for six different downstream tasks using LoRA. To transition between these tasks during both training and inference, all you need to do is substitute compact matrices.

Isn’t it intriguing that your customers can select their desired task during inference, and all you require is one foundational model that can exchange matrices based on your customers’ preferences? This is a remarkable advantage of using LoRA!
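A minimal sketch of this kind of task switching is shown below. The task names, the `adapters` dictionary and the random matrices are all hypothetical placeholders. The idea is only that one shared base matrix stays in memory while a small (B, A) pair is looked up per request.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 64, 8

W0 = rng.normal(size=(d, k))  # shared, frozen base weights

# One small (B, A) pair per task; names and values are placeholders.
adapters = {
    task: (rng.normal(size=(d, r)), rng.normal(size=(r, k)))
    for task in ["summarization", "translation", "sentiment"]
}

def forward(x, task):
    """Apply the shared base weights plus the adapter selected for this task."""
    B, A = adapters[task]
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=(k,))
outputs = {task: forward(x, task) for task in adapters}

base_params = W0.size
adapter_params = sum(B.size + A.size for B, A in adapters.values())
print(base_params, adapter_params)  # 32768 13824
```

In this toy setup, three task-specific adapters together take less than half the storage of a single copy of the base matrix.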

# Summary

LoRA represents a major step forward in the efficient adaptation of LLMs. Through matrix decomposition, it effectively reduces the number of trainable parameters, enabling more efficient deployment and task-switching without compromising on model quality.

As a result, it becomes a valuable asset for organizations seeking to harness the capabilities of large language models in a cost-effective and efficient manner.