Youssef Hosni (@youssefhosni95): "Everyone is fine-tuning LLMs. Almost nobody understands what is actually being updated inside the model. Here are 5 techniques that change how you think about model adaptation, and what each one is actually doing to the weights: 1./ LoRA

Everyone is fine-tuning LLMs.

Almost nobody understands what is actually being updated inside the model.

Here are 5 techniques that change how you think about model adaptation, and what each one is actually doing to the weights:

1./ LoRA - Learn the update, not the weights

The pretrained weight W is frozen. Completely untouched.

Instead of updating W directly, two small matrices are trained =>

A ∈ ℝʳˣᵈ and B ∈ ℝᵈˣʳ, where r ≪ d

The weight update is: ΔW = BA Effective weight: W' = W + BA

The entire adaptation happens in a tiny low-rank space. W never changes.

2./ LoRA-FA - What if we freeze even more?

Same structure as LoRA. One change.

A is frozen alongside W. Only B is trained. Effective weight: W' = W + BA (A is fixed)

Half the trainable matrices of LoRA. Same core idea. Fewer parameters.

3./ VeRA - What if the matrices don't need to be learned at all?

This is where it gets interesting.

A and B are both frozen, and randomly initialized. What gets trained are just two tiny scaling vectors =>

b ∈ ℝʳ and d ∈ ℝʳ

Instead of learning the low-rank matrices themselves, VeRA keeps them frozen and learns small scaling vectors that modulate their contribution.

Initialization => b = 0, d = 1

You're not learning matrices. You're learning how to scale them.

One of the most parameter-efficient techniques on this list.

4./ Delta-LoRA - What if W itself learns from the low-rank updates?

This one is fundamentally different.

Unlike standard LoRA, the base weight W is not fully frozen. It is updated through low-rank delta propagation at every step =>

W^(t+1) = W^t + c(B_(t+1)A_(t+1) − B_t A_t)

Where c is a scaling factor.

A and B are trainable. W evolves, but guided entirely by low-rank changes.

5./ LoRA+ - Same structure. Smarter learning rates.

Identical to LoRA, freeze W, train A and B.

One change => B is assigned a larger learning rate than A. η_B > η_A

A ← A − η_A · ∂J/∂A B ← B − η_B · ∂J/∂B

A small optimization change that can make LoRA training more effective.

The core idea running through all five:

You do not always need full fine-tuning to adapt a model.

Same goal. Five different answers to the same question:

=> What is the minimum we actually need to learn?

That is the core idea behind parameter-efficient fine-tuning.

And now you know what is actually happening inside the model.

May 29

8:26 AM