Nov. 2, 2023, 5 p.m. | /u/skelly0311

Machine Learning www.reddit.com

I'm confused as to why people would use only a subset of weights as learnable parameters for LoRA. If only the attention weights get the low-rank updates, you still need the weights of all the other layers during backpropagation to compute the derivative of the loss with respect to the attention weights. That's how the chain rule works, so I don't see how it would help with memory consumption. Is there something I'm missing here?
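
For concreteness, here's a minimal sketch of the setup I mean (assuming PyTorch; the `LoRALinear` module below is just a toy illustration, not any particular library's API). The frozen weights are still used in the forward and backward passes, but only the small LoRA factors get gradient buffers and optimizer state:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        # Base weight is frozen: no gradient buffer, no optimizer state.
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Only these small factors are learnable.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        return x @ (self.weight + self.lora_B @ self.lora_A).T

# Toy "block": LoRA on the attention projection, MLP fully frozen.
attn_proj = LoRALinear(16, 16)
mlp = nn.Linear(16, 16)
for p in mlp.parameters():
    p.requires_grad_(False)

x = torch.randn(2, 16)
loss = mlp(attn_proj(x)).sum()
loss.backward()

# The backward pass *does* use the frozen MLP weights (chain rule),
# but no per-parameter gradients are stored for them:
print(attn_proj.lora_A.grad is not None)  # True
print(mlp.weight.grad)                    # None

# Optimizer state (e.g. Adam's moment buffers) is allocated only for
# the trainable low-rank factors, not for the full frozen weights.
optimizer = torch.optim.Adam(
    [p for p in attn_proj.parameters() if p.requires_grad], lr=1e-3)
```

So the intermediate gradients needed by the chain rule do flow through the frozen layers either way; the question is whether that's the dominant memory cost compared to per-parameter gradients and optimizer state.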
