Oct. 8, 2023, 3:17 p.m. | /u/PlantsAreSoooAwesome

Machine Learning www.reddit.com

A recent [work](https://arxiv.org/abs/2305.17212) explores how weight decay controls the effective learning rate of different layers and neurons. This rotational behavior differs drastically between Adam with L2 regularization and Adam with decoupled weight decay (AdamW), and it appears to be the reason AdamW performs better in practice. It could also explain why normalization methods like weight standardization work so well, and irregular rotational behavior could contribute to the need for a learning rate warmup.
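For intuition, here is a minimal sketch (not the paper's code; the hyperparameters, vector size, and random "gradients" are illustrative assumptions) contrasting where weight decay enters the update in Adam with L2 regularization versus AdamW, and measuring the per-step rotation of a weight vector under AdamW:

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    # L2 regularization: the decay term is folded into the gradient,
    # so it is rescaled by the adaptive second moment below.
    g = g + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    # Decoupled weight decay: the moments see only the raw gradient,
    # and the decay shrinks the weights directly.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

# Toy check: angular (rotational) update of a weight vector under AdamW.
rng = np.random.default_rng(0)
w, m, v = rng.standard_normal(64), np.zeros(64), np.zeros(64)
for t in range(1, 201):
    g = rng.standard_normal(64)  # stand-in for a stochastic gradient
    w_new, m, v = adamw_step(w, g, m, v, t)
    cos = np.dot(w, w_new) / (np.linalg.norm(w) * np.linalg.norm(w_new))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    w = w_new
print(f"per-step rotation after 200 steps: {angle:.2f} degrees")
```

The only structural difference is the placement of the `wd * w` term; the toy loop just shows how one might track the angular change per step that the paper studies.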

**Full Abstract:**
Weight decay can significantly impact …

