Oct. 8, 2023, 3:17 p.m. | /u/PlantsAreSoooAwesome

Machine Learning www.reddit.com

A recent [work](https://arxiv.org/abs/2305.17212) explores how weight decay controls the effective learning rate of different layers and neurons. This rotational behavior differs drastically between Adam with L2 regularization and Adam with decoupled weight decay (AdamW), and it appears to be the reason AdamW performs better in practice. It could also explain why normalization methods like weight standardization work so well, and irregular rotational behavior could contribute to the need for a learning rate warmup.
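For intuition, here is a minimal sketch (not the paper's code; the hyperparameters, vector size, and random "gradients" are illustrative assumptions) contrasting where weight decay enters the update in Adam with L2 regularization versus AdamW, and measuring the per-step rotation of a weight vector under AdamW:

```python
import numpy as np

def adam_l2_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    # L2 regularization: the decay term is folded into the gradient,
    # so it is rescaled by the adaptive second moment below.
    g = g + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    # Decoupled weight decay: the moments see only the raw gradient,
    # and the decay shrinks the weights directly.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

# Toy check: angular (rotational) update of a weight vector under AdamW.
rng = np.random.default_rng(0)
w, m, v = rng.standard_normal(64), np.zeros(64), np.zeros(64)
for t in range(1, 201):
    g = rng.standard_normal(64)  # stand-in for a stochastic gradient
    w_new, m, v = adamw_step(w, g, m, v, t)
    cos = np.dot(w, w_new) / (np.linalg.norm(w) * np.linalg.norm(w_new))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    w = w_new
print(f"per-step rotation after 200 steps: {angle:.2f} degrees")
```

The only structural difference is the placement of the `wd * w` term; the toy loop just shows how one might track the angular change per step that the paper studies.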

**Full Abstract:**
Weight decay can significantly impact …

