Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed
June 10, 2024, 4:45 a.m. | Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov
cs.LG updates on arXiv.org
Abstract: Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence under such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the high-probability convergence of AdaGrad/Adam has not been studied in this case. In this work, we prove that AdaGrad (and its …
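To make the mechanism the abstract pairs with AdaGrad concrete, here is a minimal sketch: clip the stochastic gradient to a fixed norm radius before it enters the AdaGrad accumulator, which caps the influence of any single heavy-tailed noise spike. This is an illustrative toy, not the paper's exact algorithm or analysis; the function names, the clip_threshold parameter, and the Student-t noise model are assumptions chosen for the demo.

import numpy as np

def clipped_adagrad(grad_fn, x0, steps=1000, lr=0.1, clip_threshold=1.0, eps=1e-8):
    """Toy AdaGrad with per-step gradient clipping (illustrative sketch).

    grad_fn(x) returns a (possibly noisy) stochastic gradient at x.
    Clipping rescales the gradient to norm <= clip_threshold before it
    enters the AdaGrad accumulator, bounding the effect of noise spikes.
    """
    x = np.asarray(x0, dtype=float).copy()
    accum = np.zeros_like(x)  # running sum of squared (clipped) gradients
    for _ in range(steps):
        g = grad_fn(x)
        norm = np.linalg.norm(g)
        if norm > clip_threshold:
            g = g * (clip_threshold / norm)  # project onto the clipping ball
        accum += g * g
        x -= lr * g / (np.sqrt(accum) + eps)  # coordinate-wise adaptive stepsize
    return x

# Demo: quadratic objective with Student-t gradient noise (df=2 has
# infinite variance, a simple stand-in for heavy-tailed noise).
rng = np.random.default_rng(0)
target = np.array([3.0, -2.0])
noisy_grad = lambda x: 2 * (x - target) + rng.standard_t(df=2, size=x.shape)

print(clipped_adagrad(noisy_grad, x0=np.zeros(2)))  # converges near target

Without the clipping step, a single extreme noise draw can both derail the iterate and permanently inflate the accumulator; with it, each step's contribution is bounded, which is the intuition behind the high-probability guarantees the paper studies.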