Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed
June 10, 2024, 4:45 a.m. | Savelii Chezhegov, Yaroslav Klyukin, Andrei Semenov, Aleksandr Beznosikov, Alexander Gasnikov, Samuel Horváth, Martin Takáč, Eduard Gorbunov
cs.LG updates on arXiv.org
Abstract: Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the latter. Gradient clipping provably helps to achieve good high-probability convergence under such noise. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the high-probability convergence of AdaGrad/Adam has not been studied in this case. In this work, we prove that AdaGrad (and its …
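To make the mechanism the abstract pairs with AdaGrad concrete, here is a minimal sketch: clip the stochastic gradient to a fixed norm radius before it enters the AdaGrad accumulator, which caps the influence of any single heavy-tailed noise spike. This is an illustrative toy, not the paper's exact algorithm or analysis; the function names, the clip_threshold parameter, and the Student-t noise model are assumptions chosen for the demo.

import numpy as np

def clipped_adagrad(grad_fn, x0, steps=1000, lr=0.1, clip_threshold=1.0, eps=1e-8):
    """Toy AdaGrad with per-step gradient clipping (illustrative sketch).

    grad_fn(x) returns a (possibly noisy) stochastic gradient at x.
    Clipping rescales the gradient to norm <= clip_threshold before it
    enters the AdaGrad accumulator, bounding the effect of noise spikes.
    """
    x = np.asarray(x0, dtype=float).copy()
    accum = np.zeros_like(x)  # running sum of squared (clipped) gradients
    for _ in range(steps):
        g = grad_fn(x)
        norm = np.linalg.norm(g)
        if norm > clip_threshold:
            g = g * (clip_threshold / norm)  # project onto the clipping ball
        accum += g * g
        x -= lr * g / (np.sqrt(accum) + eps)  # coordinate-wise adaptive stepsize
    return x

# Demo: quadratic objective with Student-t gradient noise (df=2 has
# infinite variance, a simple stand-in for heavy-tailed noise).
rng = np.random.default_rng(0)
target = np.array([3.0, -2.0])
noisy_grad = lambda x: 2 * (x - target) + rng.standard_t(df=2, size=x.shape)

print(clipped_adagrad(noisy_grad, x0=np.zeros(2)))  # converges near target

Without the clipping step, a single extreme noise draw can both derail the iterate and permanently inflate the accumulator; with it, each step's contribution is bounded, which is the intuition behind the high-probability guarantees the paper studies.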