Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models
March 1, 2024, 5:43 a.m. | Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti
cs.LG updates on arXiv.org
Abstract: Adam has been shown to outperform gradient descent in optimizing large language transformers empirically, and by a larger margin than on other tasks, but it is unclear why this happens. We show that the heavy-tailed class imbalance found in language modeling tasks leads to difficulties in the optimization dynamics. When training with gradient descent, the loss associated with infrequent words decreases slower than the loss associated with frequent ones. As most samples come from relatively …
Subjects: cs.LG, cs.CL, math.OC, stat.ML
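The abstract's central claim is easy to probe in miniature: train the same softmax classifier on Zipf-distributed (heavy-tailed) labels with full-batch SGD and with Adam, then compare the loss on frequent versus rare classes. Below is a minimal sketch assuming PyTorch; the model size, Zipf exponent, learning rates, and frequency cutoffs are all illustrative choices, not the paper's experimental setup.

```python
# Illustrative sketch (not the authors' code): under heavy-tailed class
# imbalance, gradient descent reduces the loss on rare classes more slowly
# than Adam. All hyperparameters below are assumptions for the toy setup.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_classes, dim, n_samples = 100, 32, 20_000

# Heavy-tailed class imbalance: labels drawn from a Zipf-like distribution.
ranks = torch.arange(1, n_classes + 1, dtype=torch.float)
probs = ranks.pow(-1.0)
probs /= probs.sum()
labels = torch.multinomial(probs, n_samples, replacement=True)

# Inputs are noisy copies of a random per-class mean.
class_means = torch.randn(n_classes, dim)
inputs = class_means[labels] + 0.5 * torch.randn(n_samples, dim)

frequent = labels < 10            # the 10 most common classes
rare = labels >= n_classes // 2   # the rarer half of the classes

def train(opt_name, steps=500):
    model = torch.nn.Linear(dim, n_classes)
    if opt_name == "sgd":
        opt = torch.optim.SGD(model.parameters(), lr=0.5)
    else:
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        # Full-batch loss, mimicking the gradient-descent setting.
        loss = F.cross_entropy(model(inputs), labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        per_sample = F.cross_entropy(model(inputs), labels, reduction="none")
        return per_sample[frequent].mean().item(), per_sample[rare].mean().item()

for name in ("sgd", "adam"):
    freq_loss, rare_loss = train(name)
    print(f"{name}: frequent-class loss {freq_loss:.3f}, rare-class loss {rare_loss:.3f}")
```

Run as written, the gap between frequent- and rare-class loss should be noticeably larger under SGD than under Adam, mirroring the dynamics the abstract describes; exact numbers will vary with the seed and hyperparameters.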