Web: http://arxiv.org/abs/2201.11853

Jan. 31, 2022, 2:11 a.m. | Heting Liu, Zhichao Li, Cheng Tan, Rongqiu Yang, Guohong Cao, Zherui Liu, Chuanxiong Guo

cs.LG updates on arXiv.org arxiv.org

Graphics processing units (GPUs) are the de facto standard for processing
deep learning (DL) tasks. Meanwhile, GPU failures, which are inevitable, cause
severe consequences in DL tasks: they disrupt distributed trainings, crash
inference services, and result in service level agreement violations. To
mitigate the problem caused by GPU failures, we propose to predict failures by
using ML models. This paper is the first to study prediction models of GPU
failures under large-scale production deep learning workloads. As a starting
point, …

arxiv deep deep learning gpu learning prediction

More from arxiv.org / cs.LG updates on arXiv.org

Senior Data Engineer

@ DAZN | Hammersmith, London, United Kingdom

Sr. Data Engineer, Growth

@ Netflix | Remote, United States

Data Engineer - Remote

@ Craft | Wrocław, Lower Silesian Voivodeship, Poland

Manager, Operations Data Science

@ Binance.US | Vancouver

Senior Machine Learning Researcher for Copilot

@ GitHub | Remote - Europe

Sr. Marketing Data Analyst

@ HoneyBook | San Francisco, CA