April 2, 2024, 7:52 p.m. | Chen Yang, Junzhuo Li, Xinyao Niu, Xinrun Du, Songyang Gao, Haoran Zhang, Zhaoliang Chen, Xingwei Qu, Ruibin Yuan, Yizhi Li, Jiaheng Liu, Stephen W. H

cs.CL updates on arXiv.org arxiv.org

arXiv:2404.01204v1 Announce Type: new
Abstract: Uncovering early-stage metrics that reflect final model performance is one core principle for large-scale pretraining. The existing scaling law demonstrates the power-law correlation between pretraining loss and training flops, which serves as an important indicator of the current training state for large language models. However, this principle only focuses on the model's compression properties on the training data, resulting in an inconsistency with the ability improvements on the downstream tasks. Some follow-up works attempted to …

abstract analysis arxiv capability core correlation cs.cl current language language model large language large language model law line loss metrics performance power power-law pretraining scale scaling scaling law stage state streaming training type

Senior Machine Learning Engineer

@ GPTZero | Toronto, Canada

ML/AI Engineer / NLP Expert - Custom LLM Development (x/f/m)

@ HelloBetter | Remote

Doctoral Researcher (m/f/div) in Automated Processing of Bioimages

@ Leibniz Institute for Natural Product Research and Infection Biology (Leibniz-HKI) | Jena

Seeking Developers and Engineers for AI T-Shirt Generator Project

@ Chevon Hicks | Remote

Data Scientist, Mid

@ Booz Allen Hamilton | DEU, Stuttgart (Kurmaecker St)

Tech Excellence Data Scientist

@ Booz Allen Hamilton | Undisclosed Location - USA, VA, Mclean