June 11, 2024, 4:41 a.m. | Zichun Yu, Spandan Das, Chenyan Xiong

cs.CL updates on arXiv.org

arXiv:2406.06046v1
Abstract: Pretraining data selection has the potential to improve language model pretraining efficiency by utilizing higher-quality data from massive web data corpora. Current data selection methods, which rely on either hand-crafted rules or larger reference models, are conducted statically and do not capture the evolving data preferences during pretraining. In this paper, we introduce model-aware data selection with data influence models (MATES), where a data influence model continuously adapts to the evolving data preferences of the …


