all AI news
QuRating: Selecting High-Quality Data for Training Language Models
Feb. 16, 2024, 5:43 a.m. | Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen
cs.LG updates on arXiv.org arxiv.org
Abstract: Selecting high-quality pre-training data is important for creating capable language models, but existing methods rely on simple heuristics. We introduce QuRating, a method for selecting pre-training data that captures the abstract qualities of texts which humans intuitively perceive. In this paper, we investigate four qualities - writing style, required expertise, facts & trivia, and educational value. We find that LLMs are able to discern these qualities and observe that they are better at making pairwise …
abstract arxiv cs.cl cs.lg data heuristics humans language language models paper pre-training quality quality data simple training training data type writing
More from arxiv.org / cs.LG updates on arXiv.org
Jobs in AI, ML, Big Data
Data Architect
@ University of Texas at Austin | Austin, TX
Data ETL Engineer
@ University of Texas at Austin | Austin, TX
Lead GNSS Data Scientist
@ Lurra Systems | Melbourne
Senior Machine Learning Engineer (MLOps)
@ Promaton | Remote, Europe
Risk Management - Machine Learning and Model Delivery Services, Product Associate - Senior Associate-
@ JPMorgan Chase & Co. | Wilmington, DE, United States
Senior ML Engineer (Speech/ASR)
@ ObserveAI | Bengaluru