Web: http://arxiv.org/abs/2111.06467

Jan. 14, 2022, 2:11 a.m. | Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, Sebastian Gehrmann

cs.LG updates on arXiv.org arxiv.org

NLP researchers need more, higher-quality text datasets. Human-labeled
datasets are expensive to collect, while datasets collected via automatic
retrieval from the web such as WikiBio are noisy and can include undesired
biases. Moreover, data sourced from the web is often included in datasets used
to pretrain models, leading to inadvertent cross-contamination of training and
test sets. In this work we introduce a novel method for efficient dataset
curation: we use a large language model to provide seed generations to human
raters, thereby changing dataset authoring from a writing task to …

ai arxiv case study collaborative datasets human study text

Statistics and Computer Science Specialist

@ Hawk-Research | Remote

Data Scientist, Credit/Fraud Strategy

@ Fora Financial | New York City

Postdoctoral Research Associate - Biomedical Natural Language Processing and Deep Learning

@ Oak Ridge National Laboratory - Oak Ridge, TN | Oak Ridge, TN, United States

Senior Machine Learning / Computer Vision Engineer

@ Glass Imaging | Los Altos, CA

Research Scientist in Biomedical Natural Language Processing and Deep Learning

@ Oak Ridge National Laboratory | Oak Ridge, TN

W3-Professorship for Intelligent Energy Management

@ Universität Bayreuth | Bayreuth, Germany