Jan. 14, 2022, 2:11 a.m. | Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, Sebastian Gehrmann

cs.LG updates on arXiv.org

NLP researchers need more, higher-quality text datasets. Human-labeled
datasets are expensive to collect, while datasets collected via automatic
retrieval from the web, such as WikiBio, are noisy and can include undesired
biases. Moreover, data sourced from the web is often included in datasets used
to pretrain models, leading to inadvertent cross-contamination of training and
test sets. In this work we introduce a novel method for efficient dataset
curation: we use a large language model to provide seed generations to human …

