March 1, 2024, 5:49 a.m. | Jiantao Qiu, Haijun Lv, Zhenjiang Jin, Rui Wang, Wenchang Ning, Jia Yu, ChaoBin Zhang, Pei Chu, Yuan Qu, Runyu Peng, Zhiyuan Zeng, Huanze Tang, Ruilia

cs.CL updates on arXiv.org arxiv.org

arXiv:2402.19282v1 Announce Type: new
Abstract: This paper presents WanJuan-CC, a safe and high-quality open-sourced English webtext dataset derived from Common Crawl data. The study addresses the challenges of constructing large-scale pre-training datasets for language models, which require vast amounts of high-quality data. A comprehensive process was designed to handle Common Crawl data, including extraction, heuristic rule filtering, fuzzy deduplication, content safety filtering, and data quality filtering. From approximately 68 billion original English documents, we obtained 2.22T Tokens of safe data …

abstract arxiv challenges cs.cl data dataset datasets english language language models paper pre-training process quality quality data scale study training training datasets type vast

Senior Machine Learning Engineer

@ GPTZero | Toronto, Canada

Customer Data Analyst with Spanish

@ Michelin | Voluntari

HC Data Analyst - Senior

@ Leidos | 1662 Intelligence Community Campus - Bethesda MD

Healthcare Research & Data Analyst- Infectious, Niche, Rare Disease

@ Clarivate | Remote (121- Massachusetts)

Data Analyst (maternity leave cover)

@ Clarivate | R155-Belgrade

Sales Enablement Data Analyst (Remote)

@ CrowdStrike | USA TX Remote