Web: https://www.reddit.com/r/MachineLearning/comments/saxnbt/d_preprocessing_of_wikipedia_dumps_for_language/

Jan. 23, 2022, 4:40 p.m. | /u/optimized-adam

Machine Learning reddit.com

I want to train a language model from scratch on the Wikipedia dumps for a language, say French. I download the dumps and extract them using the wikiextractor tool. I lower-case everything but keep all the accents, since they are important for French. So far so good, but now it gets blurry.
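A minimal sketch of that pipeline, assuming the dump was extracted with wikiextractor's --json option so each output line is a JSON object with a "text" field; the directory layout, file names, and output path below are illustrative assumptions, not details from the post:

    # Read wikiextractor output and lower-case the text while keeping accents.
    # Assumes extraction like:
    #   python -m wikiextractor.WikiExtractor frwiki-latest-pages-articles.xml.bz2 --json -o extracted
    # so each line of extracted/*/wiki_* is a JSON object with a "text" field
    # (layout and names are assumptions; adjust to your setup).
    import json
    from pathlib import Path


    def iter_articles(extracted_dir: str):
        """Yield the plain text of each extracted article."""
        for path in sorted(Path(extracted_dir).glob("*/wiki_*")):
            with path.open(encoding="utf-8") as handle:
                for line in handle:
                    doc = json.loads(line)
                    text = doc.get("text", "").strip()
                    if text:
                        yield text


    def preprocess(text: str) -> str:
        """Lower-case the text; str.lower() is Unicode-aware, so French
        accents (é, è, ç, ...) are preserved rather than stripped."""
        return text.lower()


    if __name__ == "__main__":
        with open("frwiki_clean.txt", "w", encoding="utf-8") as out:
            for article in iter_articles("extracted"):
                out.write(preprocess(article) + "\n\n")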

There is very little information about the specifics of the preprocessing people apply to the dumps before training a tokenizer and feeding the data into the model (a tokenizer-training sketch follows the questions below).

  1. How are section …
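Since the question is about what the text should look like before tokenizer training, here is a minimal sketch of that next step using the Hugging Face tokenizers library; the library choice, vocabulary size, special tokens, and file paths are assumptions, as the post does not name any of them:

    # Train a byte-level BPE tokenizer on the preprocessed corpus from above.
    # Library (Hugging Face `tokenizers`), vocab size, and special tokens are
    # illustrative assumptions, not settings from the post.
    import os

    from tokenizers import ByteLevelBPETokenizer

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=["frwiki_clean.txt"],
        vocab_size=32_000,
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )

    # Writes vocab.json and merges.txt for reuse when training the model.
    os.makedirs("fr_tokenizer", exist_ok=True)
    tokenizer.save_model("fr_tokenizer")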

Tags: language, machinelearning, modeling, wikipedia
