[D] Preprocessing of Wikipedia Dumps for Language Modeling from Scratch
Jan. 23, 2022, 4:40 p.m. | /u/optimized-adam
Machine Learning www.reddit.com
I want to train a language model from scratch on the Wikipedia dumps of a language, say French. I download the dumps and extract them using the wikiextractor tool. I lower-case everything but keep all the accents, since they are important for French. So far so good, but now it gets blurry.
There is very little information about the specific preprocessing people apply to the dumps before training tokenizers and feeding the data into the model.
- How are section …
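For the lower-casing step described above, a minimal sketch might look like the following. It assumes wikiextractor output has already been read in as plain text; the NFC normalization call is an added precaution (not mentioned in the post) so that accented characters are composed into single code points before lower-casing, which preserves accents such as "É" → "é".

```python
import unicodedata


def preprocess(text: str) -> str:
    # Normalize to NFC so accented characters are single code points,
    # then lower-case; str.lower() keeps the accents intact.
    return unicodedata.normalize("NFC", text).lower()


print(preprocess("Élève à l'École"))  # élève à l'école
```

Whether further steps (stripping markup remnants, handling section headers, sentence splitting) belong here is exactly the open question raised in the post.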