[D] Preprocessing of Wikipedia Dumps for Language Modeling from Scratch

Jan. 23, 2022, 4:40 p.m. | /u/optimized-adam

Machine Learning www.reddit.com

I want to train a language model from scratch on the Wikipedia dumps of a language, say French. I download the dumps and extract the text using the wikiextractor tool. I lower-case everything but keep all the accents, since they are important for French. So far so good, but from here on it gets blurry.
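For concreteness, here is a rough sketch of the lower-casing step described above. It assumes the extracted articles are available as JSON lines with a "text" field (an assumption about how the extractor was run, not something stated in the post), and the helper names `normalize` and `iter_texts` are made up for illustration. The NFC normalization is an extra step of my own so that composed and decomposed accents end up identical.

```python
import json
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lower-case text while keeping accented characters intact."""
    # NFC composes a base letter plus combining accent into one code point,
    # so "e" + U+0301 and a precomposed "é" become identical.
    return unicodedata.normalize("NFC", text).lower()


def iter_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield normalized article bodies from JSON-lines input.

    Assumption: one JSON object per line with a "text" field.
    Blank lines and articles with empty text are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        text = doc.get("text", "")
        if text:
            yield normalize(text)
```

Note that `str.lower()` never strips accents by itself; accents only get lost if a separate ASCII-folding or NFKD-plus-filter step is applied, which this sketch deliberately avoids.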

There is very little information about the specific preprocessing steps people apply to the dumps before training tokenizers and feeding the data into the model.

How are section …


