Aug. 24, 2023, 3:15 p.m. | /u/Tomarchelone

Machine Learning www.reddit.com

The Tokenizers library used to be very nice to use. It had one main class, `Tokenizer`, which took all of its parameters in the constructor and kept all the necessary methods, like `.train()`, `.encode()` and `.decode()`, at hand. It provided reasonable defaults and allowed for customization if needed.

Now it is a complete mess. To train a tokenizer I have to create instances of, like, 5-6 classes: `PreTokenizer`, `Model`, `Tokenizer`, `Trainer`, `Decoder`... It is quite difficult to understand what variants of those …
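For anyone who has not seen the component-based setup being described, here is a minimal sketch of what training looks like with the `tokenizers` package. The choice of a BPE model with byte-level pre-tokenization and decoding, the toy corpus, the vocabulary size, and the helper name `train_bpe_tokenizer` are all assumptions made for illustration; none of them come from the post.

```python
from tokenizers import Tokenizer, decoders, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer


def train_bpe_tokenizer(corpus, vocab_size=300):
    """Hypothetical helper: wire the components together and train on an iterable of strings."""
    # Model: the algorithm that maps pre-tokenized pieces to vocabulary ids.
    tokenizer = Tokenizer(BPE())
    # Pre-tokenizer: splits raw text and maps bytes to printable characters.
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    # Decoder: reverses the byte-level mapping when turning ids back into text.
    tokenizer.decoder = decoders.ByteLevel()
    # Trainer: holds the training hyperparameters for the chosen model.
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tokenizer.train_from_iterator(corpus, trainer=trainer)
    return tokenizer


# Toy corpus, assumed for the example only.
corpus = ["hello world", "hello tokenizers", "the world of tokenizers"]
tokenizer = train_bpe_tokenizer(corpus)

encoding = tokenizer.encode("hello world")
decoded = tokenizer.decode(encoding.ids)
print(encoding.tokens, decoded)
```

That is four separate pieces (model, pre-tokenizer, decoder, trainer) before a single `.encode()` call, and each piece has several variants to choose from.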

