May 19, 2023, 1:45 p.m. | /u/dxg39

Machine Learning www.reddit.com

I made a small project to test whether popular tokenizers are lossless, i.e. whether decoding the encoded input reproduces the original text exactly.

Turns out most of them are not.

https://github.com/skeskinen/hf-tokenizer-testing
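The round-trip check itself is simple: encode, decode, and compare against the original string. Here is a minimal sketch with two toy tokenizers (illustrative stand-ins I made up, not the tokenizers tested in the repo) showing how a lossy one fails the check while a byte-level one passes by construction:

```python
# Minimal sketch of a round-trip ("lossless") check for tokenizers.
# Both toy tokenizers below are illustrative stand-ins, not the
# Hugging Face tokenizers tested in the linked repo.

def is_lossless(encode, decode, text):
    """True if decode(encode(text)) reproduces text exactly."""
    return decode(encode(text)) == text

# A whitespace tokenizer: splitting on whitespace discards the exact
# spacing, so decoding by joining with single spaces is lossy.
ws_encode = lambda s: s.split()
ws_decode = lambda toks: " ".join(toks)

# A byte-level tokenizer: every byte maps to one token id, so the
# round trip is lossless by construction.
byte_encode = lambda s: list(s.encode("utf-8"))
byte_decode = lambda ids: bytes(ids).decode("utf-8")

text = "Hello,  world!\n"  # note the double space and trailing newline
print(is_lossless(ws_encode, ws_decode, text))      # False
print(is_lossless(byte_encode, byte_decode, text))  # True
```

Whitespace normalization is one common culprit; others include Unicode normalization (NFC/NFKC) and special-token handling, any of which can make `decode(encode(x)) != x`.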

Does it matter if tokenizers can/can't reproduce the input exactly? I guess this is subjective, but I'd say it's at least a nice feature. A feature that (perhaps surprisingly?) most tokenizers out there don't seem to have.

I wrote this for myself on a quest to find a tokenizer …

