June 4, 2023, 2:33 p.m. | /u/Pan000

Natural Language Processing www.reddit.com

**From the** [**GitHub**](https://github.com/alasdairforsythe/tokenmonster)**:**

TokenMonster is an ungreedy tokenizer and vocabulary builder, outperforming tiktoken by 35%. In fact, TokenMonster's smallest 24000 vocabulary consistently uses less tokens than tiktoken's largest 100256 vocabulary to tokenize the same text. Save the tokens! [See benchmark](https://github.com/alasdairforsythe/tokenmonster/blob/main/benchmark).

Given a text dataset, a vocabulary-size and a maximum-token-length, TokenMonster selects the tokens that optimally represent your dataset at that vocabulary size. It can do this at reasonable speed (within 24 hours) on server hardware, at a cost of around …

binary characters context etc faster features gpu html languagetechnology per spaces speech speed text text generation words

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

Data Engineer

@ Parker | New York City

Sr. Data Analyst | Home Solutions

@ Three Ships | Raleigh or Charlotte, NC