TokenMonster Ungreedy ~ 35% faster inference and 35% increased context-length for large language models (compared to tiktoken). Benchmarks included
June 4, 2023, 2:33 p.m. | /u/Pan000
Natural Language Processing www.reddit.com
TokenMonster is an ungreedy tokenizer and vocabulary builder, outperforming tiktoken by 35%. In fact, TokenMonster's smallest 24000 vocabulary consistently uses fewer tokens than tiktoken's largest 100256 vocabulary to tokenize the same text. Save the tokens! [See benchmark](https://github.com/alasdairforsythe/tokenmonster/blob/main/benchmark).
Given a text dataset, a vocabulary size, and a maximum token length, TokenMonster selects the tokens that optimally represent your dataset at that vocabulary size. It can do this at reasonable speed (within 24 hours) on server hardware, at a cost of around …
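To get a rough sense of the comparison the benchmark is making, the sketch below counts the tokens each tokenizer produces for the same string. It assumes the `tokenmonster` Python bindings expose `load()` and `tokenize()` as described in the repository README; the vocabulary name `"english-24000-consistent-v1"` is illustrative and may not match a vocabulary that actually ships with TokenMonster.

```python
# Minimal sketch: compare token counts between tiktoken and TokenMonster
# on the same text. Assumes `pip install tiktoken tokenmonster`; the
# TokenMonster vocabulary name below is an assumption for illustration.
import tiktoken
import tokenmonster

text = "TokenMonster selects the tokens that optimally represent your dataset."

# tiktoken's largest standard vocabulary (100256 tokens, cl100k_base)
enc = tiktoken.get_encoding("cl100k_base")
tiktoken_ids = enc.encode(text)

# A 24000-token TokenMonster vocabulary (name assumed, not verified)
vocab = tokenmonster.load("english-24000-consistent-v1")
tm_ids = vocab.tokenize(text)

print(f"tiktoken cl100k_base: {len(tiktoken_ids)} tokens")
print(f"TokenMonster 24000:   {len(tm_ids)} tokens")
```

If the benchmark's claim holds, the 24000-token TokenMonster vocabulary should emit no more tokens than cl100k_base for typical English text, which is what translates into the quoted inference-speed and context-length gains.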