Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models
March 19, 2024, 4:53 a.m. | Mohamed Taher Alrefaie, Nour Eldin Morsy, Nada Samir
cs.CL updates on arXiv.org arxiv.org
Abstract: This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural language processing tasks. Our investigation focuses on the effectiveness of four tokenizers across various tasks, including News Classification, Hate Speech Detection, Sentiment Analysis, and Natural Language Inference. Leveraging a diverse set of vocabulary sizes, we scrutinize the intricate interplay between tokenization approaches and model performance. The results reveal that Byte …
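The abstract's central variable, vocabulary size under a subword tokenizer, can be illustrated with a minimal byte-pair-encoding sketch. This is not the paper's actual tokenizer setup; the toy corpus and merge counts below are invented for illustration. Growing the merge budget (a proxy for vocabulary size) shrinks the average number of tokens per word:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in vocab.items():
            for i in range(len(word) - 1):
                pair_counts[(word[i], word[i + 1])] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

def avg_tokens_per_word(vocab):
    return sum(len(w) * f for w, f in vocab.items()) / sum(vocab.values())

# Hypothetical mini-corpus; a real study would use Arabic text and
# morphology-aware tokenizers.
corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
for n in (0, 5, 10):
    _, vocab = train_bpe(corpus, n)
    print(f"{n:2d} merges -> avg tokens/word = {avg_tokens_per_word(vocab):.2f}")
```

The same trade-off the paper studies appears even at this scale: a larger merge budget yields shorter token sequences but a bigger symbol inventory, which is why vocabulary size interacts with downstream task performance.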