March 19, 2024, 4:53 a.m. | Mohamed Taher Alrefaie, Nour Eldin Morsy, Nada Samir

cs.CL updates on arXiv.org arxiv.org

arXiv:2403.11130v1 Announce Type: new
Abstract: This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural language processing tasks. Our investigation focused on the effectiveness of four tokenizers across various tasks, including News Classification, Hate Speech Detection, Sentiment Analysis, and Natural Language Inference. Leveraging a diverse set of vocabulary sizes, we scrutinize the intricate interplay between tokenization approaches and model performance. The results reveal that Byte …

abstract arabic arxiv classification cs.cl impact investigation language language models language processing natural natural language natural language processing paper performance processing strategies tasks tokenization type

AI Research Scientist

@ Vara | Berlin, Germany and Remote

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Data Engineer (m/f/d)

@ Project A Ventures | Berlin, Germany

Principle Research Scientist

@ Analog Devices | US, MA, Boston