Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models
March 19, 2024, 4:53 a.m. | Mohamed Taher Alrefaie, Nour Eldin Morsy, Nada Samir
cs.CL updates on arXiv.org arxiv.org
Abstract: This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural language processing tasks. Our investigation focuses on the effectiveness of four tokenizers across various tasks, including News Classification, Hate Speech Detection, Sentiment Analysis, and Natural Language Inference. Leveraging a diverse set of vocabulary sizes, we scrutinize the intricate interplay between tokenization approaches and model performance. The results reveal that Byte …
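The abstract's central variable, vocabulary size under a subword tokenizer, can be illustrated with a minimal byte-pair-encoding sketch. This is not the paper's actual tokenizer setup; the toy corpus and merge counts below are invented for illustration. Growing the merge budget (a proxy for vocabulary size) shrinks the average number of tokens per word:

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of single-character symbols.
    vocab = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in vocab.items():
            for i in range(len(word) - 1):
                pair_counts[(word[i], word[i + 1])] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

def avg_tokens_per_word(vocab):
    return sum(len(w) * f for w, f in vocab.items()) / sum(vocab.values())

# Hypothetical mini-corpus; a real study would use Arabic text and
# morphology-aware tokenizers.
corpus = ["low", "low", "lower", "newest", "newest", "newest", "widest"]
for n in (0, 5, 10):
    _, vocab = train_bpe(corpus, n)
    print(f"{n:2d} merges -> avg tokens/word = {avg_tokens_per_word(vocab):.2f}")
```

The same trade-off the paper studies appears even at this scale: a larger merge budget yields shorter token sequences but a bigger symbol inventory, which is why vocabulary size interacts with downstream task performance.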