101 Billion Arabic Words Dataset | allainews.com

May 6, 2024, 4:47 a.m. | Manel Aloui, Hasna Chouikhi, Ghaith Chaabane, Haithem Kchaou, Chehir Dhaouadi

cs.CL updates on arXiv.org arxiv.org

arXiv:2405.01590v1
Abstract: In recent years, Large Language Models have revolutionized the field of natural language processing, showcasing an impressive rise predominantly in English-centric domains. These advancements have set a global benchmark, inspiring significant efforts toward developing Arabic LLMs capable of understanding and generating the Arabic language with remarkable accuracy. Despite these advancements, a critical challenge persists: the potential bias in Arabic LLMs, primarily attributed to their reliance on datasets comprising English data that has been translated into …
