Feb. 6, 2024, 5:46 a.m. | Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan

cs.LG updates on arXiv.org

In this work, we present a comprehensive exploration of finetuning Malaysian language models, specifically Llama2 and Mistral, on embedding tasks involving negative and positive pairs (a contrastive setup of the kind sketched below). We release two distinct models tailored for Semantic Similarity and Retrieval-Augmented Generation (RAG).
For Semantic Similarity, our 600-million-parameter Llama2 model outperforms OpenAI text-embedding-ada-002 across all recall@k metrics on the b.cari.com.my, c.cari.com.my, Malay news, and Malaysian Twitter test sets.
In the realm of RAG models, our approach proves competitive with OpenAI text-embedding-ada-002 in the Malaysian …
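
The abstract does not include the training objective itself, but finetuning an embedding model on negative and positive pairs is typically done with an InfoNCE-style contrastive loss. The sketch below is a minimal illustration of that setup, not the paper's actual training code; the temperature value, in-batch negatives, and hard-negative handling are all assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, negative_emb, temperature=0.05):
    """InfoNCE-style loss over (anchor, positive, negative) embedding triples.

    anchor_emb, positive_emb, negative_emb: (batch, dim) tensors.
    Each anchor is pulled toward its positive and pushed away from its
    explicit hard negative plus the positives of all other anchors
    (in-batch negatives).
    """
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    n = F.normalize(negative_emb, dim=-1)

    # Similarity of each anchor to every positive in the batch ...
    logits_pos = a @ p.T                         # (batch, batch)
    # ... and to its own explicit hard negative.
    logits_neg = (a * n).sum(-1, keepdim=True)   # (batch, 1)

    logits = torch.cat([logits_pos, logits_neg], dim=1) / temperature
    # The matching positive for anchor i sits at column i.
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)
```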
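
For reference, recall@k (the metric the Semantic Similarity comparison reports) measures the fraction of queries whose gold document lands in the top k retrieved results. A minimal NumPy sketch follows; the use of cosine similarity for ranking is an assumption, since the abstract does not specify the retrieval scoring.

```python
import numpy as np

def recall_at_k(query_emb, doc_emb, relevant_idx, k=5):
    """Fraction of queries whose relevant document appears in the
    top-k documents ranked by cosine similarity.

    query_emb: (n_queries, dim) array; doc_emb: (n_docs, dim) array;
    relevant_idx: (n_queries,) array with each query's gold doc index.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T                            # (n_queries, n_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]   # top-k doc indices per query
    hits = (topk == relevant_idx[:, None]).any(axis=1)
    return hits.mean()
```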
