Feb. 6, 2024, 5:46 a.m. | Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan

cs.LG updates on arXiv.org

In this work, we present a comprehensive exploration of finetuning Malaysian language models, specifically Llama2 and Mistral, on embedding tasks involving negative and positive pairs. We release two distinct models tailored for Semantic Similarity and Retrieval-Augmented Generation (RAG).
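The paper does not include code here, but the core idea of finetuning on negative and positive pairs is a contrastive objective: the embedding of an anchor text is pulled toward its positive pair and pushed away from negatives. Below is a minimal, hypothetical sketch of such a loss (InfoNCE-style) over precomputed embedding vectors; the function name, temperature value, and the use of plain NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.05):
    """Contrastive (InfoNCE-style) loss: low when the anchor is most
    similar to its positive pair, high when a negative wins instead.
    Illustrative sketch only -- not the paper's actual training code."""
    def cos(a, b):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Similarity of the anchor to the positive (index 0) and each negative.
    sims = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = sims / temperature
    # Softmax cross-entropy with the positive treated as the correct "class".
    exps = np.exp(logits - logits.max())
    return float(-np.log(exps[0] / exps.sum()))
```

In real finetuning this loss would be computed on pooled hidden states of the language model and backpropagated through its weights; the sketch only shows the pair-based objective itself.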
For Semantic Similarity, our 600-million-parameter Llama2 model outperforms OpenAI's text-embedding-ada-002 across all recall@k metrics on the b.cari.com.my, c.cari.com.my, Malay news, and Malaysian Twitter test sets.
In the realm of RAG models, our approach proves competitive with OpenAI text-embedding-ada-002 in the Malaysian …

