all AI news
ParaCotta: Synthetic Multilingual Paraphrase Corpora from the Most Diverse Translation Sample Pair. (arXiv:2205.04651v1 [cs.CL])
May 11, 2022, 1:11 a.m. | Alham Fikri Aji, Tirana Noor Fatyanosa, Radityo Eko Prasojo, Philip Arthur, Suci Fitriany, Salma Qonitah, Nadhifa Zulfa, Tomi Santoso, Mahendra Data
cs.CL updates on arXiv.org arxiv.org
We release our synthetic parallel paraphrase corpus across 17 languages:
Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi,
Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and
Chinese. Our method relies only on monolingual data and a neural machine
translation system to generate paraphrases, hence simple to apply. We generate
multiple translation samples using beam search and choose the most lexically
diverse pair according to their sentence BLEU. We compare our generated corpus
with the \texttt{ParaBank2}. According to our evaluation, …
More from arxiv.org / cs.CL updates on arXiv.org
Jobs in AI, ML, Big Data
Data Architect
@ University of Texas at Austin | Austin, TX
Data ETL Engineer
@ University of Texas at Austin | Austin, TX
Lead GNSS Data Scientist
@ Lurra Systems | Melbourne
Senior Machine Learning Engineer (MLOps)
@ Promaton | Remote, Europe
Data Analytics & Insight Specialist, Customer Success
@ Fortinet | Ottawa, ON, Canada
Account Director, ChatGPT Enterprise - Majors
@ OpenAI | Remote - Paris