May 11, 2022, 1:11 a.m. | Alham Fikri Aji, Tirana Noor Fatyanosa, Radityo Eko Prasojo, Philip Arthur, Suci Fitriany, Salma Qonitah, Nadhifa Zulfa, Tomi Santoso, Mahendra Data

cs.CL updates on arXiv.org

We release our synthetic parallel paraphrase corpus across 17 languages:
Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi,
Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and
Chinese. Our method relies only on monolingual data and a neural machine
translation system to generate paraphrases, and is therefore simple to apply.
We generate multiple translation samples using beam search and choose the most
lexically diverse pair according to their sentence BLEU. We compare our
generated corpus with ParaBank2. According to our evaluation, …
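
The pair-selection step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the BLEU variant, the tokenization, or how the two directions are combined, so the add-one smoothing, whitespace tokenization, and symmetric averaging below are all assumptions, and the function names are hypothetical. The candidates stand in for beam-search outputs of a translation system.

```python
from collections import Counter
from itertools import combinations
import math


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence BLEU with add-one smoothing on whitespace tokens.

    Assumption: the smoothing and tokenization are illustrative choices,
    not taken from the paper.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Counter intersection keeps the minimum count: clipped n-gram matches.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty applies only when the hypothesis is shorter.
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity * math.exp(log_precision)


def most_diverse_pair(candidates):
    """Return the candidate pair with the lowest symmetric sentence BLEU.

    Assumption: averaging both directions is one way to make the score
    order-independent. Requires at least two candidates.
    """
    def symmetric_bleu(pair):
        a, b = pair
        return (sentence_bleu(a, b) + sentence_bleu(b, a)) / 2

    return min(combinations(candidates, 2), key=symmetric_bleu)


if __name__ == "__main__":
    # Hypothetical beam-search outputs for one source sentence.
    samples = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a feline was sitting on the rug",
    ]
    print(most_diverse_pair(samples))
```

A lower BLEU between two candidates means less n-gram overlap, so taking the minimum favors pairs that differ in wording. Any check that both candidates preserve the source meaning is outside the scope of this sketch.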

