March 25, 2024, 4:46 a.m. | Enora Rice, Ali Marashian, Luke Gessler, Alexis Palmer, Katharina von der Wense

cs.CL updates on arXiv.org

arXiv:2403.14840v1 Announce Type: new
Abstract: Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes. This is a core task in language documentation, and NLP systems have the potential to dramatically speed up this process. But in typical language documentation settings, training data for canonical morpheme segmentation is scarce, making it difficult to train high-quality models. However, translation data is often much more abundant, and, in this work, we present …

