April 5, 2024, 4:47 a.m. | Botond Barta, Dorina Lakatos, Attila Nagy, Milán Konor Nyist, Judit Ács

cs.CL updates on arXiv.org

arXiv:2404.03555v1 Announce Type: new
Abstract: Training summarization models requires substantial amounts of training data. However, for lower-resource languages like Hungarian, openly available models and datasets are notably scarce. To address this gap, our paper introduces HunSum-2, an open-source Hungarian corpus suitable for training abstractive and extractive summarization models. The dataset is assembled from segments of the Common Crawl corpus that undergo thorough cleaning, preprocessing, and deduplication. In addition to abstractive summarization, we generate sentence-level labels for extractive summarization using sentence …

