From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization
April 5, 2024, 4:47 a.m. | Botond Barta, Dorina Lakatos, Attila Nagy, Milán Konor Nyist, Judit Ács
cs.CL updates on arXiv.org
Abstract: Training summarization models requires substantial amounts of training data. However, for lower-resource languages such as Hungarian, openly available models and datasets are notably scarce. To address this gap, our paper introduces HunSum-2, an open-source Hungarian corpus suitable for training abstractive and extractive summarization models. The dataset is assembled from segments of the Common Crawl corpus that undergo thorough cleaning, preprocessing, and deduplication. In addition to abstractive summarization, we generate sentence-level labels for extractive summarization using sentence …
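The abstract names deduplication as one step of corpus construction but does not say how it is done. As a rough illustration only, the sketch below removes exact duplicates by hashing whitespace- and case-normalized text. The function names `normalize` and `deduplicate` and the sample sentences are hypothetical, and this is not claimed to be the method the HunSum-2 authors used.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(articles: list[str]) -> list[str]:
    """Keep the first occurrence of each article, comparing normalized text.

    Assumption: exact-match deduplication on normalized text; the paper's
    actual procedure is not described in the abstract.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for article in articles:
        digest = hashlib.sha256(normalize(article).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(article)
    return unique


# Hypothetical sample documents; the second differs only in case and spacing.
docs = [
    "Budapest ma esős.",
    "budapest  ma   esős.",
    "Debrecenben süt a nap.",
]
unique_docs = deduplicate(docs)
```

Hashing the normalized text keeps memory use proportional to the number of distinct documents rather than their total length. Near-duplicates, such as articles that differ by a few words, would need a fuzzier technique than this sketch provides.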