The GINCO Training Dataset for Web Genre Identification of Documents Out in the Wild. (arXiv:2201.03857v1 [cs.CL])
Jan. 12, 2022, 2:10 a.m. | Taja Kuzman, Peter Rupnik, Nikola Ljubešić
cs.CL updates on arXiv.org
This paper presents GINCO, a new training dataset for automatic genre identification
based on 1,125 crawled Slovenian web documents that consist of
650 thousand words. Each document was manually annotated for genre with a new
annotation schema that builds upon existing schemata, designed primarily for clarity
of labels and inter-annotator agreement. The dataset includes
various challenges related to web-based data, such as machine-translated
content, encoding errors, and multiple contents presented in one document,
enabling evaluation …