Feb. 26, 2024, 5:48 a.m. | Emily Reif, Crystal Qian, James Wexler, Minsuk Kahng

cs.CL updates on arXiv.org arxiv.org

arXiv:2402.14880v1 Announce Type: new
Abstract: Making sense of unstructured text datasets is perennially difficult, yet increasingly relevant with Large Language Models. Data workers often rely on dataset summaries, especially distributions of various derived features. Some features, like toxicity or topics, are relevant to many datasets, but many interesting features are domain specific: instruments and genres for a music dataset, or diseases and symptoms for a medical dataset. Accordingly, data workers often run custom analyses for each dataset, which is cumbersome …

abstract arxiv cs.ai cs.cl cs.hc data dataset datasets domain exploration features histograms language language models large language large language models making sense text topics toxicity type unstructured workers

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

Risk Management - Machine Learning and Model Delivery Services, Product Associate - Senior Associate-

@ JPMorgan Chase & Co. | Wilmington, DE, United States

Senior ML Engineer (Speech/ASR)

@ ObserveAI | Bengaluru