June 7, 2024, 4:43 a.m. | Ryumei Nakada, Yichen Xu, Lexin Li, Linjun Zhang

cs.LG updates on arXiv.org

arXiv:2406.03628v1 Announce Type: cross
Abstract: Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges. In this article, we introduce OPAL (OversamPling with Artificial LLM-generated data), a systematic oversampling approach that leverages the capabilities of large language models (LLMs) to generate high-quality synthetic data for minority groups. Recent studies on synthetic data generation using deep generative …
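
To make the notion of oversampling concrete, here is a minimal sketch of the simplest baseline, random oversampling by duplication. It is not the OPAL method: OPAL generates new synthetic rows with an LLM, whereas this sketch only repeats existing minority rows. The function name `random_oversample` and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows that currently belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement to fill the gap to the majority size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (assumed): four rows of class "a", one row of class "b".
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["a", "a", "a", "a", "b"]

new_rows, new_labels = random_oversample(rows, labels)
balanced_counts = Counter(new_labels)
```

After the call, both classes have four rows each. Because the added rows are exact copies, this baseline adds no new information about the minority group, which is the gap that generative approaches aim to fill.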


