Feb. 6, 2024, 5:42 a.m. | Peiqi Wang, Yikang Shen, Zhen Guo, Matthew Stallone, Yoon Kim, Polina Golland, Rameswar Panda

cs.LG updates on arXiv.org

We aim to select data subsets for fine-tuning large language models so that they follow instructions more effectively. Prior work has emphasized the importance of diversity in dataset curation but has relied on heuristics such as the number of tasks. In this paper, we use determinantal point processes to capture the diversity and quality of instruction tuning datasets for subset selection. We propose to measure dataset diversity with the log determinant distance, which is the distance between the dataset of interest and …
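
As a rough illustration of how a determinantal point process can trade off quality against diversity when picking a subset, here is a minimal sketch. It is not the paper's method: the function name greedy_logdet_select, the cosine-similarity kernel, the per-example quality scores, and the greedy search are all assumptions chosen for illustration, using the standard quality-diversity kernel L_ij = q_i * S_ij * q_j and greedy maximization of the log determinant.

```python
import numpy as np


def greedy_logdet_select(features, quality, k):
    """Greedily pick up to k rows that maximize log det of the kernel submatrix.

    Hypothetical helper for illustration only; not the authors' implementation.
    """
    features = np.asarray(features, dtype=float)
    quality = np.asarray(quality, dtype=float)
    # Normalize rows so that the similarity matrix holds cosine similarities.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # Quality-diversity decomposition of the DPP kernel: L_ij = q_i * S_ij * q_j.
    kernel = quality[:, None] * similarity * quality[None, :]

    selected = []
    for _ in range(k):
        best_idx, best_val = None, -np.inf
        for i in range(len(features)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # Only accept candidates that keep the submatrix positive definite.
            if sign > 0 and logdet > best_val:
                best_idx, best_val = i, logdet
        if best_idx is None:
            break  # every remaining candidate is redundant with the selection
        selected.append(best_idx)
    return selected


# Toy data (made up): rows 0 and 1 are near-duplicates, row 2 is orthogonal.
toy_features = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
toy_quality = [1.0, 0.9, 0.8]
chosen = greedy_logdet_select(toy_features, toy_quality, k=2)
print(chosen)
```

On this toy input the near-duplicate row is skipped in favor of the lower-quality but dissimilar row, which is the behavior the determinant rewards. The greedy loop recomputes a determinant per candidate and is meant for small examples, not for datasets of realistic size.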
