March 25, 2024, 4:42 a.m. | Jonathan Katzy, Răzvan-Mihai Popescu, Arie van Deursen, Maliheh Izadi

cs.LG updates on arXiv.org

arXiv:2403.15230v1 Announce Type: cross
Abstract: Does the training of large language models potentially infringe upon code licenses? Furthermore, are there any datasets available that can be safely used for training these models without violating such licenses? In our study, we assess the current trends in the field and the importance of incorporating code into the training of large language models. Additionally, we examine publicly available datasets to see whether these models can be trained on them without the risk of …

