Public AI Training Datasets Are Rife With Licensing Errors
IEEE Spectrum (spectrum.ieee.org)
Large language models feed on big data from publicly available training sets, but most of the sets are of doubtful legal status.
The scope of the problem has been demonstrated by the newly launched Data Provenance Initiative, which brings together a multi-institutional team of machine-learning and legal experts led by researchers at the Massachusetts Institute of Technology and Cohere for AI, a nonprofit research lab created by the AI company Cohere.
The group audited more than 1,800 …