Nov. 8, 2023, 2 p.m. | Edd Gent

IEEE Spectrum spectrum.ieee.org



Large language models feed on big data from publicly available training sets, but the legal status of most of those sets is doubtful.

The scope of the problem has been demonstrated by the newly launched Data Provenance Initiative, which brings together a multi-institutional team of machine-learning and legal experts led by researchers at the Massachusetts Institute of Technology and Cohere for AI, a nonprofit research lab created by the AI company Cohere.

The group audited more than 1,800 …
