April 20, 2023, 6:49 p.m. | Alex Hern, UK technology editor

Artificial intelligence (AI) | The Guardian www.theguardian.com

Investigations reveal limited efforts to ‘clean’ datasets of fascist, pirated and malicious material

Fresh fears have been raised about the training material used for some of the largest and most powerful artificial intelligence models, after several investigations exposed the fascist, pirated and malicious sources from which the data is harvested.

One such dataset is the Colossal Clean Crawled Corpus, or C4, assembled by Google from more than 15m websites and used to train both the search engine’s LaMDA AI as …
