March 22, 2024, 5:14 p.m. | Simon Grah

Towards Data Science - Medium | towardsdatascience.com

This article is the second in a series about cleaning data using Large Language Models (LLMs), with a focus on identifying errors in tabular data sets.

The sketch outlines the methodology explored in this article: evaluating the Data Dirtiness Score of a tabular data set with minimal human involvement.

The Data Dirtiness Score

Readers are encouraged to first review the introductory article on the Data Dirtiness Score, which explains the key assumptions and demonstrates how …
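The score can be read, roughly, as the share of table cells flagged with at least one quality issue. Below is a minimal sketch of that idea in Python; the data frame and the list of flagged cells are made up for illustration, standing in for the output of an LLM-based error detector like the one discussed in this series.

```python
import pandas as pd

# Toy table with a few obvious problems (hypothetical data, not from the article).
df = pd.DataFrame({
    "age": [25, -3, 41, None],
    "country": ["FR", "France", "DE", "US"],
})

# Hypothetical (row, column) cells flagged as erroneous by an error detector.
flagged_cells = [(1, "age"), (3, "age"), (1, "country")]

# Dirtiness score: proportion of flagged cells among all cells in the table.
total_cells = df.shape[0] * df.shape[1]
dirtiness_score = len(set(flagged_cells)) / total_cells
print(f"Data Dirtiness Score: {dirtiness_score:.2f}")  # 3 / 8 = 0.38
```

A score of 0 would mean no cell was flagged, while 1 would mean every cell carries at least one issue; the interesting part, covered in the article, is producing the flagged-cell list automatically with an LLM rather than by hand.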

