March 2, 2024, 5:09 p.m. | Simon Grah

Towards Data Science - Medium towardsdatascience.com

New method to measure tabular dataset quality

This article, the first in a series on data cleaning practices involving Large Language Models (LLMs), focuses on quantifying the cleanliness or dirtiness of a dataset.

Photo by Fabrizio Conti on Unsplash

Starting with the Why

This article introduces a concept for evaluating the dirtiness of a dataset, a topic that presents challenges due to the lack of a tangible score or loss function related to data cleaning. The primary objective here is to …

