Feb. 12, 2024, 5:43 a.m. | Riccardo Cappuzzo (SODA Team - Inria Saclay), Gael Varoquaux (SODA Team - Inria Saclay), Aimee Coelho (Dataiku), Paolo Papotti (EURECOM)

cs.LG updates on arXiv.org arxiv.org

We present an in-depth analysis of data discovery in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset we developed as a tool for benchmarking this data discovery task -- and Open Data US, a well-referenced real data lake. Through systematic exploration …
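To make the three steps concrete, here is a minimal Python sketch of a retrieve, merge, predict pipeline on toy data. It is not the paper's implementation: the containment score used for retrieval, the one-feature least-squares model, the function names (`containment`, `retrieve_joinable`, `left_join`, `fit_line`), and the example tables are all assumptions chosen for illustration.

```python
from statistics import mean

def containment(base_keys, cand_values):
    """Fraction of distinct base keys that also appear in the candidate column."""
    base = set(base_keys)
    return len(base & set(cand_values)) / len(base) if base else 0.0

def retrieve_joinable(base, key, lake, threshold=0.5):
    """Step 1 (assumed measure): keep candidate columns whose containment >= threshold."""
    base_keys = [row[key] for row in base]
    hits = []
    for name, table in lake.items():
        columns = table[0].keys() if table else ()
        for col in columns:
            score = containment(base_keys, [r.get(col) for r in table])
            if score >= threshold:
                hits.append((name, col, score))
    return sorted(hits, key=lambda h: -h[2])

def left_join(base, key, table, col):
    """Step 2: left join that keeps every base row and the first match per key.
    Candidate columns that share a name with a base column overwrite it."""
    lookup = {}
    for r in table:
        lookup.setdefault(r[col], r)
    merged = []
    for row in base:
        match = lookup.get(row[key], {})
        extra = {k: v for k, v in match.items() if k != col}
        merged.append({**row, **extra})
    return merged

def fit_line(xs, ys):
    """Step 3 (stand-in model): ordinary least squares with a single feature."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical base table and data lake, invented for this example.
base = [
    {"city": "Paris", "rent": 30.0},
    {"city": "Lyon", "rent": 15.0},
    {"city": "Nice", "rent": 20.0},
]
lake = {
    "population": [
        {"name": "Paris", "pop": 2.1},
        {"name": "Lyon", "pop": 0.5},
        {"name": "Nice", "pop": 0.3},
    ],
    "weather": [{"station": "Oslo", "temp": 5.0}],
}

hits = retrieve_joinable(base, "city", lake)
best_table, best_col, _ = hits[0]
merged = left_join(base, "city", lake[best_table], best_col)

# Fit only on rows where the joined feature is present.
rows = [r for r in merged if "pop" in r]
xs = [r["pop"] for r in rows]
ys = [r["rent"] for r in rows]
slope, intercept = fit_line(xs, ys)
```

In this sketch only the `population` table is retrieved, because its `name` column contains every key of the base table, while the `weather` table shares none. A real data lake would need an approximate index rather than the exhaustive column scan used here.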

