Feb. 12, 2024, 5:43 a.m. | Riccardo Cappuzzo (SODA Team - Inria Saclay), Gael Varoquaux (SODA Team - Inria Saclay), Aimee Coelho (Dataiku), Paolo Papotti (EURECOM)

cs.LG updates on arXiv.org

We present an in-depth analysis of data discovery in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset we developed as a tool for benchmarking this data discovery task -- and Open Data US, a well-referenced real data lake. Through systematic exploration …
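The three steps named in the abstract (retrieving joinable tables, merging information, predicting with the resulting table) can be illustrated with a toy sketch. This is not the authors' method or code: the containment score, the first-match left join, the one-feature least-squares fit, and every table, column, and function name below are assumptions chosen only to make the pipeline concrete.

```python
# Toy sketch of the three steps; all names and data here are hypothetical.

def containment(base_values, candidate_values):
    """Fraction of distinct base key values that also appear in the candidate column."""
    base = set(base_values)
    if not base:
        return 0.0
    return len(base & set(candidate_values)) / len(base)


def retrieve_joinable(base_keys, lake, threshold=0.5):
    """Step 1: rank (table, column) pairs in the lake by containment of the base keys."""
    scored = []
    for table_name, rows in lake.items():
        columns = rows[0].keys() if rows else []
        for column in columns:
            score = containment(base_keys, [row[column] for row in rows])
            if score >= threshold:
                scored.append((score, table_name, column))
    return sorted(scored, reverse=True)


def left_join(base_rows, key, candidate_rows, candidate_key):
    """Step 2: left join that keeps every base row; first match wins, misses get None."""
    lookup = {}
    for row in candidate_rows:
        lookup.setdefault(row[candidate_key], row)
    new_columns = [c for c in candidate_rows[0] if c != candidate_key]
    merged = []
    for row in base_rows:
        match = lookup.get(row[key])
        extra = {c: (match[c] if match else None) for c in new_columns}
        merged.append({**row, **extra})
    return merged


def fit_line(xs, ys):
    """Step 3: ordinary least squares on one feature, standing in for a real learner."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical base table (with the prediction target "rent") and a tiny data lake.
base = [
    {"city": "Paris", "rent": 30.0},
    {"city": "Lyon", "rent": 16.0},
    {"city": "Nice", "rent": 20.0},
]
lake = {
    "population": [
        {"name": "Paris", "density": 20.0},
        {"name": "Lyon", "density": 10.0},
        {"name": "Lille", "density": 6.0},
    ],
    "weather": [{"station": "Oslo", "temp": 5.0}],
}

candidates = retrieve_joinable([row["city"] for row in base], lake)
_, best_table, best_column = candidates[0]
augmented = left_join(base, "city", lake[best_table], best_column)

# Train only on rows where the join found a match.
train = [row for row in augmented if row["density"] is not None]
slope, intercept = fit_line([r["density"] for r in train], [r["rent"] for r in train])
```

In this sketch, only the `name` column of the hypothetical `population` table passes the containment threshold, and the base row with no match keeps a missing value rather than being dropped, which is why the join is a left join.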

