all AI news
Reproducible data science over data lakes: replayable data pipelines with Bauplan and Nessie
April 23, 2024, 4:43 a.m. | Jacopo Tagliabue, Ciro Greco
cs.LG updates on arXiv.org arxiv.org
Abstract: As the Lakehouse architecture becomes more widespread, ensuring the reproducibility of data workloads over data lakes emerges as a crucial concern for data engineers. However, achieving reproducibility remains challenging. The size of data pipelines contributes to slow testing and iterations, while the intertwining of business logic and data management complicates debugging and increases error susceptibility. In this paper, we highlight recent advancements made at Bauplan in addressing this challenge. We introduce a system designed to …
abstract architecture arxiv cs.db cs.lg data data engineers data lakes data pipelines data science data workloads engineers however lakehouse lakehouse architecture pipelines reproducibility science testing type workloads
More from arxiv.org / cs.LG updates on arXiv.org
Jobs in AI, ML, Big Data
Software Engineer for AI Training Data (School Specific)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Python)
@ G2i Inc | Remote
Software Engineer for AI Training Data (Tier 2)
@ G2i Inc | Remote
Data Engineer
@ Lemon.io | Remote: Europe, LATAM, Canada, UK, Asia, Oceania
Artificial Intelligence – Bioinformatic Expert
@ University of Texas Medical Branch | Galveston, TX
Lead Developer (AI)
@ Cere Network | San Francisco, US