Nov. 8, 2023, 1:56 p.m. | /u/mrocklin

Data Science www.reddit.com

I gave this talk at PyData NYC last week. It was fun working with devs from various projects (Dask, Arrow, Polars, Spark) in the week leading up to the event. Thought I'd share a re-recording of it here

[https://youtu.be/wKH0-zs2g\_U](https://youtu.be/wKH0-zs2g_U)

This is the result of a couple weeks of work comparing large data frameworks on benchmarks ranging in size 10GB to 10TB. No project wins. It's really interesting analyzing results though.

DuckDB and Dask are the only projects that reliably finish …

arrow benchmarks dask data datascience devs duckdb event frameworks fun nyc projects recording scale spark talk thought work

AI Research Scientist

@ Vara | Berlin, Germany and Remote

Data Architect

@ University of Texas at Austin | Austin, TX

Data ETL Engineer

@ University of Texas at Austin | Austin, TX

Lead GNSS Data Scientist

@ Lurra Systems | Melbourne

Senior Machine Learning Engineer (MLOps)

@ Promaton | Remote, Europe

Data Analyst (Digital Business Analyst)

@ Activate Interactive Pte Ltd | Singapore, Central Singapore, Singapore