May 23, 2024, 11:22 a.m. | /u/mrocklin

Data Science www.reddit.com

I hit publish on a blogpost last week on running Spark, Dask, DuckDB, and Polars on the TPC-H benchmark across a variety of scales (10 GiB, 100 GiB, 1 TiB, 10 TiB), both locally on a Macbook Pro and on the cloud.  It’s a broad set of configurations.  The results are interesting.

No project wins uniformly.  They all perform differently at different scales: 

* DuckDB and Polars are crazy fast on local machines
* Dask and DuckDB seem to win …

benchmark benchmarks cloud dask datascience duckdb macbook results running set spark

AI Focused Biochemistry Postdoctoral Fellow

@ Lawrence Berkeley National Lab | Berkeley, CA

Senior Data Engineer

@ Displate | Warsaw

Solutions Engineer

@ Stability AI | United States

Lead BizOps Engineer

@ Mastercard | O'Fallon, Missouri (Main Campus)

Senior Solution Architect

@ Cognite | Kuala Lumpur

Senior Front-end Engineer

@ Cognite | Bengaluru