June 2, 2023, 1:38 p.m. | /u/100GB-CSV

Data Science www.reddit.com

Testing Data: 300 Million Rows x 14 Columns

Testing Functions: Read Parquet File => Filter => Group By => Write CSV File

Results: DuckDB 3.9s, Polars 16.2s

Testing machine: 8 cores and 32 GB memory

====================================================

import duckdb
import time

s = time.time()

con = duckdb.connect()

# Read the Parquet file, filter on the Ledger range, group/aggregate,
# and write the result straight to CSV in a single SQL statement.
con.execute("""copy (SELECT Ledger, Account, DC, Currency, SUM(Base_Amount) as Total_Base_Amount
FROM read_parquet('input/300-MillionRows.parquet')
WHERE Ledger >= 'L30' AND Ledger <= 'L70'
GROUP BY Ledger, Account, DC, Currency)
to 'output/DuckFilterGroupByParquet.csv' (format csv, header true);""")

e = time.time()

print("DuckDB FilterGroupBy Parquet Time =", round(e - s, 3))

