Aug. 11, 2023, 5:37 p.m. | Gianpi Colonna

Towards Data Science – Medium (towardsdatascience.com)

Optimising Output File Size in Apache Spark

A Comprehensive Guide on Managing Partitions, Repartition, and Coalesce Operations

Photo by zhao chen on Unsplash

Picture yourself at the helm of a large Spark data processing operation. One often-mentioned rule of thumb in Spark optimisation discourse is that, for the best I/O performance and enhanced parallelism, each data file should hover around 128 MB, which is Spark's default partition size when reading a file [1].
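As a rough illustration of that rule of thumb (the 2 GB dataset size and the helper name below are hypothetical, not from the article), the number of output files needed to hit the ~128 MB target can be computed up front and then passed to `repartition()` or `coalesce()` before writing:

```python
import math

def target_partition_count(total_size_bytes: int,
                           target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Partitions needed so each output file lands near the ~128 MB target."""
    return max(1, math.ceil(total_size_bytes / target_file_bytes))

# Example: a 2 GB dataset would be written as 16 files of ~128 MB each.
n = target_partition_count(2 * 1024 ** 3)
print(n)  # 16

# In Spark, this count would then drive the write, e.g.:
#   df.repartition(n).write.parquet(path)  # full shuffle, evenly sized files
#   df.coalesce(n).write.parquet(path)     # avoids a full shuffle, may skew
```

The trade-off hinted at in the comments is the article's topic: `repartition` triggers a full shuffle but balances file sizes, while `coalesce` is cheaper but can leave partitions uneven.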

Imagine your files as …

