Aug. 11, 2023, 5:37 p.m. | Gianpi Colonna

Towards Data Science (Medium), towardsdatascience.com

Optimising Output File Size in Apache Spark

A Comprehensive Guide on Managing Partitions, Repartition, and Coalesce Operations

Photo by zhao chen on Unsplash

Picture yourself at the helm of a large Spark data processing operation. One often-cited rule of thumb in Spark optimisation is that, for the best I/O performance and enhanced parallelism, each output file should hover around 128 MB, which is Spark's default partition size when reading a file [1].

Imagine your files as …
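The 128 MB rule of thumb above translates into a simple calculation: divide the estimated dataset size by the target file size to get the number of output partitions. A minimal sketch in Python (the helper name, the example sizes, and the commented-out `df`/`path` usage are illustrative, not from the article):

```python
import math

# Target size per output file: ~128 MB, Spark's default
# partition size when reading a file.
TARGET_FILE_BYTES = 128 * 1024 * 1024

def target_partition_count(total_bytes: int) -> int:
    """Estimate how many partitions keep each output file near 128 MB."""
    return max(1, math.ceil(total_bytes / TARGET_FILE_BYTES))

# A 10 GiB dataset would be written as ~80 files of ~128 MB each.
print(target_partition_count(10 * 1024**3))  # 80

# With an actual DataFrame you would then apply the count before writing,
# e.g. (hypothetical df and path):
#   df.repartition(target_partition_count(estimated_bytes)).write.parquet(path)
# or, to only reduce the partition count without a full shuffle:
#   df.coalesce(target_partition_count(estimated_bytes)).write.parquet(path)
```

`repartition` triggers a full shuffle and can both grow and shrink the partition count, while `coalesce` avoids the shuffle but can only merge existing partitions into fewer ones.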

