May 11, 2022, 2 p.m. | Michael Berk

Towards Data Science - Medium towardsdatascience.com

Exactly what you need, and no more

Photo by John Bakator on Unsplash

There are lots of overly complex posts about data skew, a deceptively simple topic. In this post, we will cover the necessary basics in 5 minutes.

The primary source for this post was Spark: The Definitive Guide and here’s the code.

Let’s dive in…

What is Data Skew?

In Spark, data are split into chunks of rows, called partitions, which are then stored on worker nodes, as shown in figure 1.

Figure …
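To make the idea of "chunks of rows" concrete, here is a minimal plain-Python sketch, not Spark itself. It assigns each row to one of a few partitions based on its key and counts the rows that land in each. The helper names (`assign_partition`, `rows_per_partition`), the toy hash, and the `user_*` keys are all assumptions made up for illustration; Spark's real partitioner works differently in detail.

```python
from collections import Counter


def assign_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index.

    Toy stand-in for hash partitioning (an assumption for this sketch,
    not Spark's actual hash function). It is deterministic, so every
    row with the same key lands in the same partition.
    """
    return sum(key.encode("utf-8")) % num_partitions


def rows_per_partition(keys, num_partitions):
    """Count how many rows end up in each partition."""
    counts = Counter(assign_partition(k, num_partitions) for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]


NUM_PARTITIONS = 4  # hypothetical number of partitions

# Hypothetical data: 1,000 rows, each with a distinct key.
balanced_keys = [f"user_{i}" for i in range(1000)]

# Hypothetical data: 1,000 rows, 900 of which share a single key.
skewed_keys = ["user_0"] * 900 + [f"user_{i}" for i in range(1, 101)]

balanced = rows_per_partition(balanced_keys, NUM_PARTITIONS)
skewed = rows_per_partition(skewed_keys, NUM_PARTITIONS)

print("rows per partition (distinct keys):", balanced)
print("rows per partition (one hot key):  ", skewed)
```

In the second dataset, all 900 rows with the repeated key fall into the same partition, so one chunk is far larger than the others. Unevenly sized partitions like this are what is generally meant by data skew. On a real PySpark DataFrame, a comparable per-partition count can be obtained by grouping on `pyspark.sql.functions.spark_partition_id()`.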
