Web: https://towardsdatascience.com/data-skew-in-pyspark-783d529a9dd7?source=rss----7f60cf5620c9---4

May 11, 2022, 2 p.m. | Michael Berk

Towards Data Science - Medium towardsdatascience.com

Exactly what you need, and no more

Photo by John Bakator on Unsplash

There are lots of overly complex posts about data skew, a deceptively simple topic. In this post, we will cover the necessary basics in 5 minutes.

The primary source for this post was Spark: The Definitive Guide and here’s the code.

Let’s dive in…

What is Data Skew?

In Spark, data are split into chunks of rows, called partitions, which are then stored on worker nodes, as shown in Figure 1.

Figure …
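To make the idea of chunks concrete, here is a minimal pure-Python sketch that mimics how rows might be assigned to partitions by key. It is a simulation only, not Spark's actual partitioner: the helper names (`split_into_partitions`, `partition_sizes`), the integer keys, and the modulo rule are all assumptions made for illustration.

```python
def split_into_partitions(rows, num_partitions):
    """Assign each (key, value) row to a partition using key % num_partitions.

    Hypothetical stand-in for a hash partitioner; Spark's real logic differs.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in rows:
        partitions[key % num_partitions].append((key, value))
    return partitions


def partition_sizes(rows, num_partitions):
    """Return the number of rows that land in each partition."""
    return [len(p) for p in split_into_partitions(rows, num_partitions)]


# Evenly distributed keys: every partition receives the same number of rows.
even_rows = [(key, "row") for key in range(12)]

# Skewed keys: most rows share key 0, so one partition holds most of the data.
skewed_rows = [(0, "row")] * 9 + [(1, "row"), (2, "row"), (3, "row")]

print(partition_sizes(even_rows, 4))    # [3, 3, 3, 3]
print(partition_sizes(skewed_rows, 4))  # [9, 1, 1, 1]
```

In the second case, whichever worker holds the first partition has nine times as many rows as the others, which is the kind of imbalance the term "skew" refers to.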


