Feb. 13, 2024, 1:54 p.m. | Abhijith C

Towards AI - Medium pub.towardsai.net

Optimize Spark plans using deterministic and non-deterministic UDFs

Originally published on my blog.

When processing big data, efficiency is key. Long debugging cycles are common when working with Spark, and I was recently caught in one when a pipeline of mine was taking longer than expected. It was a simple Structured Streaming pipeline that listened to a Kafka topic for events and performed some …
