Minimizing the soul-sucking part of working with Spark, one bug at a time

PySpark debugging pt. 2

Maria Karanasou
5 min read · Jan 5, 2023


Photo by 傅甬 华 on Unsplash

1. Invalid log directory

Error: invalid log directory /usr/local/spark/work/app-20201203195256-0001/2/

This one can be a bit tricky. Check your firewall rules and make sure that all of your nodes can reach each other and the storage being used: the master must be able to see the workers and vice versa.
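A quick way to test the "master sees workers and vice versa" part is to try opening a TCP connection from each node to the others. The following is only a minimal sketch using the Python standard library. The host names are hypothetical, and the ports (7077 for the standalone master, 8081 for the worker web UI) are Spark's defaults, so adjust them if your cluster is configured differently.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures
        return False


def check_nodes(targets, timeout=3.0):
    """Map each (host, port) pair to whether it is reachable from this node."""
    return {(host, port): can_connect(host, port, timeout) for host, port in targets}


# Example call, with hypothetical host names and default standalone ports:
# check_nodes([("spark-master", 7077), ("spark-worker-1", 8081)])
```

Run it on the master against the workers, and on each worker against the master. A `False` for a host you expect to be up usually points to a firewall rule or a wrong host name or port, though it cannot tell you which.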

2. Another ClassNotFoundException — Kafka this time:

org.apache.spark.streaming.kafka.KafkaRDDPartition

You may have all of your jars in place and included in spark-submit, and still get this error.

Unfortunately, not all jar versions work well together. If you are on PySpark 2.4.7, try the 2.4.0 jar.
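To make the jar version explicit, and easy to change while you experiment, you can build the spark-submit command from a pinned version. This is a sketch under a few assumptions that are not from the setup above: that the missing class comes from the `spark-streaming-kafka-0-8` integration built for Scala 2.11, that you pull it with `--packages` instead of a local jar, and that your job lives in a hypothetical `my_job.py`.

```python
def kafka_package(spark_version, scala_version="2.11"):
    """Build the Maven coordinate for the Kafka 0.8 streaming integration.

    Assumption: KafkaRDDPartition is provided by spark-streaming-kafka-0-8.
    """
    return "org.apache.spark:spark-streaming-kafka-0-8_{}:{}".format(
        scala_version, spark_version
    )


def spark_submit_command(script, spark_version):
    """Return the spark-submit argument list with the pinned Kafka package."""
    return ["spark-submit", "--packages", kafka_package(spark_version), script]


# my_job.py is a hypothetical script name; 2.4.0 is the jar version to try
command = spark_submit_command("my_job.py", "2.4.0")
print(" ".join(command))
```

Keeping the version in one place means that trying a different jar is a one-argument change, not a hunt through shell scripts.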


3. OOM because of broadcast joins in PySpark < 3.0.0

Broadcast join:
an optimization technique that is used to join two
