Minimizing the soul-sucking part of working with Spark, one bug at a time
PySpark debugging pt. 2
1. Invalid log directory
Error: invalid log directory /usr/local/spark/work/app-20201203195256-0001/2/
This one can be a bit tricky. Check your firewall rules and make sure all of your nodes have access to the storage being used — in other words, the master can reach the workers and vice versa.
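As a quick sanity check of that reachability, you can probe the relevant TCP ports from each node. A minimal sketch in plain Python (the hostnames and ports in the comment are hypothetical — substitute your own; 7077 is only the *default* Spark master port):

```python
import socket

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, DNS failures and timeouts
        return False

# Hypothetical cluster layout — replace with your own hosts/ports:
# for host, port in [("spark-master", 7077), ("worker-1", 8081)]:
#     print(host, port, is_reachable(host, port))
```

Run it from the master against every worker and from each worker back to the master; any `False` points at a firewall rule or routing problem rather than Spark itself.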
2. Another ClassNotFoundException — Kafka this time:
org.apache.spark.streaming.kafka.KafkaRDDPartition
Your jars may all be in place and included in spark-submit, yet you still get this error. Unfortunately, not every jar version plays well with every PySpark version: if you are on pyspark 2.4.7, try the 2.4.0 jar.
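For illustration, a spark-submit invocation pinning the connector version could look like the sketch below. This assumes the 0-8 DStream connector (the package that contains `org.apache.spark.streaming.kafka.KafkaRDDPartition`); the script name is hypothetical:

```shell
# Pin the Kafka integration jar to 2.4.0 even though pyspark itself is 2.4.7.
spark-submit \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.4.0 \
  your_streaming_job.py
```

`--packages` pulls the jar (and its transitive dependencies) from Maven Central, which is usually less error-prone than shipping jars by hand with `--jars`.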
3. OOM because of broadcast joins in PySpark < 3.0.0
Broadcast join:
an optimization technique that is used to join two…
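To make the idea concrete, here is a minimal plain-Python sketch of what a broadcast (map-side hash) join does conceptually: the small table is turned into an in-memory lookup table that every worker gets a copy of, and the large table is streamed against it with no shuffle. The table names and keys are made up for illustration. The OOM risk follows directly from this picture: if the "small" side is not actually small, each executor must hold the whole lookup table in memory.

```python
def broadcast_hash_join(small, large, key_small, key_large):
    """Hash-join two lists of dicts: build a lookup from the small side,
    then stream the large side against it (no sort, no shuffle)."""
    lookup = {}
    for row in small:
        lookup.setdefault(row[key_small], []).append(row)
    out = []
    for row in large:
        for match in lookup.get(row[key_large], []):
            out.append({**match, **row})  # merged joined row
    return out

# Hypothetical tables:
users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
events = [{"uid": 1, "event": "click"}, {"uid": 2, "event": "view"},
          {"uid": 1, "event": "buy"}]
print(broadcast_hash_join(users, events, "uid", "uid"))
```

If the planner is picking a broadcast join for a side that is too big to fit in executor memory, you can turn automatic broadcasting off by setting `spark.sql.autoBroadcastJoinThreshold` to `-1`, which forces a shuffle-based join instead.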