Databricks jobs are often slow for five core reasons!
1️⃣ 𝗗𝗮𝘁𝗮 𝗦𝗸𝗲𝘄 (𝗛𝗼𝘁 𝗞𝗲𝘆𝘀)
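When a few tasks take hours while most finish in minutes, you have skewed partitions. Check Spark UI > Stages and compare max vs. median task duration in the summary metrics. Fix: Salt the hot keys, broadcast the small side of the join, or let AQE's skew join handling (spark.sql.adaptive.skewJoin.enabled) split the oversized partitions.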
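A minimal sketch of salting in plain Python (no Spark needed), assuming a hypothetical hot key "US" and 4 salt buckets; all function names here are illustrative. The skewed side gets a random salt added to each key, and the small side is replicated once per salt so every salted key still finds its match.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumption: number of buckets to spread each hot key across


def salt_large_side(rows, num_salts=NUM_SALTS, seed=0):
    """Attach a random salt to the key of each (key, value) row on the skewed side."""
    rng = random.Random(seed)
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts=NUM_SALTS):
    """Replicate each (key, value) row on the small side once per salt."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def join(left, right):
    """Plain hash join on the salted key, standing in for the Spark join."""
    lookup = dict(right)
    return [(key[0], lval, lookup[key]) for key, lval in left if key in lookup]


# Hypothetical data: "US" is the hot key with 1,000 rows; "NZ" has 10.
events = [("US", i) for i in range(1000)] + [("NZ", i) for i in range(10)]
countries = [("US", "United States"), ("NZ", "New Zealand")]

salted_events = salt_large_side(events)
salted_countries = explode_small_side(countries)
joined = join(salted_events, salted_countries)

# Rows per salted key: the hot key is now split across several smaller groups.
rows_per_key = Counter(key for key, _ in salted_events)
largest_group = max(rows_per_key.values())
```

In PySpark the same idea is usually expressed by adding a salt column with rand() on the large DataFrame and cross-joining the small one against a range of salt values. The trade-off is that the small side grows by a factor of NUM_SALTS.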
2️⃣ 𝗠𝗲𝗺𝗼𝗿𝘆 𝗦𝗽𝗶𝗹𝗹 𝘁𝗼 𝗗𝗶𝘀𝗸
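When a task's data doesn't fit in execution memory, Spark spills it to disk, and that kills performance. Check Spark UI > Stages for Spill (Memory) and Spill (Disk) values greater than zero. Fix: Increase spark.executor.memory, use fewer cores per executor so each task gets a larger share, or raise spark.sql.shuffle.partitions so each task handles less data. Photon can help too, but it's not a guaranteed fix. Aim for zero spill.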
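Rough arithmetic on why fewer cores per executor reduces spill, as a sketch only. It assumes Spark's unified memory model with its defaults (300 MB reserved, spark.memory.fraction = 0.6) and ignores off-heap memory, overhead, and Photon; the function name and the 16 GB executor are hypothetical.

```python
RESERVED_MB = 300       # Spark's fixed reserved heap in the unified memory model
MEMORY_FRACTION = 0.6   # default spark.memory.fraction


def unified_memory_per_task_mb(executor_memory_mb, executor_cores,
                               memory_fraction=MEMORY_FRACTION):
    """Approximate execution+storage memory share per concurrent task."""
    usable = (executor_memory_mb - RESERVED_MB) * memory_fraction
    return usable / executor_cores


# Hypothetical executor with a 16 GB heap.
with_8_cores = unified_memory_per_task_mb(16 * 1024, 8)
with_4_cores = unified_memory_per_task_mb(16 * 1024, 4)
print(f"8 cores: ~{with_8_cores:.0f} MB per task")
print(f"4 cores: ~{with_4_cores:.0f} MB per task")
```

Halving the cores doubles each task's share but also halves parallelism on that executor, so treat this as a starting point for experiments rather than a rule.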
3️⃣ 𝗪𝗿𝗼𝗻𝗴 𝗖𝗹𝘂𝘀𝘁𝗲𝗿 𝗦𝗶𝘇𝗶𝗻𝗴
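Undersized clusters run forever; oversized ones waste money. Spark UI > Executors shows task time and GC time but not CPU, so check the cluster's Metrics tab for CPU utilization. Mostly idle CPU means too big; CPU pinned near 100% means too small. Fix: Resize the cluster or turn on autoscaling.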
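The idle-vs-maxed rule of thumb can be written as a tiny check. The 30% and 85% thresholds and the function name are assumptions for illustration, not Databricks guidance; the samples would come from whatever CPU metrics you export.

```python
from statistics import mean

# Assumed thresholds for illustration; tune them for your own workloads.
IDLE_BELOW = 0.30   # average CPU utilization under 30% suggests oversizing
BUSY_ABOVE = 0.85   # average CPU utilization over 85% suggests undersizing


def sizing_hint(cpu_samples):
    """Classify a run from CPU utilization samples in the range 0.0 to 1.0."""
    avg = mean(cpu_samples)
    if avg < IDLE_BELOW:
        return "oversized"
    if avg > BUSY_ABOVE:
        return "undersized"
    return "reasonable"
```

An average hides bursts, so a job that alternates between idle and saturated phases may need a per-stage look instead.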
4️⃣ 𝗜𝗻𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁 𝗝𝗼𝗶𝗻𝘀 & 𝗦𝗵𝘂𝗳𝗳𝗹𝗲𝘀
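Massive shuffles from sort-merge joins bottleneck performance. Check the join strategy in Spark UI > SQL tab (SortMergeJoin vs. BroadcastHashJoin). Fix: Broadcast the small side when it comfortably fits in driver and executor memory. The default spark.sql.autoBroadcastJoinThreshold is only 10MB, so larger tables need a broadcast hint or a higher threshold; treat 1GB as an upper bound, not a target. Keep AQE on (spark.sql.adaptive.enabled=true, already the default on recent runtimes).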
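A sketch of how these join settings might be applied from a notebook. The config keys are real Spark SQL options, but the 100 MB threshold and the helper names are assumptions to adapt; broadcast_join needs pyspark installed and is only defined here, not called.

```python
# Real Spark SQL settings; the values are assumptions to adapt, not recommendations.
JOIN_CONF = {
    "spark.sql.adaptive.enabled": "true",            # adaptive query execution
    "spark.sql.adaptive.skewJoin.enabled": "true",   # let AQE split skewed partitions
    "spark.sql.autoBroadcastJoinThreshold": str(100 * 1024 * 1024),  # 100 MB in bytes
}


def apply_conf(spark, conf=JOIN_CONF):
    """Set each option on a SparkSession through spark.conf.set."""
    for key, value in conf.items():
        spark.conf.set(key, value)


def broadcast_join(large_df, small_df, key):
    """Inner join with an explicit broadcast hint on the small side."""
    from pyspark.sql.functions import broadcast  # imported lazily; requires pyspark
    return large_df.join(broadcast(small_df), on=key, how="inner")
```

After running a query, the SQL tab should show BroadcastHashJoin in place of SortMergeJoin if the hint or threshold took effect.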
5️⃣ 𝗖𝗼𝗹𝗱 𝗖𝗮𝗰𝗵𝗲 & 𝗦𝗹𝗼𝘄 𝗜/𝗢
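The first read from cloud storage is slow; later reads served from the local disk cache are fast. Note that Spark UI > Storage only lists data persisted with cache() or persist(); the Databricks disk cache is tracked separately. Fix: Use Delta Lake with Liquid Clustering and run OPTIMIZE to compact small files and improve data skipping. Photon can speed up Parquet scans as well, though the 3-5x often quoted depends heavily on the workload.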
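A minimal sketch of the Delta maintenance step, assuming a hypothetical table sales.events clustered on event_date and customer_id. The helper names are made up; the statements use Databricks SQL syntax (ALTER TABLE ... CLUSTER BY, OPTIMIZE), and run_maintenance expects a Databricks SparkSession.

```python
def maintenance_statements(table, cluster_cols):
    """Build the SQL to set Liquid Clustering keys on a Delta table and compact it."""
    cols = ", ".join(cluster_cols)
    return [
        f"ALTER TABLE {table} CLUSTER BY ({cols})",
        f"OPTIMIZE {table}",
    ]


def run_maintenance(spark, table, cluster_cols):
    """Enable the disk cache, then execute the statements on the session."""
    spark.conf.set("spark.databricks.io.cache.enabled", "true")  # Databricks disk cache
    for stmt in maintenance_statements(table, cluster_cols):
        spark.sql(stmt)


# Hypothetical table and column names.
statements = maintenance_statements("sales.events", ["event_date", "customer_id"])
```

Pick clustering columns that match your most common filters, and only pass trusted table and column names since the SQL is built by string formatting.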
---
Most engineers overcomplicate this, but the bulk of slowness traces back to these 5 patterns. Spark jobs get fast once you diagnose and fix them systematically.
Which one hit you the hardest? Drop a number (1-5) 👇