Jakub Lasak (@jakublasak)

Make money doing the work you believe in

Databricks jobs are often slow for five core reasons!

1️⃣ 𝗗𝗮𝘁𝗮 𝗦𝗸𝗲𝘄 (𝗛𝗼𝘁 𝗞𝗲𝘆𝘀)

When a few tasks take hours while most finish in minutes, you have skewed partitions. Check Spark UI > Stages for task duration variance. Fix: Use salting on hot keys or broadcast joins for small tables.
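A rough sketch of both steps in PySpark: the first helper puts a number on the skew you see in the Stages tab, the second salts a hot key before joining. The names `skew_ratio` and `salted_join` and the default of 16 buckets are illustrative assumptions, not anything Spark or Databricks ships.

```python
from statistics import median


def skew_ratio(task_seconds: list[float]) -> float:
    """Max task duration divided by the median; values far above 1 suggest skew."""
    return max(task_seconds) / median(task_seconds)


def salted_join(big, small, key: str, buckets: int = 16):
    """Join two Spark DataFrames on `key`, spreading each hot key over `buckets` salts.

    Hypothetical helper: `big` is the skewed side, `small` is the other side.
    """
    # Imported here so skew_ratio stays usable without a Spark installation.
    from pyspark.sql import functions as F

    # Skewed side: give each row a random salt in [0, buckets).
    salted_big = big.withColumn("salt", (F.rand() * buckets).cast("int"))
    # Other side: replicate every row once per salt so each salted key still matches.
    all_salts = F.array(*[F.lit(i) for i in range(buckets)])
    salted_small = small.withColumn("salt", F.explode(all_salts))
    return salted_big.join(salted_small, on=[key, "salt"]).drop("salt")
```

Salting multiplies the row count of the non-skewed side by the bucket count, so keep the number of buckets modest.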

2️⃣ 𝗠𝗲𝗺𝗼𝗿𝘆 𝗦𝗽𝗶𝗹𝗹 𝘁𝗼 𝗗𝗶𝘀𝗸

Executors spilling data to disk because it does not fit in memory kills performance. Check Spark UI > Stages for spill metrics greater than zero. Fix: Increase spark.executor.memory, use fewer cores per executor, or enable Photon. Aim for zero spill.
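Why fewer cores per executor helps: tasks running on the same executor share its memory. A back-of-the-envelope sketch, assuming Spark's default unified memory settings (300 MB reserved, spark.memory.fraction = 0.6) and an even split across concurrently running tasks. Real allocation is dynamic, so treat the numbers as a rough guide; the function name is made up for illustration.

```python
RESERVED_MB = 300  # memory Spark sets aside before applying spark.memory.fraction


def unified_memory_per_task_mb(
    executor_memory_mb: float, cores: int, memory_fraction: float = 0.6
) -> float:
    """Approximate execution + storage memory available to each concurrent task."""
    usable_mb = (executor_memory_mb - RESERVED_MB) * memory_fraction
    # Assumption: one task per core, each getting an even share of the pool.
    return usable_mb / cores


# Same 16 GB executor, half the cores: each task gets twice the memory.
eight_cores = unified_memory_per_task_mb(16 * 1024, cores=8)
four_cores = unified_memory_per_task_mb(16 * 1024, cores=4)
print(f"8 cores: {eight_cores:.0f} MB per task, 4 cores: {four_cores:.0f} MB per task")
```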

3️⃣ 𝗪𝗿𝗼𝗻𝗴 𝗖𝗹𝘂𝘀𝘁𝗲𝗿 𝗦𝗶𝘇𝗶𝗻𝗴

Undersized clusters run forever; oversized ones waste money. Monitor CPU usage in the cluster's Metrics tab, and check Spark UI > Executors for task and GC time. High idle time means too big; maxed CPU means too small.
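The rule of thumb above, written out as a tiny classifier. The 30% and 90% thresholds and the function name are assumptions picked for illustration; tune them to your own workloads.

```python
def sizing_hint(cpu_utilization: list[float], low: float = 0.3, high: float = 0.9) -> str:
    """Classify a cluster from CPU utilization samples in the range 0.0 to 1.0."""
    average = sum(cpu_utilization) / len(cpu_utilization)
    if average < low:
        return "oversized"   # mostly idle: paying for capacity the job never uses
    if average > high:
        return "undersized"  # pegged CPU: the job is waiting on compute
    return "ok"
```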

4️⃣ 𝗜𝗻𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝘁 𝗝𝗼𝗶𝗻𝘀 & 𝗦𝗵𝘂𝗳𝗳𝗹𝗲𝘀

Massive shuffles from sort-merge joins bottleneck performance. Check join strategy in Spark UI > SQL tab. Fix: Use broadcast joins for tables under 1GB and make sure AQE is on (spark.sql.adaptive.enabled=true, the default since Spark 3.2).
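A minimal PySpark sketch of the two fixes. `join_tuning_conf` and `broadcast_join` are hypothetical helper names; the config keys and the `broadcast` hint are standard Spark. The 10 MB default mirrors Spark's stock spark.sql.autoBroadcastJoinThreshold; an explicit hint broadcasts the table regardless of that threshold.

```python
def join_tuning_conf(broadcast_limit_mb: int = 10) -> dict[str, str]:
    """Session settings that enable AQE and set the automatic broadcast threshold."""
    return {
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.autoBroadcastJoinThreshold": str(broadcast_limit_mb * 1024 * 1024),
    }


def broadcast_join(spark, big, small, key: str):
    """Join `big` to `small` with an explicit broadcast hint on the small side."""
    # Imported here so join_tuning_conf can be used without a Spark installation.
    from pyspark.sql.functions import broadcast

    for name, value in join_tuning_conf().items():
        spark.conf.set(name, value)
    joined = big.join(broadcast(small), on=key, how="inner")
    joined.explain()  # the plan should show BroadcastHashJoin, not SortMergeJoin
    return joined
```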

5️⃣ 𝗖𝗼𝗹𝗱 𝗖𝗮𝗰𝗵𝗲 & 𝗦𝗹𝗼𝘄 𝗜/𝗢

First run from cloud storage is slow; subsequent runs from local cache are fast. Check Spark UI > Storage for cache status. Fix: Use Delta Lake with Liquid Clustering and run OPTIMIZE. Photon can speed up Parquet reads by roughly 3-5x, depending on the workload.
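For reference, a sketch of the Liquid Clustering and OPTIMIZE step as Databricks SQL statements built from Python. The table and column names are placeholders and the helper is hypothetical; run each statement with spark.sql(...) in a notebook.

```python
def liquid_clustering_statements(table: str, cluster_columns: list[str]) -> list[str]:
    """Build the SQL that sets clustering keys on a Delta table, then compacts it."""
    columns = ", ".join(cluster_columns)
    return [
        f"ALTER TABLE {table} CLUSTER BY ({columns})",
        f"OPTIMIZE {table}",
    ]


# Placeholder names; in a notebook: for stmt in statements: spark.sql(stmt)
statements = liquid_clustering_statements("sales.events", ["customer_id", "event_date"])
for stmt in statements:
    print(stmt)
```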

---

Most engineers overcomplicate this, but roughly 80% of slowness comes from these 5 patterns. Spark jobs can be extremely fast once you systematically diagnose and fix them.

Which one hit you the hardest? Drop a number (1-5) 👇

Nov 12

11:03 AM
