Phi Vu Trinh (@vutr):

In 2026, Apache Spark will still be one of the dominant data processing engines.

Here is a list of articles to help you dive deep into this famous engine:

◉ Apache Spark overview: Architecture, Job, Stage, Task, RDD, the journey of the Spark application

◉ Spark resource allocation: Static vs dynamic allocation, FIFO vs FAIR scheduling mode

◉ Spark scheduling process: from your code to physical execution on executors

◉ Spark planning process: Catalyst, logical vs physical planning, Adaptive Query Execution in Spark 3

◉ Spark's memory management: On-heap and Off-heap memory

◉ Databricks' Spark vs open-source Spark: Spark + the Photon engine to boost query performance

◉ PySpark: Spark is written in Scala, so how can we use Python with it?

◉ Spark Structured Streaming: the micro-batch processing engine

◉ Spark Connect: process data in Spark by making an API request instead of submitting an application

Hope they can help you on your Spark learning journey.

I'm writing articles for 17,000+ data engineers worldwide. Join the community for FREE at vutr.substack.com

Jan 5

8:48 AM