Understanding Spark Connect (Reynold Xin’s keynote at Data+AI Summit 2022)

Stan Lin
3 min read · Jul 3, 2022

Origin: https://coderstan.com/2022/07/02/spark-connect-reynold-xins-keynote-dataai-summit-2022/

Reynold gave a great intro to some recent updates in the Spark community, along with some exciting new features. Here I am going to take a stab at explaining a new feature called “Spark Connect”.

What is “Spark Connect” in short?

“Spark Connect” introduces a “thin client” that enables Spark query capability on low-compute devices, backed by a re-architected Spark Driver that works around shortcomings of the monolithic driver and better supports multi-tenancy.
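To make that concrete, here is a minimal sketch of what a thin-client session looks like, assuming the PySpark API shape that later shipped with Spark Connect (Spark 3.4+); the endpoint URL and table name are placeholders, not anything from the keynote:

```python
# Minimal sketch of a Spark Connect thin client (PySpark 3.4+ API shape).
# The sc:// endpoint and table name below are hypothetical placeholders.
from pyspark.sql import SparkSession

# Instead of starting a local driver JVM, connect to a remote Spark Connect
# endpoint over gRPC; the client only builds unresolved query plans.
spark = SparkSession.builder.remote("sc://spark-cluster.example.com:15002").getOrCreate()

# The DataFrame API looks the same; execution happens on the server,
# and only results are streamed back to the low-compute client.
df = spark.read.table("sales")
df.groupBy("region").count().show()
```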

What should you use Spark Connect for?

  • Run Spark jobs from your app or from low-compute devices.
  • Better experience on multi-tenant clusters.
  • Fewer OOMs: manage your Spark job’s memory in the client for explicit resource allocation.
  • Avoid dependency conflicts: dependencies are defined in the client, per application.
  • Decoupled upgradability: Spark server and client versions can be updated independently.
  • Powerful debuggability: step-through debugging from the client.
  • Better observability: access logs/metrics in the client (instead of viewing them in centralized log storage).

My take on Spark Connect

Overall, I think Spark Connect addresses the isolation and asynchronicity parts of the problem. It is a very good feature to have in terms of enabling more real-world use cases. However, I would be cautious about getting into the business of multi-tenancy before addressing some fundamental requirements: queuing and scaling. Here are a couple of problems I faced in H1 CY22 on a multi-tenant Spark cluster:

Missing queuing mechanism to handle spiky job traffic

Spark is not very good at handling “spiky traffic”. This is a big problem in my daily work, which requires processing thousands of Spark jobs at M365 scale. As far as I can tell, there is no good mechanism to smooth out the traffic of 10k parallel jobs and have them queue up on the cluster. This has led us to build a customized job rate control solution on top of our Spark cluster (HDInsight).
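The pattern we ended up with boils down to something like the sketch below: a simplified, hypothetical Python version (not our production code), where a bounded worker pool drains a queue of pending job payloads so that only a fixed number of submissions hit the cluster at once. The submit_job function and the concurrency limit are placeholders.

```python
# Simplified sketch of client-side job rate control (hypothetical; not the
# actual production implementation). A bounded pool drains a queue of pending
# job payloads so only max_concurrent submissions hit the cluster at a time.
import queue
import threading

def submit_job(payload):
    """Placeholder for the real submission call (e.g. a Livy POST /batches)."""
    print(f"submitting {payload}")

def rate_controlled_submit(payloads, max_concurrent=50):
    pending = queue.Queue()
    for p in payloads:
        pending.put(p)

    def worker():
        while True:
            try:
                payload = pending.get_nowait()
            except queue.Empty:
                return
            submit_job(payload)  # blocks until the cluster accepts the job
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(max_concurrent)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# 10k jobs arrive at once, but only 50 submissions are in flight at a time.
rate_controlled_submit([{"job_id": i} for i in range(10_000)], max_concurrent=50)
```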

Apache Livy job submission throttling

This is yet another problem that gave me headaches in H1 CY22 while scaling out to process graph embedding training for 50K tenants. Our compliant clusters use Apache Livy as a proxy service to deliver job payloads to the Spark cluster. In H1 I was able to submit roughly 200 parallel jobs in one go without taking down the cluster. However, I look forward to growing that number 100x, since I don’t see the point of capping the job submission rate while there are plenty of resources to host more jobs.
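For context, submissions in this kind of setup go through Livy’s REST batch API, roughly as in the sketch below; the Livy host, application jar, and main class are hypothetical placeholders, and the throttling shows up as these POST calls piling up once you push past a few hundred in flight:

```python
# Sketch of submitting one Spark job through Apache Livy's REST batch API.
# The Livy host, application jar, and main class are hypothetical placeholders.
import requests

livy_url = "http://livy.example.com:8998/batches"
payload = {
    "file": "abfs://jobs/graph-embedding-trainer.jar",  # placeholder artifact
    "className": "com.example.GraphEmbeddingTrainer",   # placeholder entry point
    "args": ["--tenant-id", "contoso"],
    "conf": {"spark.executor.memory": "8g"},
}

# Each tenant's training run becomes one POST /batches call; with tens of
# thousands of tenants, these calls are what the proxy ends up throttling.
resp = requests.post(livy_url, json=payload, headers={"Content-Type": "application/json"})
resp.raise_for_status()
print("Livy batch id:", resp.json()["id"])
```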

Originally published at http://coderstan.com on July 3, 2022.
