Understanding Spark Connect (Reynold Xin’s keynote at Data+AI Summit 2022)

Stan Lin
3 min read · Jul 3, 2022

Origin: https://coderstan.com/2022/07/02/spark-connect-reynold-xins-keynote-dataai-summit-2022/

Reynold gave a great intro to some recent updates in the Spark community, along with some exciting new features. Here I am going to take a stab at explaining a new feature called “Spark Connect”.

What is “Spark Connect” in short?

“Spark Connect” introduces a “thin client” that enables Spark query capability on low-compute devices, backed by a re-architected Spark Driver that works around shortcomings of the monolithic driver and better supports multi-tenancy.
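To make that concrete, here is a minimal sketch of what a thin-client session looks like, assuming the PySpark API shape that later shipped with Spark Connect (Spark 3.4+); the endpoint URL and table name are placeholders, not anything from the keynote:

```python
# Minimal sketch of a Spark Connect thin client (PySpark 3.4+ API shape).
# The sc:// endpoint and table name below are hypothetical placeholders.
from pyspark.sql import SparkSession

# Instead of starting a local driver JVM, connect to a remote Spark Connect
# endpoint over gRPC; the client only builds unresolved query plans.
spark = SparkSession.builder.remote("sc://spark-cluster.example.com:15002").getOrCreate()

# The DataFrame API looks the same; execution happens on the server,
# and only results are streamed back to the low-compute client.
df = spark.read.table("sales")
df.groupBy("region").count().show()
```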

What should you use Spark Connect for?

  • Run Spark jobs from your app or from low-compute devices.
  • Better experience on multi-tenant clusters.
  • Fewer OOMs: manage your Spark job’s memory in the client for explicit resource allocation.
  • Avoid dependency conflicts: dependencies are defined in the client, per application.
  • Decoupled upgradability: Spark server and client versions can be updated independently.
  • Powerful debuggability: step-through debugging from the client.
  • Better observability: access logs/metrics in the client (instead of viewing them in centralized log storage).

My take on Spark Connect

Overall, I think Spark Connect addresses the isolation and asynchronicity parts of the problem. It is a very good feature to have in terms of enabling more real-world use cases. However, I would be cautious about getting into the business of multi-tenancy before addressing some fundamental requirements: queuing and scaling. Here are a couple of problems I faced in H1 CY22 on a multi-tenant Spark cluster:

Missing queuing mechanism to handle spiky job traffic

Spark is not very good at handling “spiky traffic”. This is a big problem in my daily work, which requires processing thousands of Spark jobs at M365 scale. As far as I can tell, there is no good mechanism to smooth out the traffic of 10k parallel jobs and have them queue up on the cluster. This has led us to build a customized job rate control solution on top of our Spark cluster (HDInsight).
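The pattern we ended up with boils down to something like the sketch below: a simplified, hypothetical Python version (not our production code), where a bounded worker pool drains a queue of pending job payloads so that only a fixed number of submissions hit the cluster at once. The submit_job function and the concurrency limit are placeholders.

```python
# Simplified sketch of client-side job rate control (hypothetical; not the
# actual production implementation). A bounded pool drains a queue of pending
# job payloads so only max_concurrent submissions hit the cluster at a time.
import queue
import threading

def submit_job(payload):
    """Placeholder for the real submission call (e.g. a Livy POST /batches)."""
    print(f"submitting {payload}")

def rate_controlled_submit(payloads, max_concurrent=50):
    pending = queue.Queue()
    for p in payloads:
        pending.put(p)

    def worker():
        while True:
            try:
                payload = pending.get_nowait()
            except queue.Empty:
                return
            submit_job(payload)  # blocks until the cluster accepts the job
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(max_concurrent)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# 10k jobs arrive at once, but only 50 submissions are in flight at a time.
rate_controlled_submit([{"job_id": i} for i in range(10_000)], max_concurrent=50)
```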

Apache Livy job submission throttling

This is yet another problem that gave me headaches in H1 CY22 while scaling out to process graph embedding training for 50K tenants. Our compliant clusters use Apache Livy as a proxy service to deliver job payloads to the Spark cluster. In H1 I was able to submit roughly 200 parallel jobs in one go without taking down the cluster. However, I look forward to growing that number 100x, since I don’t see the point of capping the job submission rate while there are plenty of resources to host more jobs.
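For context, submissions in this kind of setup go through Livy’s REST batch API, roughly as in the sketch below; the Livy host, application jar, and main class are hypothetical placeholders, and the throttling shows up as these POST calls piling up once you push past a few hundred in flight:

```python
# Sketch of submitting one Spark job through Apache Livy's REST batch API.
# The Livy host, application jar, and main class are hypothetical placeholders.
import requests

livy_url = "http://livy.example.com:8998/batches"
payload = {
    "file": "abfs://jobs/graph-embedding-trainer.jar",  # placeholder artifact
    "className": "com.example.GraphEmbeddingTrainer",   # placeholder entry point
    "args": ["--tenant-id", "contoso"],
    "conf": {"spark.executor.memory": "8g"},
}

# Each tenant's training run becomes one POST /batches call; with tens of
# thousands of tenants, these calls are what the proxy ends up throttling.
resp = requests.post(livy_url, json=payload, headers={"Content-Type": "application/json"})
resp.raise_for_status()
print("Livy batch id:", resp.json()["id"])
```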

Originally published at http://coderstan.com on July 3, 2022.
