The app for independent voices

I talked with some data engineering students this week about orchestration, and in particular about Dagster, and I put together a toy model talking about this trend towards built-in lineage tracking in orchestrators.

First, you need a generic model of a data product — I have:

  • physical layer is data and compute

  • orchestration ties these two together at a specific time

  • asset definition — some might call this semantic — is where the boundaries and decisions about what constitutes the thing are defined

  • serving layer is the interface into it

An easy example is a table on Snowflake:

  • physical: storage and compute needed to access it

  • orchestration: query planner, runtime, etc. that know how to access those resources

  • asset definition: logic that created the table and metadata around it

  • serving: snowflake ui and other ways of accessing it.

Once you have this basic model in place, you can start looking across the platform and identifying multiple different stacks of these products. Within a system, such as Stitch, Snowflake, Sagemaker, Rockset, you would have all these products with data dependencies.

Classically, there would be no explicit dependencies — they would be siloed.

With Airflow and other orchestration tools, you bring these together in a single orchestration tool, but it does not meaningfully extend into the definitional layer.

With Dagster, you start pulling these definitions together into a single plane. ML model registries also do this, as does dbt — they just only apply within certain types of systems.

One of the students made an astute observation: on the top and bottom of the modules (serving / physical), there is a massive heterogeneity, so trying to merge across parts of the stack is extremely challenging. (We won’t get to a single serving layer anytime soon.)

That’s one reason catalogs tend to become so challenging to understand so quickly — in many cases we have different serving derivatives of the same asset getting proliferated across the organization. Focusing on the more generic asset definitions is a higher leverage way to solve this problem of lineage tracing.

Apr 30, 2023
at
11:25 AM

Log in or sign up

Join the most interesting and insightful discussions.