Serving ML models is one of the most complex steps of putting AI/ML into production: you have to wire all the pieces together into a unified system while considering:
throughput/latency requirements;
infrastructure costs;
data and model access;
training-serving skew.
Because we started this project with production in mind by building on the Hopsworks AI Lakehouse, we sidestep most of these issues:
the query and ranking models are accessed from the model registry;
the customer and H&M article features are accessed from the feature store through its offline or online store, depending on throughput/latency requirements;
the features come from a single source of truth (the feature store), which eliminates training-serving skew.
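To make the offline/online store split concrete, here is a minimal sketch in plain Python of the routing decision. The function name and the cutoff value are illustrative assumptions, not part of the Hopsworks API; in practice, the same feature view exposes both stores and you pick one per pipeline.

```python
def pick_store(max_latency_ms: float, online_cutoff_ms: float = 500.0) -> str:
    """Route a feature read by latency budget.

    Tight budgets (real-time inference) go to the online key-value store;
    batch jobs with no tight budget read from the offline columnar store.
    The 500 ms cutoff is an assumed, illustrative threshold.
    """
    return "online" if max_latency_ms <= online_cutoff_ms else "offline"


# Online inference pipeline: ~100 ms end-to-end budget -> online store.
print(pick_store(100))          # online
# Offline (batch) inference pipeline: latency is not a concern -> offline store.
print(pick_store(3_600_000))    # offline
```

The point of the sketch is that the decision depends only on the pipeline's latency budget, while both paths read the exact same feature definitions.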
Estimating infrastructure costs for a PoC is harder to pin down. Still, we will rely on a Kubernetes cluster managed by Hopsworks, which uses KServe to scale our real-time personalized recommender up and down with traffic.
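As a rough illustration of what that scaling behavior looks like, a KServe `InferenceService` manifest can declare replica bounds and a concurrency target; the names, image, and numbers below are assumptions for the sketch, not the project's actual manifest:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: hm-recommender          # hypothetical service name
spec:
  predictor:
    minReplicas: 1              # scale down to one replica at low traffic
    maxReplicas: 4              # cap cost during traffic spikes
    scaleMetric: concurrency    # autoscale on in-flight requests
    scaleTarget: 10             # target concurrent requests per replica
    containers:
      - name: kserve-container
        image: registry.example.com/hm-recommender:latest  # hypothetical image
```

With a config like this, KServe adds replicas when per-replica concurrency exceeds the target and removes them as traffic drops, which is what keeps PoC infrastructure costs proportional to actual load.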
Thus, in this lesson, you will learn how to:
Architect offline and online inference pipelines using MLOps best practices.
Implement offline and online pipelines for an H&M real-time personalized recommender.
Deploy the online inference pipeline using the KServe engine.
Test the H&M personalized recommender from a Streamlit app.
Deploy the offline ML pipelines using GitHub Actions.