Serving ML models is one of the most complex steps of putting AI/ML into production: you have to wire all the pieces together into a unified system while considering:
throughput/latency requirements;
infrastructure costs;
data and model access;
training-serving skew.
Because we started this project with production in mind by building on the Hopsworks AI Lakehouse, we sidestep most of these issues:
the query and ranking models are accessed from the model registry;
the customer and H&M article features are accessed from the feature store through its offline or online store, depending on throughput/latency requirements;
the features come from a single source of truth (the feature store), which eliminates training-serving skew.
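To make the offline/online store split concrete, here is a minimal sketch in plain Python of the routing decision. The function name and the cutoff value are illustrative assumptions, not part of the Hopsworks API; in practice, the same feature view exposes both stores and you pick one per pipeline.

```python
def pick_store(max_latency_ms: float, online_cutoff_ms: float = 500.0) -> str:
    """Route a feature read by latency budget.

    Tight budgets (real-time inference) go to the online key-value store;
    batch jobs with no tight budget read from the offline columnar store.
    The 500 ms cutoff is an assumed, illustrative threshold.
    """
    return "online" if max_latency_ms <= online_cutoff_ms else "offline"


# Online inference pipeline: ~100 ms end-to-end budget -> online store.
print(pick_store(100))          # online
# Offline (batch) inference pipeline: latency is not a concern -> offline store.
print(pick_store(3_600_000))    # offline
```

The point of the sketch is that the decision depends only on the pipeline's latency budget, while both paths read the exact same feature definitions.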
Estimating infrastructure costs for a PoC is harder to pin down. Still, we will rely on a Kubernetes cluster managed by Hopsworks, which uses KServe to scale our real-time personalized recommender up and down with traffic.
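As a rough illustration of what that scaling behavior looks like, a KServe `InferenceService` manifest can declare replica bounds and a concurrency target; the names, image, and numbers below are assumptions for the sketch, not the project's actual manifest:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: hm-recommender          # hypothetical service name
spec:
  predictor:
    minReplicas: 1              # scale down to one replica at low traffic
    maxReplicas: 4              # cap cost during traffic spikes
    scaleMetric: concurrency    # autoscale on in-flight requests
    scaleTarget: 10             # target concurrent requests per replica
    containers:
      - name: kserve-container
        image: registry.example.com/hm-recommender:latest  # hypothetical image
```

With a config like this, KServe adds replicas when per-replica concurrency exceeds the target and removes them as traffic drops, which is what keeps PoC infrastructure costs proportional to actual load.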
Thus, in this lesson, you will learn how to:
Architect offline and online inference pipelines using MLOps best practices.
Implement offline and online pipelines for an H&M real-time personalized recommender.
Deploy the online inference pipeline using the KServe engine.
Test the H&M personalized recommender from a Streamlit app.
Deploy the offline ML pipelines using GitHub Actions.