AWS SageMaker Series: Pipelines In Action
In this blog series, we will give you a brief overview of Amazon SageMaker and share some of our experience working with the latest SageMaker capabilities to create an end-to-end solution for developing, maintaining, and supporting scalable and retrainable machine learning models in the cloud. Throughout, we will use only standard, well-tested, and scalable cloud components from AWS, together with machine learning best practices.
In the first article of the series, we will dive into the SageMaker Abalone MLOps pipeline example and show how we enhanced this out-of-the-box template to suit our own use case. We will then demonstrate some more complex pipeline capabilities using the Kaggle Lending Club dataset, and touch on how we set up and customized the deployment process with SageMaker Inference Pipelines.
Amazon SageMaker is a fully managed machine learning platform that provides data scientists with all the necessary tools to build, train, and deploy models while adhering to the standard ML workflow practices.
SageMaker Pipelines allows you to set up an automated workflow for publishing machine learning models and for versioning and deploying them. A pipeline consists of steps that run in a defined order to train and publish a machine learning model according to criteria set by the data science team, and the pipeline itself should be modeled after a well-defined machine learning workflow.
The Abalone template showcases several of the pipeline steps that can be configured when building an ML workflow with SageMaker Pipelines. It solves the abalone sea snail age prediction problem from physical measurements, which are easier to obtain than cutting the shell through the cone, staining it, and counting the number of rings under a microscope. The template includes steps for raw data preprocessing, training an XGBoost regression model, evaluating the model on the test set, and conditionally registering the model and launching a batch transform job.
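To make the control flow concrete, the sketch below reproduces the train, evaluate, and conditional-registration logic of the Abalone template locally. It is only an illustration of the pattern, not SageMaker code: scikit-learn's GradientBoostingRegressor stands in for the built-in XGBoost algorithm, a plain `if` stands in for the condition step, and the dataset, threshold, and variable names are all illustrative.

```python
# Illustrative sketch of the Abalone pipeline's train -> evaluate -> conditional
# registration flow, run locally. GradientBoostingRegressor is a stand-in for
# the SageMaker built-in XGBoost algorithm; names and threshold are invented.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# "Preprocessing" step: split the (synthetic) curated data into train and test.
X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Training" step.
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

# "Evaluation" step: compute the metric the condition step will inspect.
mse = mean_squared_error(y_test, model.predict(X_test))

# "Condition" step: only register the model if the metric beats a threshold.
MSE_THRESHOLD = 10_000.0  # illustrative acceptance criterion
registry = []
if mse <= MSE_THRESHOLD:
    registry.append({"model": model, "mse": mse})
```

In the real template each of these stages is a separate pipeline step running in its own container, and "registering" means adding the model to the SageMaker Model Registry rather than appending to a list.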
The Abalone pipeline provides a very standard example of the end-to-end workflow capabilities of SageMaker Pipelines, but this pipeline is not suitable for all use cases. There is definitely room and flexibility to customize additional pipeline steps. Our goal is to adapt the Abalone pipeline example and create a more tailored solution.
Lending Club is the second use case in this blog series. Lending Club is a peer-to-peer lending company based in the US that matches people looking to invest money with people looking to borrow money.
We will be using a dataset containing Lending Club customer information from 2007-2010 covering loans, credit scores, and financial inquiries, among other attributes. With these financial features, we will try to solve a binary classification problem: predicting whether or not customers will fully pay back their loans on time. The dataset also contains labels indicating whether a customer did in fact pay back their loan. An open-source version of this dataset can be found on Kaggle.
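A minimal local sketch of this classification task is shown below. The synthetic columns (`fico`, `int_rate`, `dti`, `not_fully_paid`) only mimic the kind of features found in the Kaggle dataset; they are generated on the fly and are not the real Lending Club data.

```python
# Toy version of the binary classification problem: predict whether a customer
# will not fully pay back their loan. All data here is synthetic; the column
# names merely mimic the kind of financial features in the Kaggle dataset.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "fico": rng.integers(600, 850, n),       # credit score
    "int_rate": rng.uniform(0.06, 0.22, n),  # interest rate on the loan
    "dti": rng.uniform(0, 30, n),            # debt-to-income ratio
})
# Synthetic label: lower credit scores and higher rates make default likelier.
logits = -0.01 * (df["fico"] - 700) + 10 * (df["int_rate"] - 0.14)
df["not_fully_paid"] = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["fico", "int_rate", "dti"]], df["not_fully_paid"], random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
```

In the pipeline we build later, the same task will be handled by the SageMaker built-in XGBoost algorithm rather than a local logistic regression.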
With SageMaker Pipelines, we can construct a pipeline that can be continuously improved, retrained on new data, and easily pushed to production. Additionally, each model iteration during development is stored in a versioned development repository, making it possible to also roll back to previous model versions in production.
Improving the Abalone Pipeline
For our Lending Club problem, we will demonstrate a slightly different approach to the data preprocessing at the beginning of the pipeline by fitting a custom SKLearn preprocessing model. We will show how to incorporate a hyperparameter tuning step to find the best-performing model. Here, we can still use the SageMaker built-in XGBoost algorithm, but we will adjust the tuning and evaluation steps to support our classification task.
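The pattern described above, fitting a custom SKLearn preprocessing model and then searching hyperparameters for the best-performing model, can be sketched locally as follows. In SageMaker the search would run as a tuning step over training jobs; here scikit-learn's GridSearchCV plays that role, and all dataset, estimator, and parameter names are illustrative.

```python
# Local analogue of the tuning pattern: a fitted preprocessing model chained
# with a classifier, plus a hyperparameter search to pick the best model.
# GridSearchCV stands in for a SageMaker tuning step; names are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

pipeline = Pipeline([
    ("preprocess", StandardScaler()),          # custom preprocessing model
    ("model", GradientBoostingClassifier(random_state=0)),
])
param_grid = {                                 # search space, as a tuner would define
    "model__max_depth": [2, 3],
    "model__n_estimators": [50, 100],
}
# Score candidates on ROC AUC, the kind of metric suited to our classification task.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
best_params = search.best_params_
```

The key design point carries over to SageMaker: the preprocessing model is fitted once and reused at inference time, so training and serving see identically transformed features.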
We will also explain how to further improve the pipeline by including custom processing steps that mitigate bias at various points in the workflow, using the open-source smclarify package.
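Two of the pre-training bias metrics that smclarify can compute are Class Imbalance (CI) and Difference in Proportions of Labels (DPL). Rather than relying on the package's API here, the sketch below implements the two definitions by hand on a toy dataset, with an illustrative facet column and label column, so it is clear what the metrics measure.

```python
# Hand-computed versions of two pre-training bias metrics used by
# SageMaker Clarify / smclarify, on a toy dataset. The facet ("gender")
# and label ("fully_paid") columns are illustrative.
import pandas as pd

df = pd.DataFrame({
    "gender": ["a", "a", "a", "a", "a", "a", "d", "d", "d", "d"],
    "fully_paid": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
})

advantaged = df[df["gender"] == "a"]
disadvantaged = df[df["gender"] == "d"]

# Class Imbalance (CI): normalized difference in facet group sizes,
# CI = (n_a - n_d) / (n_a + n_d).
n_a, n_d = len(advantaged), len(disadvantaged)
ci = (n_a - n_d) / (n_a + n_d)

# Difference in Proportions of Labels (DPL): gap in positive-outcome rates,
# DPL = q_a - q_d, where q is each group's share of positive labels.
q_a = advantaged["fully_paid"].mean()
q_d = disadvantaged["fully_paid"].mean()
dpl = q_a - q_d
```

Values near zero indicate balance; large positive values indicate the advantaged group is over-represented (CI) or more likely to receive the positive outcome (DPL). In later posts, custom processing steps will report such metrics before and after training.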
Environment Set Up
In order to get started with SageMaker, you need to create a SageMaker domain. A domain is a SageMaker tenant in your chosen region that allows you to configure global settings and create user profiles for accessing SageMaker Studio.
You will be asked to specify an execution role when creating your SageMaker domain and users. For a quick-start experience, you can let SageMaker create an IAM role for you with the AmazonSageMakerFullAccess policy attached. In practice, you might want to assign fine-grained roles to different types of users depending on the level of access and features they are allowed to use.
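For reference, such an execution role is trusted by the SageMaker service through a trust policy along these lines; the AmazonSageMakerFullAccess (or a fine-grained) permissions policy is attached separately on top of it:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "sagemaker.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```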
Amazon S3 is the core of your Lake House Architecture. It is where you will find valuable business datasets that can be processed and analyzed using SageMaker, and where the results of your data science activities are stored. We will add two S3 buckets to our data architecture to support our ML workflow:
The SageMaker Data Lake bucket stores all items needed to generate machine learning models. It contains curated datasets that will be used for data exploration and feature engineering, as well as the training data and scripts needed to create and evaluate our models.
The SageMaker Artifacts bucket stores the output of our data science activities, which can be used for creating business reports and insights. It contains versioned data products such as enriched datasets for business intelligence and trained machine learning models for inference on new data.
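One possible prefix layout for these two buckets is sketched below. The data-lake bucket name follows the sagemaker-data-lake naming used later in this post; the artifacts bucket name and all prefixes are illustrative assumptions, not a prescribed structure:

```text
s3://sagemaker-data-lake/
    curated/lending-club/     # gold datasets for exploration and feature engineering
    training/lending-club/    # train / validation / test splits
    scripts/                  # preprocessing and evaluation scripts

s3://sagemaker-artifacts/     # illustrative name
    models/lending-club/      # trained, versioned model artifacts
    reports/lending-club/     # evaluation reports and enriched BI datasets
```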
An essential requirement before starting a machine learning project is the availability of domain-specific datasets in your data lake. Building a robust data platform is out of scope for this blog series, but in general we recommend maintaining a well-kept AWS Glue Data Catalog that your data science team can use for ad-hoc data exploration with Amazon Athena, in order to build the most suitable dataset for their machine learning problem. This curated dataset, also known as a gold dataset, should land in your sagemaker-data-lake bucket for further analysis in SageMaker Studio.
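As an illustration, an ad-hoc Athena query that builds such a gold dataset might look like the following; the database, table, and column names are all hypothetical, and the query results would be written to the sagemaker-data-lake bucket:

```sql
-- Illustrative Athena query against a hypothetical Glue Data Catalog table.
SELECT
    fico,
    int_rate,
    dti,
    inq_last_6mths,
    not_fully_paid
FROM lending_club_db.loans
WHERE issue_year BETWEEN 2007 AND 2010;
```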
In conclusion, the default Abalone pipeline template serves as a good starting point for data scientists and engineers who want to automate the ML workflow with Amazon SageMaker. However, we also want to provide a guide to the various possibilities of SageMaker Pipelines and expand on this example by adding more advanced functionality.
It should also be pointed out that the pipeline that we will customize later on in this series is in no way set in stone. It is merely a way of rearranging the Abalone code to make it more flexible. Data scientists and engineers can of course apply the same logic for adding or removing steps in the pipeline for other use cases.
Be sure to keep an eye out for our next blog post, where we will continue by going into more detail about the specific steps that make up the Abalone pipeline, which is available out-of-the-box as a model build MLOps template when creating a new SageMaker Project.