Imbalanced datasets and the machine learning cycle

Detecting and overcoming imbalanced datasets

Imbalanced data

When looking to innovate and develop new solutions or products, companies are often faced with insufficient data to support their new ideas. Most of the data reflects what is already known – how customers behave when engaging with the brand or what is the range of sensor readings and output of the production machines in a factory. But how do you target a smaller yet important part of a customer base, even when having little data to work with? How do you quickly and correctly identify abnormal behavior in the machines in your factory, when the abnormal events are few and far between? This is what we’ll try to accomplish in this article.

At Anchormen we often need to tackle these challenges of imbalanced data: cases where the target classes we want to label are not well-represented in the data. We address these challenges in every stage of the model training pipeline, from sampling methods to optimizing on the relevant evaluation metrics.

For one of our customers, a clothing and accessories manufacturer, we wanted to train a model that identified a certain customer persona of a price-conscious but also very hip individual, who is up-to-date with the current fashion. A preliminary survey found that these customers make up no more than 10% of the customer base. Let’s say we’re already working with a small dataset of around 20,000 labeled customers. 10% of that is 2,000. This is a very small number of observations to use as a basis for an accurate consumer profile.

Up- and downsampling to the rescue

How do we make sure the model we train on imbalanced data picks up on the relevant features important to identify classes of different sizes? In order to handle uneven class representation we can manipulate the class proportions in the training data: downsampling and upsampling. In the first, instead of sampling training data that would reflect the class breakdown in the entire dataset, we can downsample the majority class so that its proportion is equal to that of the minority class, making it an equal representation. This should keep our machine learning model from focusing on the majority class when learning to identify the different classes.

But downsampling by itself won’t quiet our worries about the model not learning enough from the features important for identifying the minority class. While we can easily downsample the majority class and have enough data for training, evaluation and test sets, our pool of minority class data is much smaller. We can’t reuse it in the different sets as we need to be sure the model is robust to unseen data. Also, by undersampling we’re using less data in general, potentially missing important information.
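To make the downsampling idea concrete, here is a minimal sketch using NumPy on toy data. The helper name downsample_majority and the toy arrays are our own illustration (not part of any library), and the class sizes simply mirror the 20,000-customer example above.

```python
import numpy as np

def downsample_majority(X, y, random_state=0):
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()  # size of the minority class
    keep_idx = []
    for cls in classes:
        cls_idx = np.flatnonzero(y == cls)
        # sample without replacement so no row is duplicated
        keep_idx.append(rng.choice(cls_idx, size=n_keep, replace=False))
    keep_idx = np.sort(np.concatenate(keep_idx))
    return X[keep_idx], y[keep_idx]

# Toy data: 18,000 majority rows (class 0) and 2,000 minority rows (class 1)
y = np.array([0] * 18_000 + [1] * 2_000)
X = np.arange(len(y)).reshape(-1, 1)

X_bal, y_bal = downsample_majority(X, y)
print(np.bincount(y_bal))  # [2000 2000]
```

Note how 16,000 majority rows are thrown away to reach the balance, which is exactly the loss of information discussed above.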

Instead of downsampling, we can upsample the data by generating synthetic data based on the distribution of values in the minority class. In the SMOTE (Synthetic Minority Oversampling TEchnique) method we’ve used, the new data points were created by drawing vectors to the minority class data points’ nearest neighbours (the black points in the image below) and then calculating random points (the red points) on these vectors.

NRSBoundary smote
Image taken from Hu & Li (2013), A Novel Boundary Oversampling Algorithm Based on Neighborhood Rough Set Model: NRSBoundary-SMOTE.
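The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration of the idea only, not the imbalanced-learn implementation; the function name smote_sketch, its parameters and the toy data are assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, random_state=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(random_state)
    # Pairwise Euclidean distances between all minority points
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point

    base = rng.integers(0, len(X_min), size=n_new)              # random minority points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # random position on the vector
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 points in two dimensions
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=50)
print(X_new.shape)  # (50, 2)
```

Because every synthetic point lies on a line segment between two existing minority points, the new data stays within the region already covered by the minority class.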

It’s very important to apply SMOTE only to the training set, for the same reason we fit the normalization on the training set alone and only apply it to the test set. If we generate synthetic data based on the entire dataset (and especially on data points that end up in the test set), we can’t say the model was evaluated on unseen data: the model will indirectly have “seen” the test set, because the synthetic data points reflect some of the information captured by the test set data. This pitfall is called information leakage and is best avoided by splitting the data first and fitting every preprocessing step on the training set only, for example by using the scikit-learn Pipeline class (or the imbalanced-learn Pipeline when a resampling step such as SMOTE is involved), which allows you to chain transformers and estimators so that they are fitted on the training set and merely applied to the test set.
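One common way to keep preprocessing free of leakage is to learn its statistics from the training rows only and reuse them on the test rows. The sketch below shows this for standardization with plain NumPy; the toy data and the 75/25 split are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # toy feature matrix

# Split first, before any preprocessing touches the data
idx = rng.permutation(len(X))
train_idx, test_idx = idx[:750], idx[750:]

# Learn the scaling statistics from the training rows only
mean = X[train_idx].mean(axis=0)
std = X[train_idx].std(axis=0)

# Apply the same training statistics to both sets
X_train = (X[train_idx] - mean) / std
X_test = (X[test_idx] - mean) / std

print(X_train.mean(axis=0).round(3), X_test.mean(axis=0).round(3))
```

The training set ends up with a mean of exactly zero per column, while the test set is only approximately centred; that small mismatch is expected and is what evaluating on unseen data looks like.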

Evaluation metrics

So, we’ve trained our model on upsampled data to make the most of our desired minority class. How do we evaluate the model? In a previous blog post, we talked about the importance of choosing the right metric for the prediction problem we are solving. For example, for a client in the auto- and truck-parts manufacturing business, we wanted to predict the quality of the manufactured product based on input coming from production machines. Our focus was clearly the good products, but being accurate at identifying the low-quality products and the machine attributes that can correctly predict them was paramount to our client’s manufacturing processes. If low-quality products (Class 1 in the table below) make up about 5% of all the products and we’ve managed to correctly identify 90 good products (Class 0 in the table below) and 1 bad product out of 100 products, we get a 91% accuracy score for the model (= (True Positives + True Negatives) / All data points). But while this accuracy score sounds great, we have only spotted 1 out of 5 of the bad products, and since 5 good products were wrongly flagged as bad, our precision for the minority class is 1/6 or 16.67% (= True Positives / (True Positives + False Positives)).

evaluation metric table

This means that accuracy isn’t the evaluation metric for us. Precision could be, but only if we looked at the precision for the minority class. What we want to know about our model, however, is not just how many of the products we labelled as bad really are bad; we want to know how many of all of the bad products we were able to spot (and how many we’ve missed). This is our recall score, which points to the model’s sensitivity. In our example, we’ve identified 1 out of 5, which gives us a recall of 20% (= True Positives / (True Positives + False Negatives)), and this means we need to improve our model (or get more relevant data). Recall complements precision: precision says how many of your minority-class labels are correct, while recall says how many of the products in that class you’ve missed. Looking at both for each class gives us an idea of how well the model fares with respect to all classes.
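The arithmetic of the example can be checked with a few lines of Python. The four confusion counts are derived from the numbers in the text (100 products, 5 of them bad, 90 good and 1 bad product correctly identified).

```python
# Confusion counts for the example: 100 products, 5 of them bad (class 1)
tp = 1         # bad products correctly flagged
tn = 90        # good products correctly passed
fp = 95 - tn   # good products wrongly flagged as bad
fn = 5 - tp    # bad products we missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2%}, precision={precision:.2%}, recall={recall:.2%}")
# accuracy=91.00%, precision=16.67%, recall=20.00%
```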

With this in mind, we can train a model and optimize it based on these metrics. In our project we wanted to strike a balance between precision and recall, and so when we compared models and tuned the hyperparameters of our chosen Random Forest model, we looked for the parameters that yielded a model with the highest AUC score and highest precision-recall score.

Predicting client types: Case Study in Imbalanced Data

Let’s illustrate the benefits of over-sampling and striking the right balance between recall and precision with an open dataset. We’re using here a small dataset taken from Kaggle that includes information about various customers, such as sex, level of education and credit amount, with the target feature being good or bad customer. It’s not clear from the little background available for this dataset what the metrics are for classifying a customer as good or bad, but since this is an example dataset we use to discuss a different point, we’ll have to set aside this important issue. While we’re sidestepping this topic in this blog, it’s important to point out that addressing this issue is the first step we take in our preliminary scan of the project and data at Anchormen: We first strive to understand what the business questions and KPIs are, and how much insight pertaining to these business questions we can derive from the data available. If there isn’t a clear connection between the data and desired insights and if the data is questionable, a model with high precision and recall is of no use to our customers.

import matplotlib.pyplot as plt # plotting
import seaborn as sns # prettier plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import mlflow
from mlflow.models.signature import infer_signature
from mlflow.types.schema import Schema, ColSpec
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from hyperopt.pyll import scope
import hyperopt
from sklearn.metrics import recall_score, f1_score, roc_auc_score, plot_roc_curve, classification_report
import category_encoders
from imblearn.pipeline import Pipeline as SMOTEpipe
from imblearn.over_sampling import SMOTE, SMOTENC
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel, VarianceThreshold
%matplotlib inline
from IPython.display import Image
import warnings

Inspecting the data and feature engineering

After we’ve loaded the necessary libraries, we’ll read the file and inspect the dataset: There are 1723 observations and 14 columns.

df1 = pd.read_csv('clients.csv', delimiter=',')
df1.dataframeName = 'clients.csv'
print(f'There are {df1.shape[0]} rows and {df1.shape[1]} columns')

There are 1723 rows and 14 columns



We can easily identify the target feature as the column clearly named bad_client_target. We’ll next pick out the numeric columns and define them as numeric_features, to tell them apart from the categorical_features and binary_features, as these will be handled differently in the preprocessing step of the model pipeline. We won’t be using the month column: while the month in which the transaction or client data and label were recorded might correlate with the client’s label, we’re looking for predictive features of a client class (not necessarily when the fraud event was perhaps recorded).

When looking at the distribution of client labels, we notice a huge skew in the data: Only 11.38% of the observations we have are bad client labels. We have an imbalanced dataset with the class of concern being the minority class. This is where SMOTE will be of great help.

numeric_features = ['credit_amount', 'age', 'income']
binary_features = ['having_children_flg', 'is_client']
categorical_features = ['sex', 'education', 'product_type', 'family_status', 'phone_operator', 'region', 'credit_term']
features = numeric_features + categorical_features + binary_features
target = 'bad_client_target'
datetime_col = ['month']
df1.loc[:, target].value_counts()

0 1527
1 196
Name: bad_client_target, dtype: int64

We next look at correlations between features. We see that credit_amount is somewhat correlated with both income (0.37) and credit_term (0.50). This would be a good reason to drop credit_amount to avoid collinearity between the features, as some models aren’t well equipped to handle collinearity. We’ll be training a Random Forest model in this blog. One catch of this otherwise excellent algorithm is that correlated features may be given similar importance scores, and lower importance scores than they would get in a model that excluded a subset of these correlated features. This has to do with the algorithm’s procedure: each tree in the forest is a weak model trained on a random subset of the data, using random subsets of the features. Once one of the correlated features is used as a predictor, a subsequent correlated feature doesn’t add much to reduce the impurity (i.e. how likely we are to misclassify a random observation given the tree split using this feature), as this has already been done by the first correlated feature. While this isn’t usually a concern for model performance, as multiple trees with different sets of features will jointly make an accurate prediction, it is an issue in the step of model explanation.
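To make "reducing the impurity" more tangible, here is a small sketch that computes the Gini impurity of a node and the decrease achieved by one split. The labels and the split are made-up toy values, and the helper gini is our own illustration rather than a scikit-learn function.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance of mislabelling a random observation from this node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Toy node with 6 good clients (0) and 4 bad clients (1), split into two children
parent = [0] * 6 + [1] * 4
left = [0] * 5 + [1] * 1
right = [0] * 1 + [1] * 3

# Impurity after the split is the size-weighted average of the children
weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
decrease = gini(parent) - weighted_children

print(f"parent: {gini(parent):.3f}, after split: {weighted_children:.3f}, decrease: {decrease:.3f}")
# parent: 0.480, after split: 0.317, decrease: 0.163
```

A second feature that is strongly correlated with the one used for this split would separate the children in much the same way, so it has little impurity left to remove and ends up with a lower importance score.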

We’ll start by training the model on all features and calculate their importance to determine whether we should exclude ones with very low importance. As a side note, the number of features in this data set (12) is fairly small, especially in comparison with the number of features we had to handle in our project with the auto- and truck-part manufacturing client (upwards of 300). If you encounter the latter case, we’d recommend a feature selection step prior to model training. We’ve used the meta-transformer SelectFromModel from scikit-learn’s module feature_selection and set a lower threshold of 0.05 importance. The threshold we set was arbitrary, in order to yield a manageable number of features; but another, stricter strategy could be selecting just the features with importance scores above the mean importance score.
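A feature selection step of this kind could look roughly like the sketch below. It runs on synthetic data from make_classification rather than on our client data, and the estimator settings are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for a wide dataset: 30 features, only 5 of them informative
X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           n_redundant=0, random_state=42)

# Keep the features whose Random Forest importance is at least 0.05
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold=0.05)
X_selected = selector.fit_transform(X, y)
print(f"fixed threshold: {X.shape[1]} -> {X_selected.shape[1]} features")

# Alternative strategy: keep the features with above-average importance
selector_mean = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="mean").fit(X, y)
print(f"mean threshold: {X.shape[1]} -> {selector_mean.get_support().sum()} features")
```

In a real project the selector would be fitted on the training set only, ideally as a step inside the model pipeline.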


print(f"Correlation between credit_amount and the following features:\n{df1.corr().loc['credit_amount', ['income', 'credit_term']]}")

Correlation between credit_amount and the following features:
income 0.372995
credit_term 0.497040
Name: credit_amount, dtype: float64

Next we inspect the distribution plots of sampled columns, which tell us that credit_amount, age and income are right-skewed, and that credit_term looks more like a categorical than a continuous feature, with most observations concentrated on a few values, namely 12, 6, 10, 18, 24 and 3 (in that order of frequency).

df1[numeric_features + binary_features + [target]].hist(density=True, figsize=(10,10))

We turn next to categorical features and see that sex (Male, Female), family_status (Married, Unmarried, Another), and region (0 to 2) have 2-3 levels, which can be easily handled by any categorical feature preprocessing (e.g., one-hot encoding).

df1[categorical_features].agg(['count', 'size', 'nunique'])

Features with more than 5 levels are tricky for two main reasons:

  1. They may turn out to have high importance, as product_type does. The feature product_type comes out as important (see the model training part of this post) and has 22 levels (unique values)! This may result in the model predicting client type by the product category. This doesn’t tell us what the essential properties are of a client that would help us predict if they’re good or bad; rather, it tells us for which product categories we’re more likely to encounter a bad client. This will lead us to discriminate against the more frequent product categories with a large proportion of bad clients, and possibly misclassify actual bad clients in less frequent product categories. One way to avoid this danger is either excluding categorical features with a large number of levels, or reducing the number of levels by collapsing low-frequency levels into a new label (e.g. ‘other’).
  2. When using one-hot encoding or a label binarizer when handling categorical features, a large number of levels for the category features will result in a large number of features, which may lead to a high-dimensional space. The feature product_type will add 21 new features (the 22 levels minus the categorical column product_type we’ll drop).

We’ll be using a binary encoder from the category_encoders library to mitigate these two issues.

A binary encoder utilizes a different strategy than one-hot-encoding’s dummy variables or the binarized level labels:

  1. First, the encoder assigns the levels ordinal integer numbers
  2. Then, those integers are converted into binary code
  3. Finally, the digits of this binary code are split into separate columns.

As an example, binary encoding of the three-level category family_status would result in a two-dimensional space, as opposed to the one-hot-encoding dimensional space, which is always as large as the number of the category’s levels.
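The three steps can be sketched by hand in plain Python. This is an illustration of the idea rather than the category_encoders implementation; in particular, the ordinal numbers starting at 1 and the order in which levels are numbered are assumptions of this sketch.

```python
def binary_encode(values):
    """Hand-rolled binary encoding: level -> ordinal integer -> binary digits."""
    levels = list(dict.fromkeys(values))  # unique levels, in order of appearance
    ordinals = {level: i + 1 for i, level in enumerate(levels)}  # step 1 (starts at 1)
    width = max(ordinals.values()).bit_length()  # number of binary columns needed
    # steps 2 and 3: write each ordinal in binary and split the digits into columns
    return [[int(bit) for bit in format(ordinals[v], f"0{width}b")] for v in values]

family_status = ["Married", "Unmarried", "Another", "Married"]
for status, code in zip(family_status, binary_encode(family_status)):
    print(status, code)
# Married [0, 1]
# Unmarried [1, 0]
# Another [1, 1]
# Married [0, 1]
```

With the same scheme, the 22 levels of product_type fit into 5 binary columns instead of the 21 or 22 columns that one-hot encoding would need.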

Building the model pipeline and adding an over-sampling step

Let’s turn now to the entire pipeline with the preprocessing we’ve discussed. We would first evaluate the model’s performance without addressing the data imbalance. We’ll build a model pipeline with scikit-learn’s class Pipeline.

# make training and test sets
X = df1.loc[:, features]
y = df1.loc[:, target]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# instantiate scaler and binary encoder and define pipeline preprocessing steps
numeric_transformer = StandardScaler()
categorical_transformer = category_encoders.BinaryEncoder(cols=categorical_features)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features + binary_features),
        ('cat', categorical_transformer, categorical_features)])
# define and fit pipeline
rf_model = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', RandomForestClassifier())]).fit(X_train, y_train)
print("model score: %.3f" % rf_model.score(X_test, y_test))

model score: 0.889

The model accuracy is 89%, but as we know that only 11% of the observations are bad clients, the model could mislabel almost all bad clients as good and still maintain this accuracy score. We are more concerned with bad clients, and so we need to look at the model’s sensitivity/recall (true positive rate) rather than only its specificity (true negative rate). In order to compare the two, we can plot the ROC (Receiver Operating Characteristic) curve.

plot_roc_curve(rf_model, X_test, y_test)

As we’re aiming to have as high a true positive rate as possible and as low a false positive rate as possible, the AUC score of the current model has to be improved. We further see in the classification report below that while the classification of class 0 (good_client) is satisfactory, recall for class 1 is close to zero: the model has practically no ability to identify bad clients.

print(classification_report(y_test, rf_model.predict(X_test)))

predictions = rf_model.predict(X_test)
recall = recall_score(y_test, predictions)
roc_auc = roc_auc_score(y_test, predictions)
print(f"recall: {recall:.3f}, roc_auc: {roc_auc:.3f}")

recall: 0.022, roc_auc: 0.507

Let’s now modify our pipeline to include SMOTE. The library imblearn has a few useful classes:

  1. A dedicated Pipeline for incorporating SMOTE in the model training workflow
  2. The class that performs the over-sampling.

The basic SMOTE class only handles numeric data, so if you’re not transforming your categorical features, use SMOTENC, which is suitable for datasets with categorical features. We’ll use SMOTE as we’re binarizing our categorical_features.

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features + binary_features),
        ('cat', categorical_transformer, categorical_features)])
# SMOTE is only applied when fitting, so the test set is never resampled
rf_model_smote = SMOTEpipe([('preprocessor', preprocessor),
                            ('smote', SMOTE(random_state=42)),
                            ('classifier', RandomForestClassifier(random_state=42))]).fit(X_train, y_train)
print(f"RF model score: {rf_model_smote.score(X_test, y_test):.3f}")

RF model score: 0.865

While our accuracy has suffered a bit, we see from the classification report below that our recall for the minority class has gone up to 0.11.

print(classification_report(y_test, rf_model_smote.predict(X_test)))

predictions = rf_model_smote.predict(X_test)
recall = recall_score(y_test, predictions)
roc_auc = roc_auc_score(y_test, predictions)
print(f"recall: {recall:.3f}, roc_auc: {roc_auc:.3f}")

recall: 0.109, roc_auc: 0.532

Before we move on to model parameter optimization, let’s look at feature importance. We extract the binarized feature names from the binarizing transformer, and get the feature importance values from the ‘classifier’ step of the model pipeline.

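# The ColumnTransformer outputs numeric and binary features first, then the binary-encoded categorical columns
encoded_columns = categorical_transformer.fit_transform(X_train[categorical_features]).columns
feature_importance_df = pd.DataFrame({
    'feature': numeric_features + binary_features + list(encoded_columns),
    'importance': rf_model_smote.named_steps.classifier.feature_importances_
}).sort_values(by="importance", ascending=False)
plt.tick_params(axis='y', which='major', labelsize=9)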

We see that credit_term is less important than income and credit_amount, the other two features correlated with it. Let’s rerun this pipeline, this time without credit_term. The code chunk below provides a good overview of the entire pipeline we’ve got so far.

numeric_features = ['credit_amount', 'age', 'income']
binary_features = ['having_children_flg', 'is_client']
categorical_features = ['sex', 'education', 'product_type', 'family_status', 'phone_operator', 'region']
features = numeric_features + categorical_features + binary_features
target = 'bad_client_target'
# first make training and test sets
X = df1.loc[:, features]
y = df1.loc[:, target]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
numeric_transformer = StandardScaler()
categorical_transformer = category_encoders.BinaryEncoder(cols=categorical_features)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features + binary_features),
        ('cat', categorical_transformer, categorical_features)])
rf_model_smote = SMOTEpipe([('preprocessor', preprocessor),
                            ('smote', SMOTE(random_state=42)),
                            ('classifier', RandomForestClassifier(random_state=42))]).fit(X_train, y_train)
print("RF model score: %.3f" % rf_model_smote.score(X_test, y_test))
predictions = rf_model_smote.predict(X_test)
recall = recall_score(y_test, predictions)
roc_auc = roc_auc_score(y_test, predictions)
print(f"recall: {recall:.3f}, roc_auc: {roc_auc:.3f}")
print(classification_report(y_test, rf_model_smote.predict(X_test)))

RF model score: 0.886

We’re up to 22% recall! Remember that we started with 2% recall, and so this is an impressive improvement.

Parameter tuning with mlflow

We’ve seen how useful feature engineering, preprocessing, and the correct method of sampling are for model performance. Another way to improve the training rate and accuracy of the model is to tune its hyperparameters. Hyperparameters differ across models, so it’s always useful to understand which hyperparameters would make a difference in a model’s handling of unseen data.

Let’s take a look at the hyperparameters of the Random Forest model we’ve just trained:

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}

From among the hyperparameters given above, here are the ones that could potentially improve the machine learning process:

  • Max depth is the depth of each tree in the ‘forest’. The deeper it is, the more splits it has and therefore the more information it captures about the data. That said, a large number of splits may lead to overfitting, and so we’ll be capping tree depth at 22, which is the number of features (after binary encoding and removing credit_term).
  • Number of estimators is the number of trees in the ‘forest’. Since each tree has a subset of the features and is trained on a subset of the data, there’s no risk of overfitting with a large number of estimators. That said, a large number of trees slows down model training.
  • Max features is the maximum number of features to take into account when looking for the best split. There’s a risk of overfitting here too, but this is a metric we can track with the mlflow UI (more on that below).
  • Minimum samples per split is the minimum number of samples required to split a node. This can be represented by an integer (for the number of samples) or a float (for the proportion of samples). A larger number/proportion leads to a more constrained model, which may lead to underfitting.
  • Minimum samples per leaf is the minimum number of samples required at the base of the tree (aka the leaves). Similarly to samples per split, a large minimum sample may lead to underfitting.

There are multiple hyperparameter tuning methods: You can do it manually, by setting the hyperparameter values you’d like to test, or better, by grid search, where you set a range of values for each of the relevant hyperparameters and cycle through all the possible value combinations. A cheaper alternative is random search, where you randomly sample from the different combinations of values, saving computation and time as the search isn’t exhaustive. In our projects we use Bayesian optimization, whereby a probabilistic model maps the hyperparameter values to the objective function (e.g. maximizing recall or ROC-AUC score). This method is typically more efficient than grid and random search, as it reasons about which hyperparameter values are likely to reach an optimum based on how previously evaluated values fared.
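The difference in cost between grid search and random search is easy to see in a small sketch. The score function below is a hypothetical stand-in for "train the model and return a validation score", and the value ranges are assumptions made for the example; Bayesian optimization itself is not shown here.

```python
import itertools
import random

# Hypothetical stand-in for training a model and returning its validation score
def score(max_depth, n_estimators):
    return -((max_depth - 12) ** 2) / 100 - ((n_estimators - 700) ** 2) / 1e6

grid = {"max_depth": range(2, 23, 2), "n_estimators": range(100, 2001, 100)}

# Grid search: evaluate every combination of values
grid_trials = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
best_grid = max(grid_trials, key=lambda params: score(**params))

# Random search: evaluate a fixed budget of randomly drawn combinations
rng = random.Random(42)
random_trials = [{name: rng.choice(list(values)) for name, values in grid.items()}
                 for _ in range(30)]
best_random = max(random_trials, key=lambda params: score(**params))

print(len(grid_trials), "grid evaluations, best:", best_grid)
print(len(random_trials), "random evaluations, best:", best_random)
```

Grid search needs 220 evaluations here to guarantee it finds the best cell of the grid, while random search spends only 30 and settles for a result that is usually close. Bayesian optimization goes a step further by using the scores seen so far to decide where to look next.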

There are multiple Python libraries that will allow you to use the hyperparameter tuning methods above, but we wanted to use a platform that allows us not only to run experiments to optimize model parameters but also to record and register those in a way that allows us to compare, track and register various experiments and models, and subsequently deploy these models in our solutions. For these reasons we’ve used mlflow, a Python package developed by Databricks, which supports the entire machine learning lifecycle, with a detailed UI that allows for easy maintenance.

One catch is that we’ll need to prepare the data for mlflow instead of incorporating it into a model pipeline.

# instantiate the binary encoder and scaler
encoder = category_encoders.BinaryEncoder(cols=categorical_features)
numeric_transformer = StandardScaler()

# transform categorical features into binarized columns
X_bin = encoder.fit_transform(X)

# use SMOTE for numeric/binary variables
smote = SMOTE(random_state=42)

# train-test split of transformed dataset
X_bin_train, X_bin_test, y_train, y_test = train_test_split(X_bin, y)

# create copies of X_train and X_test to perform more preprocessing/scaling on
X_pre_train = X_bin_train.copy()
X_pre_test = X_bin_test.copy()

# scale numeric features: fit the scaler on the train set only, then apply it to the test set
X_pre_train.loc[:, numeric_features] = numeric_transformer.fit_transform(X_bin_train.loc[:, numeric_features])
X_pre_test.loc[:, numeric_features] = numeric_transformer.transform(X_bin_test.loc[:, numeric_features])

# perform over-sampling on train set only (to avoid information leakage)
X_train_bal, y_train_bal = smote.fit_resample(X_pre_train, y_train)

We’ll then train the model, logging the model, including parameters and score.

# Train a baseline model inside an mlflow run and log the fitted model.
with mlflow.start_run():

    # Set the model parameters.
    n_estimators = 100
    max_depth = None
    max_features = 'auto'
    random_state = 42

    # Describe the expected input columns (after binary encoding).
    input_schema = Schema([
        ColSpec("double", "credit_amount"),
        ColSpec("double", "age"),
        ColSpec("double", "income"),
        ColSpec("double", "sex_0"),
        ColSpec("double", "sex_1"),
        ColSpec("double", "education_0"),
        ColSpec("double", "education_1"),
        ColSpec("double", "education_2"),
        ColSpec("double", "product_type_0"),
        ColSpec("double", "product_type_1"),
        ColSpec("double", "product_type_2"),
        ColSpec("double", "product_type_3"),
        ColSpec("double", "product_type_4"),
        ColSpec("double", "family_status_0"),
        ColSpec("double", "family_status_1"),
        ColSpec("double", "phone_operator_0"),
        ColSpec("double", "phone_operator_1"),
        ColSpec("double", "phone_operator_2"),
        ColSpec("double", "region_0"),
        ColSpec("double", "region_1"),
        ColSpec("double", "having_children_flg"),
        ColSpec("double", "is_client")
    ])

    # Create and train model on the balanced (over-sampled) training set.
    rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, max_features=max_features, random_state=random_state).fit(X_train_bal, y_train_bal)

    # Use the model to make predictions on the test dataset.
    predictions = rf.predict(X_pre_test)

    recall = recall_score(y_test, predictions)
    roc_auc = roc_auc_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)

    mlflow.sklearn.log_model(rf, "rf_base")

    print(f"Recall: {recall}, ROC-AUC: {roc_auc}, F1-score: {f1}")

Recall: 0.2894736842105263, ROC-AUC: 0.5951185214945761, F1-score: 0.25

Let’s check out this model’s parameters:

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}

We’ll be using the Hyperopt library to define the search space for the parameters we’ve discussed. The search space includes function expressions that define the sampling strategy and scope. This plan will be executed when we call the function that searches for the parameter values, with the objective being to minimize the chosen loss (in our case, the negative ROC-AUC score, for reasons we explain below).

search_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 2, 22, 1)),
    'n_estimators': scope.int(hp.quniform('n_estimators', 100, 2000, 100)),
    'max_features': scope.int(hp.quniform('max_features', 3, 8, 1)),
    'min_samples_leaf': scope.float(hp.quniform('min_samples_leaf', 0.1, 0.5, 0.1)),
    'min_samples_split': scope.float(hp.quniform('min_samples_split', 0.1, 1.0, 0.1))
}

One of the main advantages of mlflow is its detailed UI for reviewing the experiments (i.e. the models with different hyperparameter combinations) and comparing them based on multiple metrics. In the optimization function train_model below, however, we have to choose one objective function to optimize by. We’ve discussed recall and ROC-AUC scores as indicative of the sensitivity of the model and the trade-off between true and false positive rates. Our challenge was that optimizing on high recall would yield a model with recall of 1.0 and ROC-AUC score of 0.5; that is, while the model didn’t yield false negatives, it did generate a lot of false positives. The ROC-AUC score turned out to be a better measure to optimize on in order to maximize sensitivity while minimizing the false positive rate. Note that we’re flipping the sign of the ROC-AUC score, as Hyperopt’s fmin function aims to minimize the objective function (usually a loss, but in this case the negative ROC-AUC score).
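The degenerate case described above is easy to reproduce: a model that labels every client as bad has perfect recall but a ROC-AUC of 0.5. The sketch below assumes hard 0/1 predictions (as in the code in this post), in which case the ROC curve has a single interior point and its area reduces to (1 + TPR - FPR) / 2; the helper name and toy labels are our own.

```python
def recall_and_auc_from_labels(y_true, y_pred):
    """Recall and ROC-AUC for hard 0/1 predictions (single-threshold ROC curve)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn)  # recall / sensitivity
    fpr = fp / (fp + tn)  # false positive rate
    return tpr, (1 + tpr - fpr) / 2

# Toy test set with 11 bad clients (1) and 89 good clients (0)
y_true = [1] * 11 + [0] * 89

# A "model" that flags every client as bad
all_bad = [1] * len(y_true)
print(recall_and_auc_from_labels(y_true, all_bad))  # (1.0, 0.5)
```

Optimizing on recall alone rewards exactly this behaviour, which is why a metric that also penalizes false positives makes for a better objective.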

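def train_model(params):
    # Log each trial as a nested run (assumption: one mlflow run per hyperopt trial)
    with mlflow.start_run(nested=True):
        rf_model_smote = SMOTEpipe([
            ('smote', SMOTE(random_state=42)),
            ('classifier', RandomForestClassifier(**params))
        ]).fit(X_pre_train, y_train)

        predictions = rf_model_smote.predict(X_pre_test)

        # Evaluate the model
        recall = recall_score(y_test, predictions)
        roc_auc = roc_auc_score(y_test, predictions)
        f1 = f1_score(y_test, predictions)

        # Log the model, its hyperparameters and its scores
        mlflow.sklearn.log_model(rf_model_smote, "rf_smote")
        mlflow.log_params(params)
        mlflow.log_metrics({
            'recall': recall,
            'roc_auc': roc_auc,
            'f1': f1
        })

    # fmin minimizes the loss, so we return the negative ROC-AUC score
    return {"loss": -1*roc_auc, "recall": recall, "status": STATUS_OK}

with mlflow.start_run() as run:
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,  # Tree-structured Parzen Estimator
        max_evals=32)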

100%|██████████| 32/32 [01:52<00:00, 3.52s/trial, best loss: -0.6843109682603454]

best_params = hyperopt.space_eval(search_space, best_params)

{'max_depth': 18,
 'max_features': 4,
 'min_samples_leaf': 0.1,
 'min_samples_split': 0.30000000000000004,
 'n_estimators': 700}

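# refit the pipeline with the best hyperparameters found by the search
rf_model_tuned = SMOTEpipe([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(**best_params))
]).fit(X_pre_train, y_train)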

predictions = rf_model_tuned.predict(X_pre_test)

# Evaluate the model
recall = recall_score(y_test, predictions)
roc_auc = roc_auc_score(y_test, predictions)

print(f"Recall: {recall:.3f}\nROC AUC score: {roc_auc:.3f}")

Recall: 0.605
ROC AUC score: 0.684

We can view the runs on the mlflow UI by running !mlflow ui on your notebook. You’ll get a view similar to the one below, with details about when the model was run, and any metrics and parameters logged. You’ll see that the best model optimized for highest ROC-AUC score happens to have (among) the best recall and F1 scores.


print(classification_report(y_test, rf_model_tuned.predict(X_pre_test)))

precision recall f1-score support

feature_importance_df = pd.DataFrame({
    'feature': X_pre_train.columns,
    'importance': rf_model_tuned.named_steps.classifier.feature_importances_
}).sort_values(by="importance", ascending=False)


Final thoughts

In our Applied AI and predictive maintenance solutions we’ve managed to deliver accurate models that add value for our customers by refining every step of the machine learning life cycle: feature engineering, preprocessing, sampling methods, hyperparameter optimization, and appropriate model evaluation metrics. In the toy dataset we’ve used here we were able to improve the recall from 2% to roughly 60% in labeling a class for which we have few observations. Our job is not done though. There’s more insight to be extracted from the model we’ve trained by investigating the effect of various features on model predictions. Many customers find model explanation just as valuable as a smoothly-run predictive model pipeline. With a model that is well attuned to the business questions and the characteristics of the data, we can confidently provide more insight suited to our customers’ needs.

