
AutoML Tables

A 'Google Next 2019' technology

This is the second article in a two-post series in which we play around with some of the tools announced at Google Next 2019. Our goal is to build a churn prediction pipeline while writing as little code as possible.

In the previous blog post we built an ETL pipeline using Cloud Data Fusion, without writing a single line of code. We took a CSV file stored on Google Cloud Storage, applied a couple of transformations and loaded the data into BigQuery. Voila!

In this article I will be working with AutoML Tables. Let’s get started.

AutoML Tables

AutoML Tables is the latest in a range of supervised learning services offered by Google Cloud Platform. It allows you to train and deploy state-of-the-art machine learning models on structured datasets. Building a model follows a predefined workflow: load your data, prepare it, train a model, evaluate the model, and finally deploy it.

Let’s walk through the steps in the workflow one by one to see how AutoML Tables tries to create a machine learning model from scratch.

Loading a Dataset

The first step in creating a machine learning model with AutoML Tables is creating a curated dataset. There are currently two options for selecting your data source: a CSV file on Google Cloud Storage or data stored in BigQuery.
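For reference, this step can presumably be scripted instead of clicked through. The sketch below assumes the v1beta1 REST endpoints for creating a dataset and importing data; the project ID, dataset ID and table names are placeholders:

# Create an empty AutoML Tables dataset (placeholder project and display name).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"displayName": "churn_dataset", "tablesDatasetMetadata": {}}' \
  https://automl.googleapis.com/v1beta1/projects/my-project/locations/us-central1/datasets

# Import data from BigQuery (swap in "gcsSource": {"inputUris": [...]} for a CSV on GCS).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"inputConfig": {"bigquerySource": {"inputUri": "bq://my-project.telco.churn"}}}' \
  https://automl.googleapis.com/v1beta1/projects/my-project/locations/us-central1/datasets/DATASET_ID:importData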

Initially, I tried to import the Telecom Churn dataset from CSV, but I ran into an error because the file didn't use the correct formatting. Since neither the error message nor the documentation was clear on how to resolve the issue, I ended up using the BigQuery dataset we prepared in the previous blog post. I did notice afterwards that updating the CSV file headers so they don't contain spaces solves the problem.
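If the culprit is indeed spaces in the header row, a quick fix is to rewrite just that first line before uploading. A minimal sketch, assuming a local copy of the file (the filename is a placeholder):

# Replace spaces with underscores in the first (header) line only.
sed -i '1s/ /_/g' churn.csv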

[Image: churn dataset]

After importing your data, AutoML will attempt to determine the schema automatically. You'll be redirected to a page where you can visually inspect the schema of your dataset and make changes if necessary. When looking at the schema, you might be surprised by the datatypes you find there. Instead of the typical datatypes, such as int, float, bool, text or date, AutoML has its own datatypes, which are more in line with how the columns should be interpreted and used during model training. The churn dataset used here only contains numeric and categorical columns, but there are a number of other datatypes, such as timestamp and text.

[Image: data types]

For data loaded from BigQuery, schema inference seems trivial, since BigQuery has its own schema, but there are some caveats. The datatypes in the BigQuery schema might be open to more than one interpretation: numeric values might be stored as strings, categorical data could be encoded as numeric values, or the schema might contain dates that actually represent categorical data (a year column, for example). I haven't tried out these particular cases myself, but AutoML will try to infer the right datatype for each column, as well as whether NULL values are allowed. In my case it managed to correctly identify all of the columns in the dataset, so no corrections were necessary.
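If you do want to correct an inferred type outside the UI, the v1beta1 API appears to expose the schema as column specs under the dataset's table spec. A hedged sketch, with all IDs as placeholders:

# Override an inferred datatype, e.g. force a numeric-looking column to CATEGORY.
# Column specs can be listed first via GET .../datasets/DATASET_ID/tableSpecs/TABLE_ID/columnSpecs.
curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"dataType": {"typeCode": "CATEGORY"}}' \
  "https://automl.googleapis.com/v1beta1/projects/my-project/locations/us-central1/datasets/DATASET_ID/tableSpecs/TABLE_ID/columnSpecs/COLUMN_SPEC_ID?updateMask=dataType"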

On the same page that contains the schema, you need to select the target column. The target column defines the value you want your model to predict, and it also determines what type of model will be trained: a classification model when the target is categorical, and a regression model when it is numeric. In this case the target column is categorical, so AutoML will train a classification model.
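In the API, the target column appears to be part of the dataset metadata, so selecting it programmatically would look roughly like this (a sketch; the IDs are placeholders):

# Point the dataset's target at the churn column's spec ID.
curl -X PATCH \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"tablesDatasetMetadata": {"targetColumnSpecId": "COLUMN_SPEC_ID"}}' \
  "https://automl.googleapis.com/v1beta1/projects/my-project/locations/us-central1/datasets/DATASET_ID?updateMask=tablesDatasetMetadata.targetColumnSpecId"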

Analyze

The next step in the workflow is to analyze our dataset. The Analyze tab shows some basic statistics for every column, such as the percentage of missing values, the number of distinct values and the correlation with the target column.

[Image: analyze dataset]

The information on this page allows you to spot mistakes in your data. A high correlation with the target column, for example, might indicate some form of target leakage. The page also allows you to detect columns with high cardinality, such as ID columns, which should be excluded from training. These columns introduce noise into your dataset, which generally will lead to degraded model performance.

In the demo at Google Next it was shown that you can also see the correlation between individual features, but sadly I didn't manage to find this specific overview.
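That said, the v1beta1 column specs do seem to carry a topCorrelatedColumns field, so the information might be retrievable through the API even if the UI doesn't surface it. A hedged sketch, with placeholder IDs:

# Fetch each column's display name together with its most correlated columns.
curl -s \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  "https://automl.googleapis.com/v1beta1/projects/my-project/locations/us-central1/datasets/DATASET_ID/tableSpecs/TABLE_ID/columnSpecs?fieldMask=displayName,topCorrelatedColumns"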

Train

Once you're happy with your dataset, it's time to train the model. The Train tab only offers a few options, which allow you to set the training budget and select the feature columns you want to use for training. AutoML Tables automatically divides your dataset into three splits, which are used to train the model, tune hyperparameters and evaluate the result.

[Image: model evaluation]
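Kicking off training through the API should boil down to a single call. A sketch assuming the v1beta1 models endpoint; note the budget field is expressed in milli node hours, so 1000 corresponds to one hour (IDs and names are placeholders):

# Train a model on the dataset with a budget of one node hour.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "displayName": "churn_model",
    "datasetId": "DATASET_ID",
    "tablesModelMetadata": {"trainBudgetMilliNodeHours": 1000}
  }' \
  https://automl.googleapis.com/v1beta1/projects/my-project/locations/us-central1/models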

This step can take a long time, depending on how many hours of budget you've selected, so it's the perfect opportunity for a coffee break. I was pleasantly surprised to find out you receive a notification email when training has completed. What I was less enthusiastic about is the lack of feedback during training: I was expecting to see some intermediate metrics on the model's performance, but the only thing I saw was the message below.

[Image: training]

Evaluate

After your model has been trained, AutoML uses the test split to evaluate it. The evaluation overview contains metrics such as precision, recall and ROC AUC.

[Image: test split]
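The same metrics should also be retrievable via the API, which is handy if you want to log them somewhere. A sketch with placeholder IDs:

# List the evaluation metrics computed on the test split.
curl -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  https://automl.googleapis.com/v1beta1/projects/my-project/locations/us-central1/models/MODEL_ID/modelEvaluations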

It also shows you the confusion matrix, which is great for identifying the situations where misclassifications occur.

[Image: confusion matrix]

The last thing the evaluation page shows is feature importance, which tells you which features contribute most to the predictions your model makes. Use this chart to double-check your model and see whether the features it relies on make sense for your data.

[Image: feature importance]

Deploy

Once the model has been trained and evaluated, it's time to use it to make predictions on real data. Here you can choose between online predictions, for real-time use cases, and batch predictions. Since we already have our data stored in BigQuery, we simply select the dataset for which we want predictions. An advantage of batch predictions is that your model doesn't have to be deployed, which can be a cost saver.

[Image: input dataset]
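A batch prediction against BigQuery appears to come down to a single API call. A sketch assuming the v1beta1 batchPredict endpoint; the project, table and model IDs are placeholders:

# Score a BigQuery table and write the predictions back to BigQuery.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "inputConfig": {"bigquerySource": {"inputUri": "bq://my-project.telco.churn_to_score"}},
    "outputConfig": {"bigqueryDestination": {"outputUri": "bq://my-project"}}
  }' \
  https://automl.googleapis.com/v1beta1/projects/my-project/locations/us-central1/models/MODEL_ID:batchPredict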

If you do need real-time predictions, you first have to deploy your model, which is as simple as pressing a button. Presumably the same action is exposed through the REST API; below is a sketch assuming the v1beta1 deploy endpoint, with the project and model ID as placeholders.
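# Deploy the trained model so it can serve online predictions.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  https://automl.googleapis.com/v1beta1/projects/my-project/locations/us-central1/models/MODEL_ID:deploy

After your model is deployed, you can use the AutoML API to make predictions. The interface provides you with an example you can try out using curl: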

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{
    "payload": {
      "row": {
        "values": [
          true,
          286.7,
          "34",
          false,
          "115",
          "82",
          153.2,
          "3",
          4.7,
          "121",
          "3",
          "77",
          "232.6"
        ],
        "columnSpecIds": [
          "5134077186923298816",
          "522391168495910912",
          "1387082296951046144",
          "8304611324592128000",
          "3692925306164740096",
          "5998768315378434048",
          "5422307563075010560",
          "3116464553861316608",
          "1963543049254469632",
          "7728150572288704512",
          "6575229067681857536",
          "4269386058468163584",
          "8881072076895551488"
        ]
      }
    }
  }' \
  https://automl.googleapis.com/v1beta1/projects/next19-pocs/locations/us-central1/models/TBL3948434216273838080:predict

I must say I didn't find the example very clear. Some of the columns have string values, whereas we clearly defined them to be numeric in the schema. Also, because the columns are referenced only by opaque columnSpecIds, it is hard to tell which values correspond to which columns.
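One way to decode the payload is to list the column specs for the dataset and match each columnSpecId to its display name. A sketch (the dataset and table IDs are placeholders, and jq is assumed to be installed):

# Print "columnSpecId -> displayName" for every column in the table.
curl -s \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  https://automl.googleapis.com/v1beta1/projects/next19-pocs/locations/us-central1/datasets/DATASET_ID/tableSpecs/TABLE_ID/columnSpecs \
| jq -r '.columnSpecs[] | (.name | split("/") | last) + " -> " + .displayName'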

Conclusion

In this article we used AutoML Tables to build a churn prediction model without needing any prior knowledge of machine learning or writing any code.

While using AutoML Tables I ran into some minor issues, but all in all I found it worthwhile. I do think it could benefit from some improvements in the model evaluation phase: there are tools that allow you to evaluate your model, but it isn't immediately clear how to interpret the results, or what to do when a model performs poorly.

I believe it could be a valuable tool for taking your first steps with machine learning, or for checking whether ML adds value to your use case. Building a model takes remarkably little effort, so I'd definitely recommend giving it a try!
