Cloud Data Fusion
A 'Google Next 2019' technology
Recently, I had the pleasure of attending Google Next 2019, an annual conference in San Francisco focused on Google Cloud. During the conference many exciting announcements were made, especially in the Big Data, AI, and Analytics space.
I noticed that quite a few of the announcements focused on building data pipelines, or ML models, without writing any code. In a series of two blog posts, I will use a few of the new features to build out a churn prediction pipeline, writing as little code as possible.
Customer churn happens when customers stop using your product or service. The goal of churn prediction is to identify customers that are likely to churn. This insight can be used to prevent churn from happening, by taking preemptive measures.
In this article, I will talk about what Cloud Data Fusion is and how you can build a data pipeline with it. Personally, it is one of the Google announcements that excited me the most. It allows you to develop, deploy, and monitor data pipelines without having to write any code. Like many other ETL tools, it lets you develop pipelines in a drag-and-drop manner. Data Fusion is based on CDAP, an open-source project developed by the company Cask, which Google acquired about a year ago.
Data Fusion Concepts
Before we begin, we need to make a quick overview of the main Data Fusion concepts:
- sources/sinks: Sources and sinks are connectors to various data stores, such as relational databases, message queues, or file storage systems. Data Fusion comes with a large number of connectors out of the box, and it may come as no surprise that many of the Google Cloud native storage services are supported.
- transforms: Transforms are a way to manipulate your data within Data Fusion. An example of a transform would be parsing lines in a CSV file, or obfuscating a field in a record.
If the default installation of Data Fusion doesn’t have the source, sink, or transform that you are looking for, it might be available in the Data Fusion Hub. The Hub is a marketplace where you can find various plugins and example data pipelines. Although Data Fusion aims to be a code-free solution, plugins that enable you to run arbitrary code within your pipeline are still available, through the Apache Spark plugin, for example.
Building a Data Pipeline
Getting started with Data Fusion is pretty straightforward: simply follow the steps described in the documentation. Just enable the API, create an instance, and set up the correct IAM permissions. The whole process takes less than 20 minutes.
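The setup steps above boil down to a couple of commands. A rough sketch, assuming the gcloud CLI is installed and authenticated; the instance name and region are placeholders, and the exact command group and flags may differ depending on your gcloud version:

```shell
# Enable the Data Fusion API for the current project
gcloud services enable datafusion.googleapis.com

# Create a Data Fusion instance (name and region are placeholders;
# at the time of writing these commands live under `gcloud beta`)
gcloud beta data-fusion instances create my-instance \
    --location=us-central1
```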
Before we can start to build our first data pipeline, we’ll need to make the Telecom Churn dataset available to Data Fusion. I chose to upload the CSV file to a Google Cloud Storage bucket, but there are many other data sources available, such as various databases or message queues.
After we’ve made the data available on Google Cloud Storage, we can finally start using Data Fusion to build a data pipeline. We’ll start with the Wrangler UI. Wrangler is one of the most elaborate transforms in the Data Fusion ecosystem: it allows you to transform and inspect your dataset in an intuitive manner.
In order to load your data into Wrangler you’ll need to select the Wrangler UI from the Data Fusion navigation bar and select the correct CSV file from Google Cloud Storage.
Once you’ve selected the CSV file, Wrangler will display a number of rows from the dataset.
You’ve probably already noticed that all of the data is loaded in a single column named ‘body’. Let’s parse the data in CSV format for a better overview.
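To make concrete what the CSV-parsing step does, here is a small Python sketch, not what Data Fusion runs internally, using a made-up three-row sample in place of the raw ‘body’ column:

```python
import csv
import io

# A tiny stand-in for the raw 'body' column; the real input is the
# Telecom Churn CSV file on Google Cloud Storage.
raw = "state,account_length,churn\nKS,128,no\nOH,107,no\n"

# Roughly what `parse-as-csv :body ',' true` does: split each line on
# commas and treat the first row as a header, yielding named columns.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0])
```

After this step, each record has one field per column instead of a single ‘body’ field.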
The ‘body’ column is no longer needed, so let’s drop it. Additionally, it’s good practice to normalize the names of your columns. Typically, this means no special characters or whitespace, and a single (lower) case. Wrangler has a convenient directive for this, called cleanse-column-names, which can be applied through the UI or the Wrangler CLI.
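As a rough approximation of what such a normalization does (the function name and exact rules here are my own, not Wrangler’s implementation):

```python
import re

def cleanse_column_name(name: str) -> str:
    """Approximate the cleanse-column-names directive: lowercase the
    name and replace any character that is not a letter, digit, or
    underscore with an underscore."""
    return re.sub(r"[^a-z0-9_]", "_", name.strip().lower())

columns = ["Account Length", "Total Day-Minutes", "Voice Mail Plan"]
print([cleanse_column_name(c) for c in columns])
# ['account_length', 'total_day_minutes', 'voice_mail_plan']
```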
Furthermore, you might have noticed that all columns have the same datatype. By default, a column gets the string datatype. This is easily corrected through the Wrangler UI or CLI. Note that some columns contain only the values ‘yes’ and ‘no’. To convert these columns to a boolean datatype, we first have to apply another directive, called find-and-replace.
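The two-step conversion (find-and-replace, then set-type) can be sketched in plain Python; this is an illustration of the idea, with a hypothetical helper name, not Wrangler’s actual code:

```python
def yes_no_to_bool(value: str) -> bool:
    """Mirror find-and-replace s/^yes$/true/ and s/^no$/false/
    followed by set-type boolean."""
    normalized = value.strip().lower()
    if normalized == "yes":
        return True
    if normalized == "no":
        return False
    raise ValueError(f"unexpected value: {value!r}")

# A hypothetical record from the churn dataset
record = {"international_plan": "yes", "voice_mail_plan": "no"}
converted = {k: yes_no_to_bool(v) for k, v in record.items()}
print(converted)  # {'international_plan': True, 'voice_mail_plan': False}
```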
Next to the Wrangler data view, there’s a view called Insights. The Insights view enables you to quickly build visualizations in a drag-and-drop manner. We will use it to determine whether the dataset contains any highly correlated fields. To do this, you add the quantitative wildcard fields to the Encoding pane twice (see screenshot below). This shows scatter plots for every combination of the quantitative fields.
We found that the dataset contains a number of highly correlated fields, such as total_day_minutes and total_day_charge. Depending on the type of classification algorithm you are using, highly correlated fields might affect the performance of your model in a negative way. We’ll remove a number of fields from the dataset to prevent having any highly correlated fields.
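To see why two such fields come out almost perfectly correlated, here is a small self-contained Python sketch (the per-minute rate and the sample values are made up): when charge is simply minutes times a flat rate, the Pearson correlation is 1.0, so one of the two fields adds no information.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sample: charge is minutes times a flat per-minute rate,
# so the two columns carry exactly the same information.
total_day_minutes = [180.0, 210.5, 95.2, 160.7, 243.4]
total_day_charge = [m * 0.17 for m in total_day_minutes]

r = pearson(total_day_minutes, total_day_charge)
print(round(r, 3))  # a perfectly linear relationship gives r = 1.0
```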
The final script of directives in Wrangler will look as follows:
```
parse-as-csv :body ',' true
set-type :account_length integer
set-type :number_vmail_messages integer
set-type :total_day_calls integer
set-type :total_eve_minutes double
set-type :total_eve_calls integer
set-type :total_night_minutes double
set-type :total_night_calls integer
set-type :total_intl_minutes double
set-type :total_intl_calls integer
set-type :customer_service_calls integer
set-type :churn boolean
find-and-replace international_plan s/^yes$/true/g
find-and-replace international_plan s/^no$/false/g
set-type :international_plan boolean
find-and-replace voice_mail_plan s/^yes$/true/g
find-and-replace voice_mail_plan s/^no$/false/g
set-type :voice_mail_plan boolean
```
After completing our Wrangler transform, we end up with a pipeline that looks like this:
We want our data to be available in BigQuery, so we need to add the BigQuery sink:
After the BigQuery sink is connected, we have to supply some additional properties, such as reference name, dataset and table.
If you look at the left-hand side of your screen, you can see that the input schema has been automatically added. The output schema hasn’t, although by default it will use the same schema. Once completed, we can see our complete data pipeline.
Finally, it’s time to deploy and run our pipeline. Data Fusion starts by provisioning a Dataproc cluster to run your data pipeline on. Once your job has finished, it automatically shuts down and destroys the cluster to save resources. While your pipeline is running, you have tools to inspect its progress: you can see how many elements have been processed, check any errors that occurred, and browse all the log messages for debugging. After the job has run, we can see that the dataset has appeared in BigQuery.
In this blog post we used Cloud Data Fusion to build an ETL pipeline without writing a single line of code. We loaded a CSV file from Google Cloud Storage, made some transformations and loaded the data into BigQuery.
I was really impressed by Wrangler, because it made transforming the CSV file a rapid and straightforward process. I felt the visualizations tab could be improved a little: it took me a while to figure out how to create scatter plots for all field combinations at once, without manually combining each pair of fields.
I believe Cloud Data Fusion is a great tool for any data professional in the organisation who wants to rapidly develop data pipelines, without needing thorough knowledge of cloud, data sources, or software engineering.
In the next blog post, we will have a look at AutoML Tables and build a churn prediction model using the data we’ve loaded into BigQuery.