Data Lineage Tracking and Visualization tool for Apache Spark
The work of a data engineer typically consists of making code received from data analysts or data scientists ready for production: it has to run on Apache Spark and it has to scale. Once the code is in production, the data engineer moves on to a different task. The code typically involves a lot of number-crunching. As time passes (weeks or even months), the people analyzing the numbers come back with questions about how those numbers were calculated. Answering them from memory or by re-reading old code is a nearly impossible task for a data engineer, so tooling that records how the data was produced is needed.
The process of keeping track of and visualizing data manipulation is called data lineage. It is useful for people working with the data, and with legal regulations like GDPR coming in May 2018 it will become a must. For financial institutions, data lineage is one of the most significant problems when using big data tools, mostly because they are obligated to produce transparent reports and to show how numbers are derived. To streamline this process, a team of developers at Barclays Africa Group Limited has developed a tool that integrates into Apache Spark: Spline (Spark Lineage). Spline is the subject of this article.
Spline stores data lineage records in a database and provides a web UI that visualizes this information. It therefore consists of two parts:
- A core library that sits on drivers, capturing data lineages from the jobs being executed by analyzing Spark execution plans.
- A Web UI application that visualizes the stored data lineages.
It solves three issues:
- Meets regulatory requirements: e.g. for South African banks (BCBS 239) or GDPR.
- Documentation of business processes and logic: Business analysts gain insight into Spark processes and can verify whether their rules are applied properly. In addition, it is beneficial to have up-to-date documentation with which they can refresh their knowledge of a project.
- Identification of performance bottlenecks: For a data engineer, it is of key importance to know which transformations and actions are applied. This tool can give more insight and help developers with performance optimization.
Some key findings:
- Spline only registers computations on dataframes; this means there is no support for Spark ML (MLlib).
- Spline records Spark SQL statements.
- Spline records UDFs, but they are presented in an unclear way.
Therefore, Spline is only suitable for (simple) dataframe applications.
Case 1 (Dataframes: Filter and union)
The first case consists of several stages: reading a file, applying transformations, writing the result to a file, reading that result back from the file, applying new transformations, and saving the outcome to a file again. The figure below shows a high-level overview as seen in Spline. It immediately makes clear that the application consists of two parts.
Spline then gives the ability to have a more in-depth look. The first figure below is the first stage, and the second figure the second stage. The first stage consists of several filters, application of a union, and writing it to a file.
Clicking on a filter shows (on the right-hand side) which filter conditions are applied to the dataframe.
The second stage is shown below. The select statements are considered transformations and are shown in the ‘Project’ icon.
Case 2 (Handling UDF + SQL)
For the second case, the idea was to see how Spline records and visualizes UDFs and SQL commands. Below is a figure of the high-level overview.
A more in-depth overview follows. The SQL statement is recorded and shown in the ‘Project’ icon. Clicking on the ‘Project’ icon shows the UDFs among the transformations, but not in a clear fashion. This seems to be an issue with the front-end app. The SQL is recorded and tracked, and thus supported.
Case 3 (Machine Learning)
For the third case, some machine learning code is applied.
The first time the code ran, Spline didn’t register anything, because no dataframe was written. Code was then added to save the linear regression intercept as a dataframe, which is written to a Parquet file. The high-level overview shown by Spline can be seen in the figure below.
Now, it is very unclear what Spark is doing. The name of the Spark job is MachineLearning, which could give an indication, but otherwise, it is unclear. Here is a more detailed view.
Looking at ‘Project’ shows a value named lrModel, which hints at a linear regression because of the variable names used in the code, but otherwise it is not clear what the Spark job does.
Spline is still considered a “young” tool, so it does not guarantee support for all data sources, transformations and actions. Features that are currently missing:
- Support for other Spark data sources: Spark Streaming, Spark ML
- Enterprise features: authentication (Kerberos), authorization, user management
- Interoperability with other applications, e.g. Informatica
- Export to PDF
- Performance and error logging
- Lineage on row level (although I think that is technically not possible)
How it works (High level)
A Spark application starts with the initialization of a SparkSession. The next step is to read data from a source (e.g. Kafka, Hive, or a CSV/Parquet file). On this data, transformations describing the business logic are specified, and actions are performed to obtain the results. When transformations are specified, Spark generates execution plans. It is in these execution plans that all the lineage information is recorded.
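To make the difference between transformations and actions more tangible, here is a toy model in plain Python. It is not Spark's API: `PlanNode`, `LazyFrame`, and the sample rows are hypothetical names and data chosen purely for illustration. The only point it tries to make is that transformations merely extend a plan, so the full plan is known before any row is processed, which is why lineage can be read from it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanNode:
    """One step in a toy logical plan (hypothetical, not a Spark class)."""
    op: str                            # e.g. "Read", "Filter", "Project"
    detail: str                        # human-readable description of the step
    child: Optional["PlanNode"] = None


class LazyFrame:
    """Toy stand-in for a dataframe: transformations only extend the plan."""

    def __init__(self, rows, plan, steps=()):
        self._rows = rows
        self.plan = plan
        self._steps = steps            # deferred functions, applied only by an action

    @classmethod
    def read(cls, name, rows):
        return cls(rows, PlanNode("Read", name))

    def filter(self, description, predicate):
        step = lambda rows: [r for r in rows if predicate(r)]
        node = PlanNode("Filter", description, self.plan)
        return LazyFrame(self._rows, node, self._steps + (step,))

    def select(self, *columns):
        step = lambda rows: [{c: r[c] for c in columns} for r in rows]
        node = PlanNode("Project", ", ".join(columns), self.plan)
        return LazyFrame(self._rows, node, self._steps + (step,))

    def explain(self):
        """Return the plan as text, outermost step first."""
        lines, node, depth = [], self.plan, 0
        while node is not None:
            lines.append("  " * depth + f"{node.op}({node.detail})")
            node, depth = node.child, depth + 1
        return "\n".join(lines)

    def collect(self):
        """Action: only now are the deferred steps applied to the data."""
        rows = self._rows
        for step in self._steps:
            rows = step(rows)
        return rows


people = LazyFrame.read("people.csv", [
    {"name": "Ann", "age": 34, "city": "Utrecht"},
    {"name": "Bob", "age": 17, "city": "Delft"},
])
adults = people.filter("age >= 18", lambda r: r["age"] >= 18).select("name", "city")

print(adults.explain())   # the plan is available before any row is processed
print(adults.collect())   # the action triggers the actual work
```

In real Spark the plan is far richer (logical, optimized, and physical plans), but the ordering is the same: the plan exists first, and execution follows when an action is called.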
Within Spark there is a publish/subscribe mechanism: a listener manager that accepts listeners. A registered listener receives all the information that is in the execution plan. Spline analyzes the logical plan to gather lineage information. This information is exported and saved in MongoDB. It is also possible to incorporate this lineage tracking into Apache Atlas.
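The listener mechanism described above can be imitated in a few lines of plain Python. This is a toy model, not Spark's or Spline's API: `ListenerManager`, `LineageListener`, the nested-dictionary plan format, and the file paths are all hypothetical. In Spark itself the corresponding pieces are, as far as I know, `spark.listenerManager` and the `QueryExecutionListener` interface. A Python list stands in for MongoDB here.

```python
import json


class ListenerManager:
    """Toy publish/subscribe hub (hypothetical stand-in for Spark's listener manager)."""

    def __init__(self):
        self._listeners = []

    def register(self, listener):
        self._listeners.append(listener)

    def publish(self, job_name, plan):
        # Called once per executed action, with the plan of that action.
        for listener in self._listeners:
            listener.on_success(job_name, plan)


def walk(plan):
    """Yield every node of a nested-dict plan, parents before children."""
    yield plan
    for child in plan.get("children", []):
        yield from walk(child)


class LineageListener:
    """Extracts sources, target and operations from a plan and stores them."""

    def __init__(self, store):
        self.store = store   # a list here; assumed stand-in for a database

    def on_success(self, job_name, plan):
        nodes = list(walk(plan))
        record = {
            "job": job_name,
            "sources": sorted(n["path"] for n in nodes if n["op"] == "Read"),
            "target": plan["path"] if plan["op"] == "Write" else None,
            "operations": [n["op"] for n in nodes],
        }
        self.store.append(record)


# A made-up plan shaped like a "filter, union, write" job.
plan = {
    "op": "Write", "path": "out/result.parquet", "children": [
        {"op": "Union", "children": [
            {"op": "Filter", "condition": "amount > 100",
             "children": [{"op": "Read", "path": "in/a.csv"}]},
            {"op": "Filter", "condition": "amount < 0",
             "children": [{"op": "Read", "path": "in/b.csv"}]},
        ]},
    ],
}

store = []
manager = ListenerManager()
manager.register(LineageListener(store))
manager.publish("Case1", plan)

print(json.dumps(store[0], indent=2))
```

The listener never touches the data itself. Everything in the stored record comes from walking the plan, which matches the idea that lineage is derived from execution plans rather than from the rows.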
Requirements for a quick start
- Java 8
- Scala 2.11
- Spark 2.2.0
- MongoDB 3.4 (required for Spline Web UI). Make sure to configure a Windows Service.
- Spline Web UI executable JAR
The web UI JAR can be downloaded from here.
Run it using the following command:
The easiest and fastest way to get up and running is to clone the sample jobs.