Two weeks ago, some colleagues and I attended the Spark+AI Summit in London. The conference was all about Apache Spark and the Databricks Unified Analytics Platform. I’m excited that the focus of this year’s edition shifted from purely engineering topics to include data science and AI as well. With that in mind, I wanted to share some of the most exciting news from this year.
A bit of history
Reynold Xin, one of the founders of Apache Spark, opened the summit with a great keynote, which you can watch here. He gave an overview of how Apache Spark evolved over the years, from its origins in the Netflix Prize challenge, to the early releases of the core engine, the DataFrame API and later the Structured Streaming API. It was a very interesting talk that gave some insight into what got them working on this project.
One of the main announcements was Project Hydrogen. The project aims to substantially improve the performance and fault recovery of distributed deep learning and machine learning frameworks on Spark. It does this by enhancing both the data exchange and the execution model.
I can say from my own experience with PySpark that using UDFs carries severe performance penalties, largely due to the row-at-a-time data exchange between Spark and Python. I was happy to see the vectorized execution possibilities in the Spark 2.3 release, which were a great improvement. As for the Spark execution model, Reynold mentioned that there’s a fundamental incompatibility between the original Spark execution model and the nature of distributed machine learning workloads. Typical data prep workloads are embarrassingly parallel, so computation can easily be scaled out to multiple completely independent tasks. Many machine learning frameworks, on the other hand, require coordination between tasks, so tasks are not independent. This has implications for the fault tolerance of your system, since when one task crashes, all tasks that depend on it have to be rerun.
The above two challenges are exactly what Project Hydrogen aims to resolve: optimizing data exchange and unifying the Spark and ML execution models. We’ve already seen some of these improvements, such as vectorized UDFs, which can be defined to operate on chunks of data rather than one row at a time, and gang scheduling, which lets you schedule a set of tasks in an all-or-nothing fashion. We will likely see more of these developments in future releases.
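To make the data-exchange point concrete, here is a rough illustration in plain NumPy (standing in for the Spark-to-Python boundary; the actual feature is pandas UDFs operating on Arrow batches, which this only approximates):

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)

# Row-at-a-time: every element crosses the Python function-call boundary
# individually -- analogous to the per-row serialization a classic
# PySpark UDF pays when data moves between the JVM and Python.
row_at_a_time = np.array([x + 1.0 for x in values])

# Vectorized: a single call processes the whole batch, analogous to a
# pandas UDF receiving a chunk of rows at once.
vectorized = values + 1.0

assert np.array_equal(row_at_a_time, vectorized)
```

Both produce the same result; the difference is purely in how many times the per-call overhead is paid, which is exactly where the row-at-a-time UDFs lose out.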
The second keynote was given by Matei Zaharia, one of the original creators of Apache Spark. He spoke about MLflow, an open-source platform Databricks introduced at the Spark Summit last June. You can find his talk here. The platform is built to manage the end-to-end machine learning lifecycle, which typically consists of the following steps: raw data collection, data prep, model training, and model deployment. Each of these steps comes with its own frameworks, tools and technical challenges (such as tuning your models, scaling the training of ML models, and generating new predictions once deployed).
Other challenges in the ML lifecycle are non-technical and of a more organizational nature. The team responsible for developing the ML models, for example, is often different from the team that has to deploy, monitor and support them in production.
MLflow has a number of components that try to tackle some of these challenges, and I was really impressed with the work that has been done so far.
A big topic during the Spark Summit was productionizing Machine Learning models. Several companies discussed the technical and organizational challenges that Matei Zaharia mentioned during his opening keynote.
One of the challenges Sim Simeonov (Swoop) talked about was that of collaboration between team members during the ML lifecycle, in particular when it comes to working with the Apache Spark cache function. According to him there are a number of issues:
- Caching is hard to reuse across clusters when multiple team members work on the same data but on different clusters.
- Lifecycle management of data frames is problematic, e.g. deciding when to cache or unpersist.
- There is no control over the freshness or staleness of the data, because the data source might have been updated after you called cache on your data frame.
Sim introduced spark-alchemy, a framework he’s been working on, which might be open-sourced in the near future. The framework comes as a response to the issues stated above and delivers some great additions to the Apache Spark project. I was really excited by the freshness/staleness feature, since I ran into this problem myself a number of times.
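The staleness issue is easy to reproduce: once data is cached, it keeps being served even after the underlying source changes. As a purely hypothetical sketch of one way to detect this (plain Python, fingerprinting a source file by its modification time and size; spark-alchemy hasn’t been released yet, so this is not its actual approach):

```python
import os
from typing import Any, Callable, Tuple


def source_fingerprint(path: str) -> Tuple[float, int]:
    """A cheap staleness signal: (mtime, size) of the backing file."""
    stat = os.stat(path)
    return (stat.st_mtime, stat.st_size)


class CachedTable:
    """Caches loaded data, remembering what the source looked like at cache time."""

    def __init__(self, path: str, loader: Callable[[str], Any]):
        self.path = path
        self.loader = loader
        self._data = loader(path)
        self._fingerprint = source_fingerprint(path)

    def is_stale(self) -> bool:
        # The source changed since we cached if its fingerprint moved.
        return source_fingerprint(self.path) != self._fingerprint

    def data(self, refresh_if_stale: bool = True) -> Any:
        if refresh_if_stale and self.is_stale():
            self._data = self.loader(self.path)
            self._fingerprint = source_fingerprint(self.path)
        return self._data
```

The point is simply that the cache records enough about the source to notice when its copy has gone stale, which is the guarantee plain Spark cache doesn’t give you.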
Another interesting talk in this area was by Brandon Hamric and Alex Meyer from Eventbrite. They explained how they standardized part of the ML lifecycle by providing a code template that team members can use to define some of the steps commonly found in the ML workflow (data prep, feature extraction, model training and prediction for both batch and streaming). Using a common interface for all of these steps allows the data scientists at Eventbrite to package and deploy their own models.
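A common interface along those lines could look roughly like the sketch below (hypothetical names; the actual Eventbrite template wasn’t shown in full, so this only illustrates the idea of a shared contract):

```python
from abc import ABC, abstractmethod
from typing import Any, List


class MLWorkflow(ABC):
    """Template for the lifecycle steps every model implements.

    Anything conforming to this interface can be packaged and deployed
    by shared tooling, regardless of what the model does internally.
    """

    @abstractmethod
    def prepare_data(self, raw: Any) -> Any:
        """Clean and normalize the raw input."""

    @abstractmethod
    def extract_features(self, prepared: Any) -> Any:
        """Turn prepared records into model features."""

    @abstractmethod
    def train(self, features: Any) -> Any:
        """Fit and return a model artifact."""

    @abstractmethod
    def predict(self, model: Any, features: Any) -> Any:
        """Score new features with a trained model."""


class MeanBaseline(MLWorkflow):
    """A trivial concrete workflow, just to show the contract in use."""

    def prepare_data(self, raw: List[str]) -> List[float]:
        return [float(x) for x in raw]

    def extract_features(self, prepared: List[float]) -> List[float]:
        return prepared

    def train(self, features: List[float]) -> float:
        # The "model" here is just the mean of the training data.
        return sum(features) / len(features)

    def predict(self, model: float, features: List[float]) -> List[float]:
        return [model for _ in features]
```

Because every model exposes the same four steps, the surrounding deployment tooling never needs to know which model it is running.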
There were a few more presentations by open-source committers that sparked my interest:
Cesar Delgado and DB Tsai (Apple Siri) discussed some of the contributions they made to the interaction between Spark and the Parquet data format, such as schema pruning and predicate push-down on nested columns, which will be part of the upcoming 2.4 release. I was surprised to hear that the Siri team is using a very wide, highly nested schema of around 2,000 columns with only 5 top-level columns. The reason they went for this type of schema was twofold: the shape of the data carries information that would otherwise get lost, and it gives them some benefits in handling schema evolution.
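If I understood the session correctly, the nested-schema pruning ships behind a feature flag in 2.4 and is off by default, so it would be enabled with a single setting (worth double-checking against the release notes before relying on it):

```
# spark-defaults.conf -- prune unused nested fields when reading Parquet
# (new in Spark 2.4, off by default)
spark.sql.optimizer.nestedSchemaPruning.enabled  true
```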
Adaptive Execution (AE) was another interesting contribution, presented in a talk by Yucai Yu and Yuming Wang (eBay). AE can optimize your execution plan dynamically at runtime, which could greatly simplify configuration tuning. It reduces the need to tweak Spark parameters such as executor cores and memory or the number of shuffle partitions. It also improves the performance of joins by picking the join strategy at runtime (using broadcast joins instead of sort-merge joins whenever appropriate).
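For illustration, settings along these lines were shown in the talk; note that they come from the Intel/eBay adaptive-execution branch rather than vanilla Spark, so the exact property names may differ in your build:

```
# Adaptive-execution settings (Intel/eBay AE branch -- verify against
# your Spark distribution before copying).
spark.sql.adaptive.enabled                       true
# Pick the number of reducers at runtime instead of fixing
# spark.sql.shuffle.partitions up front.
spark.sql.adaptive.maxNumPostShufflePartitions   500
# Switch sort-merge joins to broadcast joins when runtime stats allow.
spark.sql.adaptive.join.enabled                  true
```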
Furthermore, I went to a few presentations by Holden Karau (Google), one of the core committers of the Apache Spark project. What I found particularly interesting is that she streams the weekly Apache Spark code reviews on Twitch. So, if you’re interested in following the latest developments in the project, I recommend checking it out.
In conclusion, I was very impressed by the content this year and would highly recommend checking it out or following the developments. Most of the talks are already online and can be viewed either on Databricks’ YouTube channel or through the schedule page on the Spark Summit website.