I had the opportunity to attend Spark Summit 2017 in Dublin from 23rd to 26th October. It was an intense two-day conference with 102 talks. The summit was preceded by a full day of hands-on sessions that took us on a deep dive into Spark: how it works internally, how it makes decisions, and how to determine which choices its heuristics engine made and how to influence them when necessary.
A topic that came up frequently during the conference was testing your applications. It’s a subject we have spent a lot of energy on recently here at Anchormen as well, and I was happy to see a few talks at the summit given by Holden Karau herself, the author of the library we use to test our Spark applications. Holden’s talk can be viewed here.
“As Spark continues to evolve, we need to revisit our testing techniques to support datasets, streaming, and more. This talk expands on “Beyond Parallelize and Collect” (not required to have been seen) to discuss how to create large scale test jobs while supporting Spark’s latest features. We will explore the difficulties with testing Streaming Programs, options for setting up integration testing, beyond just local mode, with Spark, and also examine best practices for acceptance tests.” – Holden Karau, IBM
Several takeaway points from the talks on testing were:
- Testing is hard, but you need to implement it from the beginning in order to be on top of it.
- Unit tests are important, but so are functional tests.
- Spark testing differs from testing typical applications: instead of generating small amounts of fake test data, you need realistic data at meaningful scale.
- It helps if you have access to production data that you can use to test with.
- Testing isn’t enough, you also need validation.
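One practical way to act on these points is to factor the per-record logic of a job out into plain functions that can be unit-tested without spinning up a cluster; libraries like Holden's then cover the cluster-level behaviour. A minimal sketch of that pattern (the `parse_event` function and its record layout are my own illustration, not from the talks):

```python
from typing import Optional, Tuple

def parse_event(line: str) -> Optional[Tuple[str, int]]:
    """Parse a 'user_id,duration' CSV line into a (user_id, duration) pair.

    Returns None for malformed records so the caller can filter them out.
    A pure function like this is what a Spark job would pass to
    map()/filter(), and it can be tested without a SparkContext.
    """
    parts = line.split(",")
    if len(parts) != 2:
        return None
    user_id, raw_duration = parts
    try:
        return user_id.strip(), int(raw_duration)
    except ValueError:
        return None

# Plain unit tests -- no Spark session required.
assert parse_event("alice,42") == ("alice", 42)
assert parse_event("broken-record") is None
assert parse_event("bob,not-a-number") is None
```

In the actual job the same function is applied with something like `rdd.map(parse_event)`, so the code path exercised by the tests is exactly the one that runs on the cluster.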
The idea behind validation is to build checks into your code that log the choices made when processing or discarding data, ideally in a form that you can aggregate after runs. It is important to analyse how your jobs perform in production and to keep track of how they process information, so that changes in the dataset don’t trip up your jobs and they keep processing the data as intended.
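As an illustration of that idea, the sketch below (names and threshold are my own, not from the talks) logs the accepted/discarded counts of a run in an aggregatable form and flags runs that silently drop too much input. In a real Spark job the counts would typically come from accumulators and be shipped to a metrics store:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("validation")

def validate_run(accepted: int, discarded: int,
                 max_discard_rate: float = 0.05) -> bool:
    """Log processing counts for a run and flag suspicious discard rates.

    Emits the counts in a structured form so they can be aggregated
    across runs; returns False when too many records were thrown away,
    which usually means the input dataset has drifted.
    """
    total = accepted + discarded
    rate = discarded / total if total else 0.0
    logger.info("run stats: accepted=%d discarded=%d discard_rate=%.4f",
                accepted, discarded, rate)
    return rate <= max_discard_rate

# A healthy run passes; a run that silently drops 20% of its input does not.
assert validate_run(accepted=9800, discarded=200) is True
assert validate_run(accepted=8000, discarded=2000) is False
```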
An important part of testing your application is making sure that the code you’ve written actually performs well. Spark Bench can help in that regard throughout the development pipeline.
The talk was given by Emily Curtin, a software engineer at IBM, and covered uses of spark-bench ranging from testing tuning options to tracing bottlenecks in your application and benchmarking your jobs. Emily’s talk can be viewed here.
“Spark-bench is an open-source benchmarking tool, and it’s also so much more. spark-bench is a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. spark-bench originally began as a benchmarking suite to get timing numbers on very specific algorithms mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases. […] Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and i/o-intensive workloads; and, yes, even benchmarking.” – Emily Curtin, IBM
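To give a sense of how spark-bench is driven, here is a minimal HOCON configuration in the style of the project’s examples, running a single SparkPi workload. Exact keys can vary between spark-bench versions, so treat this as a sketch rather than a reference:

```hocon
spark-bench = {
  spark-submit-config = [{
    workload-suites = [{
      descr = "Smoke test: one run of SparkPi"
      benchmark-output = "console"
      workloads = [{
        name = "sparkpi"
        slices = 10
      }]
    }]
  }]
}
```

The same file format scales up to the use cases Emily describes: multiple suites, repeated workloads, and varying Spark-submit parameters to compare tuning options.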