As you can read from the posts by my colleagues Lawrence Lee and Bas de Bruijn, we attended the Spark Summit in Dublin in October. In this post, I’d like to summarize my highlights from the Summit, as well as briefly talk about Spark workflows and Large-scale batch ETL jobs.
From my perspective of a data engineer, I mostly focused on the Data Engineering, Developer, Spark Ecosystem and Enterprise tracks, skipping the seemingly more popular topics of Artificial Intelligence, Machine Learning and Deep Learning.
Interestingly, all of the challenges we’re facing at my client’s project in implementing data pipelines with Spark, are also challenges that the big SF Bay area companies are struggling with; and much of it is far from a working solution yet. I was actually surprised to realize this. Even though those companies seem to be far ahead of the curve, they’re quite close at the same time.
At my client’s project we’ve faced a particular data engineering problem, which made it so much more interesting to attend the 1-day training before the actual conference and the various talks. I learned a lot and came home with many pathways for further exploration, but unfortunately not with a ready-made solution.
Of all talks I attended, let me highlight two of them:
LinkedIn developed Azkaban as a job scheduling and orchestration tool, we’re using it at my client as well, though we’re planning to migrate to AirFlow for the easier and more extensive API. What is interesting to see is that LinkedIn uses Azkaban in a more advanced way with a Domain Specific Language (DSL) that they developed on top of it (and seemingly not part of Azkaban itself yet).
The DSL allows LinkedIn to configure the jobs in the flow in 1 json config file, and for testing they can use an overlay config that only changes the input and output paths to point to test data sets.
My main takeaway is that even a large company like LinkedIn is still struggling with getting the basis of testing pipelines end-to-end in place.
Emma Tang from Neustar, shared interesting behaviors of Spark when dealing with really big datasets, with nice examples of how error messages are pointing you in the wrong direction.
Neustar uses clickstream data to analyze the funnel of customers seeing ads (‘impressions’), clicking on them (‘clicks’) and finally buying something (‘conversion’). They have impressive numbers to process: 250 billion impressions, 20 billion clicks and 50 billion conversions. And their largest challenge is to join these 50B x 250B row datasets.
A couple of interesting points made:
- the Spark driver can run give Out of Memory errors while it still has lots of heap left; increasing memory is not going to help, as there is no shortage of heap. Root cause: the driver is hitting the maxInt limit in trying to keep the status per mapper. Solution: reduce the number of partitions.
- Garbage Collection (GC) is Java’s default behavior for managing memory in the JVMs that make up a Spark job, but these JVMs are typically so short-lived that GC can be a waste of time. In case of very large heaps (32 GB+), GC and its ‘stop-the-world’ impact might even take so long that the executors stop responding to network requests from other executors. Solution: set the GC period so high that it will never be met within normal runtimes for the stage.
- As seen elsewhere: a very low number of customers have very large numbers of interactions (100k vs. 50). This long tail causes data skew, which gives very tough engineering challenges. Emma presents many solutions and optimizations
The hype nowadays is all about AI, machine learning and deep learning, and Data Scientist still seems to be the sexiest job in the industry, which was clear as well from the sheer amount of talks on these topics.
A year ago, I joined Anchormen to refocus my career from management back into engineering. The Spark Summit convinced me once more that Data Engineering is underrated and should be as sexy as Data Science: developing, testing, deploying and maintaining reliable pipelines involves many engineering challenges, and you can make the difference with your ability to resolve them.
At my client’s project, we’re facing a problem with a job that takes hours to complete, but interestingly, it only shows partial symptoms of data skew: we see 2-3 tasks (out of 200) that take longer than the others, but there are 2 noticeable difference with data skew: 1) the runtimes of the stragglers are not 2-3 longer than the rest, but orders of magnitude longer (20 mins vs. 15 sec). Interestingly, we don’t see this ratio in the data: we do have larger customers but so much larger. And moreover, telling by the shuffling, the stragglers even have data sizes that are similar to the rest.
Apparently, we’re somehow exceeding a threshold that results in a non-linear behavior, probably including lots of disc I/O. But as we’re using PySpark and Python udf’s, it’s not easy to detect what exactly is going on inside the straggler task.