
Spark Summit 2017 Recap (2/3)

Two weeks ago, I attended Spark Summit EU 2017 in Dublin, Ireland, together with some of my colleagues. This is the second article in the series highlighting our impressions of the event. You can find the previous one here.

The summit hosted 102 talks from industry leaders, with over 1200 attendees from all over the world. A big thank you to Databricks for organizing the event!

First Day

The first day of the three-day summit consisted of a day-long training on Apache Spark tuning and best practices. This content-packed training covered five main topics:

  • Memory usage
  • Broadcast variables
  • Catalyst
  • Tuning shuffling
  • Cluster sizing

It was definitely very insightful; the training was set up in such a way that real-world problems were addressed with very practical tips on how to solve them. The presenter had an impressive depth of knowledge about the inner workings of Spark.

One thing I certainly learned during this session is that there is a lot more to Spark than initially meets the eye. Given some programming experience in Scala or Python, it's fairly straightforward to create Spark jobs. And when jobs run correctly, it is tempting not to give them any further attention, especially under time constraints. But reducing the run-time of your jobs by a significant percentage means you are utilizing your resources better and can get by with a smaller cluster, which in turn can mean a large reduction in operational costs.
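The broadcast-variable technique from the training illustrates this kind of tuning well: instead of shuffling a large dataset across the cluster to join it with a small one, the small table is shipped to every worker once and the join happens map-side. Here is a minimal plain-Python simulation of that idea; the data and function names are illustrative, not Spark's API (in real PySpark you would use `spark.sparkContext.broadcast(...)` or `pyspark.sql.functions.broadcast(...)`):

```python
# A small lookup table: cheap to ship ("broadcast") to every worker once.
country_names = {"NL": "Netherlands", "IE": "Ireland", "DE": "Germany"}

# A large dataset, split into partitions across workers.
partitions = [
    [("NL", 100), ("IE", 250)],
    [("DE", 75), ("NL", 30)],
]

def map_side_join(partition, broadcast_table):
    # Each worker joins locally against its own copy of the small table,
    # so the large dataset never needs to be shuffled.
    return [(broadcast_table[code], amount) for code, amount in partition]

joined = [row for part in partitions
          for row in map_side_join(part, country_names)]
print(joined)
```

Avoiding the shuffle this way is exactly the kind of change that cuts job run-time without touching the cluster itself.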

Following Days

The second and third days made up the conference itself, where the latest developments in Spark and its ecosystem were presented.

Ali Ghodsi, CEO at Databricks, mentioned that in the race towards having the best AI, the main hurdle is actually big data (paraphrasing):

“According to Google, most of their AI projects have very little focus on artificial intelligence and machine learning. The majority of their time is actually spent on big data processing. The hardest part of AI, for Google, is definitely big data processing.”

This just reinforces the notion that in order to achieve cool things with AI, we need to have a good understanding of big data technologies.

Another interesting talk came from Jer Thorp, who worked as a data artist for the New York Times and currently teaches at NYU's Interactive Telecommunications Program. He presented an interesting perspective on the expansion of big data, through the eyes of the people being analyzed by these technologies:

“What is it like to live in data? To be used. To be without agency. To be overwhelmed by complexity. We’re building amazing things, really amazing things, but amazing does not mean livable.” – Jer Thorp, Data Artist

He mentioned two possible solutions, the first being an organization called the AI Now Institute (https://ainowinstitute.org/). This organization drafted a list of 10 points that companies and organizations can take into account when building their big data applications, in order to make them more livable. The list can be found in this PDF document.

Among other things, it advocates against black-box AI systems in the public domain and for running trials to ensure that existing biases are not amplified. Jer's talk can be seen here.

All in all, the summit hosted talks on diverse topics and different perspectives on not only Spark but big data technologies as a whole. I learned a lot, and I'm excited to see what next year's Spark Summit will hold!

Like this article and want to stay updated on more news and events?
Then sign up for our newsletter!
