In this post, we will put together a small event-driven data processing pipeline using EMR and Spark, triggered by an S3 PUT event. When a text file is added to our S3 bucket, we want an EMR cluster to spin up and count the words in the file. The job then saves its output back to S3 and the cluster terminates.
Basic knowledge of Amazon services and Spark is required. The code is written in Python.
What is EMR?
EMR (Elastic MapReduce) is an Amazon service built on top of EC2 and aimed at distributed processing in an elastic manner. As with EC2, Amazon makes it easy to launch a cluster of any size and processing power. A cluster can be launched with Hadoop and Spark pre-installed, ready for crunching Big Data problems. It’s a powerful and flexible service that makes running a Big Data infrastructure much easier and, in some cases, more cost-effective.
Like all technologies, it has pros and cons.
Pros:
• Provisioning bigger or smaller clusters is extremely easy.
• No upfront costs; you pay only for what you use.
• Clusters are covered by Amazon's mature security model.
• A great fit for sporadic workloads that do not need a cluster to be always on.
Cons:
• Charges add up, especially if a cluster is left running.
• Cluster storage on EC2 is transient; to persist, output must be written to S3.
• Clusters take a few minutes to actually spin up.
Why Event Driven and Lambda?
Lambda is a great service by Amazon. A Lambda function is a small function that fires in response to a configured trigger event. It runs with limited computing power and has a hard limit on how long it can run (currently 15 minutes). However, a Lambda function has the advantage of running immediately after the event, with plenty of structured information about that specific event.
We can hook a Lambda function to trigger on incoming data, spin up an EMR cluster, and immediately process that data. This event-driven approach removes the need for any form of scheduling; it also scales extremely well. Since every input can be processed on its own cluster, scaling happens automatically. Just provide the system with more data and forget about it.
The biggest disadvantage of this approach is error detection. It’s arguably easier to trace the logs of runs that were invoked on a schedule than those of many smaller event-triggered runs, any of which could have run into problems. A robust logging and alerting solution should be put in place.
Perhaps most importantly, this approach makes it possible to write a completely serverless solution that does not require a team focused solely on Big Data infrastructure.
Creating Lambda Function
First, we need to create a Lambda function in Amazon. We will select the s3-get-object-python3 blueprint, which gives us options to add a trigger for an S3 event. See image below.
We select the PUT event type, which triggers when a new file is created in S3. For this example we will only process files ending in .txt. You could also restrict processing to files in a specific folder. Make sure you select the right S3 bucket.
IMPORTANT: All the services used here must be in the same region.
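For readers who prefer code to the console, the trigger settings described above could be sketched with boto3 roughly as follows. This is only an illustration under assumptions: the helper names `build_notification_config` and `attach_trigger` are hypothetical, and the bucket name and function ARN are whatever you pass in.

```python
def build_notification_config(lambda_arn, suffix=".txt"):
    """Build an S3 notification config that invokes a Lambda on PUT of matching keys."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                # Fire only on objects created through a PUT request.
                "Events": ["s3:ObjectCreated:Put"],
                # Restrict the trigger to keys ending in the given suffix.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def attach_trigger(bucket, lambda_arn):
    """Apply the notification config to a bucket (hypothetical helper)."""
    # Imported lazily so the config builder can be used without AWS access.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(lambda_arn),
    )
```

Note that this call replaces the bucket's existing notification configuration, and that the Lambda function must separately grant S3 permission to invoke it; the console blueprint sets that permission up for you.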
Next we will add our own Lambda function.
The function receives the event, which contains all the information we need about what happened. It then creates an EMR cluster job. Most of the properties are self-explanatory. One thing to note is the steps configuration, where we can add any number of steps to our processing. In our case we only need to submit the Spark application that lives in our S3 bucket. The Spark application takes a single parameter: the dataset location. If the step fails to complete, the cluster terminates, as specified in the step's failure action. This is extremely important to avoid paying charges for a dead cluster.
You can also specify a bootstrap script, which can install any dependencies or applications required on the cluster. It’s not needed for our example, but is left for reference.
Save the Lambda function and create a test to verify that it works. Be careful: the test will spawn a real cluster, which incurs charges.
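A minimal sketch of a Lambda function along these lines is shown below. It is not a definitive implementation: the bucket name, log location, EMR release label, instance types, and IAM role names are assumptions to replace with your own, and the helper names `extract_s3_path` and `build_job_flow` are hypothetical.

```python
import urllib.parse

# Hypothetical placeholders; replace with your own values.
BUCKET = "my-emr-demo-bucket"
SCRIPT_URI = f"s3://{BUCKET}/wordcount.py"
LOG_URI = f"s3://{BUCKET}/emr-logs/"


def extract_s3_path(event):
    """Return the s3:// path of the object that triggered the event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])
    return f"s3://{bucket}/{key}"


def build_job_flow(input_path):
    """Build the arguments for EMR's run_job_flow call."""
    return {
        "Name": "wordcount-on-upload",
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick one available to you
        "LogUri": LOG_URI,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "Run word count",
                # Terminate on failure so we never pay for a dead cluster.
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             SCRIPT_URI, input_path],
                },
            }
        ],
        # Optional bootstrap script (hypothetical path), left for reference:
        # "BootstrapActions": [{"Name": "Install dependencies",
        #     "ScriptBootstrapAction": {"Path": f"s3://{BUCKET}/bootstrap.sh"}}],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumes the default EMR roles exist
        "ServiceRole": "EMR_DefaultRole",
        "VisibleToAllUsers": True,
    }


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without boto3 installed.
    import boto3

    input_path = extract_s3_path(event)
    emr = boto3.client("emr")
    response = emr.run_job_flow(**build_job_flow(input_path))
    return {"JobFlowId": response["JobFlowId"]}
```

Keeping the configuration in a plain function means it can be checked locally with a hand-written event before anything is deployed, without launching a cluster.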
The Spark application
Next, we need to add our Spark script to our S3 bucket. EMR does not have persistent storage of its own; it must rely on S3, or the data will be lost when the cluster terminates. So our Spark script lives on S3 and is copied to the cluster as part of the job our Lambda function submits.
The file name is wordcount.py.
It’s a very simple Spark application that only counts the occurrences of each word in a text file. It saves the output to a folder named wordcount_output in an S3 bucket. So create this folder in your S3 bucket and copy over the script.
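A word count script of this kind might look like the sketch below. The output bucket name is a hypothetical placeholder, and the helpers `split_words` and `output_path_for` are illustrative names. One design choice to flag: the sketch writes each result to its own subfolder under wordcount_output, because Spark's `saveAsTextFile` refuses to write into a path that already exists.

```python
import posixpath
import re
import sys

# Hypothetical output location; point this at your own bucket.
OUTPUT_PREFIX = "s3://my-emr-demo-bucket/wordcount_output"


def split_words(line):
    """Lower-case a line and return its words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def output_path_for(input_path):
    """Give each input file its own output subfolder under OUTPUT_PREFIX."""
    name = posixpath.splitext(posixpath.basename(input_path))[0]
    return f"{OUTPUT_PREFIX}/{name}"


def main(input_path):
    # pyspark is available on the EMR cluster; imported here so the helpers
    # above can be checked locally without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()
    counts = (
        spark.sparkContext.textFile(input_path)
        .flatMap(split_words)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    # One "word<TAB>count" line per distinct word.
    counts.map(lambda pair: f"{pair[0]}\t{pair[1]}").saveAsTextFile(
        output_path_for(input_path)
    )
    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: wordcount.py <s3-input-path>", file=sys.stderr)
    else:
        main(sys.argv[1])
```

The RDD output files are named part-00000 and so on, without a .txt extension, so writing them to the same bucket should not re-trigger a Lambda that filters on .txt.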
Spin it up!!
That’s it! We’re done. Upload any .txt file to your bucket. The Lambda function will fire and spin up a cluster. The cluster will automatically run the Spark job on the data and save the output in the specified folder. Easy, right?
You can view the logs from the Lambda function and from EMR to debug any problems that might happen.
This pattern is extremely powerful and flexible, and we have only scratched the surface of what can be done with it. A fully automatic, highly scalable, and cost-effective big data pipeline is very attractive to organizations, especially those with sporadic or time-sensitive big data problems.
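As a starting point for debugging, a small helper could list the steps of a cluster that did not finish cleanly. The sketch below assumes you already have the cluster ID (for example, the JobFlowId returned when the cluster was created); `failed_steps` and `check_cluster` are hypothetical names.

```python
def failed_steps(list_steps_response):
    """Return (name, state, message) for steps that did not finish cleanly."""
    bad_states = {"FAILED", "CANCELLED", "INTERRUPTED"}
    results = []
    for step in list_steps_response.get("Steps", []):
        status = step["Status"]
        if status["State"] in bad_states:
            # FailureDetails is only present when EMR recorded a failure.
            message = status.get("FailureDetails", {}).get("Message", "")
            results.append((step["Name"], status["State"], message))
    return results


def check_cluster(cluster_id):
    """Fetch the steps of a cluster from EMR and report the unsuccessful ones."""
    # Imported lazily so failed_steps can be used on saved responses offline.
    import boto3

    emr = boto3.client("emr")
    return failed_steps(emr.list_steps(ClusterId=cluster_id))
```

A check like this could run on a schedule or from another Lambda function and feed whatever alerting you already use.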