Streaming Big Data Training

Due to COVID-19, our training courses will be taught via an online classroom.

Receive in-depth knowledge from industry professionals, test your skills with hands-on assignments & demos, and get access to valuable resources and tools.

This course is a deep dive into streaming technologies used for real-time processing applications. The lessons focus on Kafka, which provides a scalable solution for decoupling data streams; Spark's structured streaming data model; Airflow for scheduling; and data architectures (Lambda and Kappa). After this course, you will be able to design high-quality streaming applications, such as processing raw data and writing the cleaned data to a MySQL database, or transferring data from a MySQL database to a Postgres database. This course is ideal for data engineers who want to master streaming applications. Experience with a programming language such as Python or Java, as well as with Spark, is required.

Are you interested? Contact us and we will get in touch with you.


About the training & classes

The Streaming Big Data training is split into four days. Click below to see a detailed description of each class:

Spark Streaming

In this training you will be introduced to Spark’s structured streaming APIs. Participants are introduced to streaming concepts such as event time, late data, windowing, and watermarking. During the practical session participants will solve several streaming queries regarding order (sales) data using Spark and Kafka.

The training includes theory, demos, and hands-on exercises. After this training you will have gained knowledge about:

  • Previous and current streaming APIs in Spark
  • Spark structured streaming data model
  • Considerations concerning streaming query output modes
  • Event time and late data
  • Windowing and watermarking to solve late data issues
  • Hands-on solving structured streaming queries
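
To preview how windowing, watermarking, and late data fit together, here is a minimal plain-Python sketch. It is not Spark's API (in Spark you would use `withWatermark` and `window`); it only illustrates the idea that a watermark trails the maximum event time seen so far, and events older than the watermark are dropped rather than reopening windows whose state may already be finalized:

```python
from collections import defaultdict

class WatermarkedWindowCounter:
    """Counts events into fixed tumbling windows, dropping events older than
    max_event_time - delay_threshold (the watermark), mimicking the idea
    behind Spark's watermark semantics."""

    def __init__(self, window_size: int, delay_threshold: int):
        self.window_size = window_size
        self.delay_threshold = delay_threshold
        self.max_event_time = 0          # highest event time seen so far
        self.windows = defaultdict(int)  # window start -> event count

    def process(self, event_time: int) -> bool:
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.delay_threshold
        if event_time < watermark:
            return False  # too late: the window may already be finalized
        window_start = (event_time // self.window_size) * self.window_size
        self.windows[window_start] += 1
        return True

counter = WatermarkedWindowCounter(window_size=10, delay_threshold=5)
counter.process(12)             # on time -> counted in window [10, 20)
counter.process(25)             # advances the watermark to 25 - 5 = 20
accepted = counter.process(14)  # late: 14 < watermark 20, so dropped
```

In the practical session you will express the same pattern as streaming queries over order data, with Spark managing the window state for you.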

Kafka

The Kafka training aims to provide an overview of the Apache Kafka platform. Participants will learn about Kafka terminology and how Kafka provides a scalable solution for decoupling data streams. Topics such as partitioning and message guarantees will be addressed. During the practical session participants will use a Dockerized Kafka broker to explore basic consuming and producing, followed by a more complex change data capture (CDC) scenario.

The training introduces Kafka concepts and theory followed up by hands-on exercises. After this training you will have gained knowledge about:

  • The problems Kafka solves
  • Kafka terminology and internals
  • Partitioning and scaling Kafka
  • The various message guarantees provided by Kafka
  • Kafka security and ACL options
  • Schemas and schema registry
  • Basic Kafka consuming and producing
  • Change data capture and Kafka
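
The partitioning idea behind Kafka's scalability can be previewed with a small sketch. This is plain Python, not a Kafka client, and md5 stands in for the murmur2 hash used by Kafka's default partitioner; the effect illustrated is the same: equal keys always map to the same partition, which preserves per-key ordering while spreading load across partitions.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition, as Kafka's default partitioner does.

    Kafka hashes keys with murmur2; md5 stands in here for illustration.
    """
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Messages with the same key always land in the same partition, which is
# what gives Kafka per-key ordering while still scaling out consumers.
assert partition_for(b"customer-42", 6) == partition_for(b"customer-42", 6)
```
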

Data Architectures

Learn how to set up different (big) data architectures, the design principles behind them, and the trade-offs between them. This lesson explores the Lambda and Kappa architectures and lets students build a small-scale prototype of each.
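
As a taste of the Lambda Architecture covered in this lesson, the sketch below (plain Python, with made-up sales figures) shows its core idea: a serving layer answers queries by merging a periodically recomputed batch view with a fast, incrementally updated speed-layer view.

```python
# Toy Lambda architecture: the batch layer recomputes a complete view on a
# schedule, while the speed layer keeps an incremental view of recent events.
batch_view = {"product-a": 100, "product-b": 40}  # slow, complete, periodic
speed_view = {"product-a": 3, "product-c": 7}     # fast, recent events only

def query_sales(product: str) -> int:
    # Serving layer: merge both views so answers are complete AND current.
    return batch_view.get(product, 0) + speed_view.get(product, 0)

query_sales("product-a")  # 103: batch total plus recent stream updates
```

The Kappa Architecture removes the batch layer entirely and reprocesses history by replaying the stream; the prototypes you build in class contrast these two designs.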

Scheduling with Airflow

This training aims to give an overview of what Apache Airflow is, how it works, and how it can be used in practice.

First, we will discuss the core architecture components, including the metadata database, scheduler, executor, and worker nodes, and how they interact in single- and multi-node architectures. We then move to Airflow-specific concepts such as directed acyclic graphs (DAGs), operators, tasks, and the task lifecycle in a workflow. Finally, we discuss some additional Airflow functionality, such as hooks, connections, and XComs.

After this theoretical overview, we gain hands-on experience in a two-part lab session. In the first lab, we set up a workflow to receive raw data from a client and write the cleaned data to a MySQL database, where we can then query the data to generate sales reports. For the second lab, we use Airflow hooks and connections to transfer the data from a MySQL database to a Postgres database.

The training includes theory, demos, and hands-on exercises. After this training, you will have gained knowledge about:

  • Various Airflow use cases and applications
  • Architecture components: metadata database, scheduler, executor, workers
  • Single vs. Multi-node architectures
  • Directed Acyclic Graphs (DAGs)
  • Operators
  • Tasks and the task lifecycle
  • Hooks
  • Connections
  • XComs
  • Lab session to get hands-on experience writing DAG files and taking advantage of the Airflow web UI for scheduling data workflows
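
To make the DAG idea concrete without requiring an Airflow installation, the sketch below shows in plain Python how a scheduler derives an execution order from task dependencies. The task names mirror the first lab but are illustrative only; in real Airflow, tasks are created with operators and wired together with the `>>` operator inside a DAG file.

```python
# Toy dependency graph: task -> list of upstream tasks it must wait for.
# (Illustrative names loosely following lab 1; not Airflow's API.)
dag = {
    "extract_raw": [],               # no upstream tasks
    "clean": ["extract_raw"],        # runs after extraction
    "load_mysql": ["clean"],
    "sales_report": ["load_mysql"],
}

def execution_order(dag: dict) -> list:
    """Run tasks whose upstream tasks are all done (assumes an acyclic graph),
    which is essentially what Airflow's scheduler does each cycle."""
    order, done = [], set()
    while len(done) < len(dag):
        for task, upstream in dag.items():
            if task not in done and all(u in done for u in upstream):
                order.append(task)
                done.add(task)
    return order

execution_order(dag)
# -> ['extract_raw', 'clean', 'load_mysql', 'sales_report']
```

In the lab sessions you will write real DAG files, inspect them in the Airflow web UI, and let the scheduler handle this ordering for you.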