Affordable automatic deployment of Spark and HDFS with Kubernetes and Gitlab CI/CD
Running an application on Spark with external dependencies, such as R and Python packages, requires installing these dependencies on all the workers. To automate this tedious process, we developed a continuous-deployment workflow using Gitlab CI/CD. The workflow consists of: (i) building the HDFS and Spark Docker images with the required dependencies for the workers and the master (Python and R), and (ii) deploying the images on a Kubernetes cluster. For this, we use an affordable cluster made of mini PCs and, more importantly, demonstrate that it is fully operational. The Spark cluster is accessible through the Spark UI, Zeppelin and RStudio. In addition, HDFS is fully integrated with Kubernetes. The source code for the custom Docker images and for the Kubernetes object definitions can be found here and here, respectively.
As recently stated here, Kubernetes is becoming the standard for container orchestration. The main cloud providers (AWS, Azure and Google Cloud) offer Kubernetes cluster services. Azure has even gone so far as to rename ACS (Azure Container Service) to AKS, now promoted as a “fully managed Kubernetes container orchestration service or choose other orchestrators”. Cisco has launched its own Container Platform based on Kubernetes. Even Docker has added support for Kubernetes, despite having its own orchestration technology, Docker Swarm. As Erkan Yanar (Senior Consultant, Microservice Architecture) highlighted here, “Kubernetes will become lingua franca in 2018”.
Kubernetes server set up using mini PCs (Intel® NUC)
An affordable alternative to hiring VMs in a cloud environment is to buy mini PCs. A nice example is the Intel® NUC: a three-node cluster of NUC5I3RYH machines (3 i3 processors, 48 GB of memory and 3 TB of total hard drive) can be obtained on a modest $2,000 budget. In this blog, we set up a Kubernetes cluster using this hardware. For that purpose, Ubuntu Server 16.04 was installed on each of the nodes as shown here.
This blog post assumes prior knowledge of Docker, Kubernetes and Gitlab CI/CD. To create a Kubernetes cluster using kubeadm, follow this guide. Additionally, a running Gitlab installation with Gitlab runners is needed; see here and here if needed. A Docker Hub account is also required; you can get one here.
The master node was named nuc01 and the other two nuc02 and nuc03; fixed IPs were assigned to them. For security reasons they are not shown; in the code they appear as <IP-nuc0X>, where X is 1, 2 or 3.
When you finish, you should be able to interact with the Kubernetes cluster; test it by running kubectl get nodes.
Images for the Spark master and workers are based on this excellent blog, with some necessary modifications.
Regarding the worker image:
- Ubuntu 16.04 is used as a starting point for development and standardization purposes.
- Apache Spark 2.2.0 is installed.
- R base 3.4.3 is added including some libraries required for Zeppelin.
- Additional R libraries can be added at this point.
- Python 3.6.3 and Anaconda 5.0.1 are included.
- Additional python libraries can be added at this point.
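A sketch of the worker Dockerfile illustrates where these dependencies fit. Versions follow the list above, but the download URLs, install commands and example packages (knitr, pandas) are illustrative; the real Dockerfiles are in the linked repository.

```dockerfile
FROM ubuntu:16.04

# Apache Spark 2.2.0 (illustrative download URL)
RUN apt-get update && apt-get install -y curl openjdk-8-jre-headless \
 && curl -sL https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz \
    | tar -xz -C /opt \
 && ln -s /opt/spark-2.2.0-bin-hadoop2.7 /opt/spark

# R base 3.4.3 plus some libraries required by Zeppelin
RUN apt-get install -y r-base \
 && Rscript -e 'install.packages("knitr")'   # additional R libraries go here

# Python 3.6.3 via Anaconda 5.0.1 (illustrative installer URL)
RUN curl -sL https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh -o /tmp/conda.sh \
 && bash /tmp/conda.sh -b -p /opt/conda \
 && /opt/conda/bin/pip install pandas        # additional Python libraries go here
```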
Regarding the master image:
- The worker image obtained before is used as a starting point.
- RStudio Server version 1.1.383 is included.
- Zeppelin version 0.7.3 is also included.
- Supervisor is added to monitor and control these programs as well as the Spark master.
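A minimal supervisord configuration for the master image could look like the following. The program commands are assumptions based on the standard launch scripts of each tool (for example, SPARK_NO_DAEMONIZE keeps the Spark 2.x master in the foreground); the real configuration lives in the image repository.

```ini
[supervisord]
nodaemon=true

[program:spark-master]
command=/opt/spark/sbin/start-master.sh
environment=SPARK_NO_DAEMONIZE=1   ; keep the process in the foreground for supervisor

[program:zeppelin]
command=/opt/zeppelin/bin/zeppelin.sh   ; runs in the foreground by default

[program:rstudio]
command=/usr/lib/rstudio-server/bin/rserver --server-daemonize=0
```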
As stated in this repository, a proxy is needed to direct all requests to the Spark master and Spark worker UIs. Only minimal modifications were made to this project.
A helm chart can be found in this repository, including Kerberos for access control. However, HDFS is exposed to the Kubernetes cluster network because they found that otherwise the name node fails to see the physical IP addresses of the data nodes. We think this could lead to a security breach when Kerberos is not used, as in our case. Therefore, we removed the Kubernetes specification hostNetwork: true from the yml files and kept the connections inside the Kubernetes network. We then found that the data node daemons could not join the name node, since their IPs are assigned randomly by the Kubernetes server and cannot be added to the hosts list in advance. Consequently, we changed the HDFS property dfs.namenode.datanode.registration.ip-hostname-check to false to avoid this problem. Although this might lead to another security breach regarding unauthorized data node connections (see here), Kubernetes controls access to the pods. Moreover, data nodes no longer need to be added to the name node's hosts file for reverse DNS lookup.
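The net effect of that change is a single property in hdfs-site.xml (whether it is injected via a config map or a startup script depends on the chart):

```xml
<property>
  <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
  <value>false</value>
</property>
```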
Setting up Kubernetes Objects
As described in our previous blog, the deployment of Spark on top of Kubernetes is highly dependent on the cluster design. Since data should be preserved across failures and restarts, StatefulSets were chosen, except for the HDFS data nodes, which are deployed as DaemonSets to ensure that every Kubernetes node runs a copy of the pod. The HDFS name node is deployed on the Kubernetes master node, and the data node DaemonSet is not allowed to run on this node.
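Keeping the data node DaemonSet off the master node can be expressed with a nodeAffinity rule along these lines. The label values, object names and image name are assumptions; the actual object definitions are in the linked repository.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hdfs-datanode
spec:
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values: ["nuc01"]   # keep data nodes off the Kubernetes master node
      containers:
      - name: datanode
        image: <dockerhub-user>/hdfs-datanode   # hypothetical image name
```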
Together with the StatefulSets and the DaemonSets, we create a headless service, two external-IP services and a Persistent Volume Claim. The headless service is required by the HDFS name node to control the domain of its pods and is used by the data nodes for service discovery. The external-IP services expose the Spark UI, Zeppelin UI and RStudio UI on a cluster-external IP that is reachable from outside the Kubernetes cluster. The Persistent Volume Claim is used to provide an NFS persistent volume. The Kubernetes object definitions are self-explanatory and can be found here. Moreover, a schematic representation is shown in the image below.
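For reference, a headless service is just a regular Service with clusterIP: None, which makes each backing pod resolvable under its own DNS name. The names and port below are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hdfs-namenode
spec:
  clusterIP: None        # headless: DNS resolves directly to the pod IPs
  selector:
    app: hdfs-namenode
  ports:
  - port: 8020           # HDFS RPC port
```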
Architecture of the Kubernetes cluster – Connections are initiated on the Kubernetes external IP and then routed to the corresponding UI depending on the port used. The Spark master pod and worker pods can be allocated on any Kubernetes node, whereas the HDFS name node is always deployed on the Kubernetes master node. Data nodes are deployed as DaemonSets to ensure that every Kubernetes node (except the master) runs a copy of this pod. The filesystem of the nodes is used to persist HDFS data even if pods are lost or restarted. NFS is set up on nodes NUC02 and NUC03 to persist files (for example, config files or Zeppelin notebooks) and is available to any pod. Connections are represented by different arrow types and colors, as shown in the legend.
Gitlab CI/CD and Configuration Variables
The main variables in a Spark cluster configuration are resource-management values such as memory and CPU. Additionally, we include the passwords for the RStudio and Zeppelin UIs as Gitlab CI/CD secret variables, which are injected into the Kubernetes cluster as Secrets via environment variables in the Gitlab runners. In this way, the only place where they are stored is the Gitlab repository. Moreover, they can be exposed only to specific branches, such as deploy.
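As a sketch, the deploy job can turn these secret variables into a Kubernetes Secret with a single kubectl call. The secret name ui-credentials is illustrative; CI_PROJECT_NAME is a predefined Gitlab CI variable, matching the repository-name-as-namespace convention described below.

```yaml
deploy:
  stage: deploy
  script:
    - kubectl create secret generic ui-credentials --namespace="$CI_PROJECT_NAME" --from-literal=rstudio-user="$RSTUDIO_USER" --from-literal=rstudio-password="$RSTUDIO_PASSWORD" --from-literal=zeppelin-user="$ZEPPELIN_USER" --from-literal=zeppelin-password="$ZEPPELIN_PASSWORD"
  only:
    - deploy
```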
Configuration parameters in the Gitlab repository:
- [gitlab repository name]: It is important to use different namespaces for different deployments to avoid accidental deletions and unwanted connections, and for security and monitoring reasons. Therefore, the name of the repository is used by default as the Kubernetes namespace where the cluster is deployed.
- [DOCKER_REGISTRY_USER]: Images are stored on Docker Hub, so account details must be provided to access it. Enter the username here.
- [DOCKER_REGISTRY_PASSWORD]: Images are stored on Docker Hub, so account details must be provided to access it. Enter the password here.
- [KUBE_CA_PEM]: In the deployment phase, kubectl needs credentials with enough permissions to delete and create all the Kubernetes objects described above. Enter the Kubernetes CA certificate here. For additional details go here. Note that Gitlab 10.3 deprecated the Kubernetes service integration, so we decided to add these parameters as secret variables for compatibility reasons; see here for more details.
- [KUBE_TOKEN]: In the deployment phase, kubectl needs credentials with enough permissions to delete and create all the Kubernetes objects described above. Enter the ServiceAccount token here.
- [RSTUDIO_PASSWORD]: To sign in to RStudio, a username and password must be provided. Enter the password here.
- [RSTUDIO_USER]: To sign in to RStudio, a username and password must be provided. Enter the username here.
- [ZEPPELIN_PASSWORD]: To sign in to Zeppelin, a username and password must be provided. Enter the password here.
- [ZEPPELIN_USER]: To sign in to Zeppelin, a username and password must be provided. Enter the username here.
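In the deploy job, KUBE_CA_PEM and KUBE_TOKEN are consumed roughly as follows; the cluster, user and context names are placeholders:

```yaml
before_script:
  - echo "$KUBE_CA_PEM" > kube.ca.pem
  - kubectl config set-cluster nuc-cluster --server="https://<IP-nuc01>:6443" --certificate-authority=kube.ca.pem
  - kubectl config set-credentials gitlab-deployer --token="$KUBE_TOKEN"
  - kubectl config set-context deploy --cluster=nuc-cluster --user=gitlab-deployer
  - kubectl config use-context deploy
```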
Configuration parameters in the .gitlab-ci.yml file
- [SPARK_MASTER_REQUESTS_CPU]: The Kubernetes scheduler ensures that the sum of all requests is less than the capacity of the node. Set here the number of CPUs dedicated to the Spark master. If the requests exceed the capacity of the node, the pod won't be launched until enough resources are available.
- [SPARK_MASTER_REQUESTS_MEMORY]: Similar to the above, but for the memory requested for the Spark master.
- [SPARK_MASTER_LIMITS_CPU]: Pods and their containers are not allowed to exceed this CPU limit. However, unlike with memory, they won't be killed if the limit is exceeded. We recommend using the same value as SPARK_MASTER_REQUESTS_CPU.
- [SPARK_MASTER_LIMITS_MEMORY]: Pods and their containers are not allowed to exceed this memory limit. If it is exceeded, the pod's containers will be killed by the kernel (see here for a more detailed explanation). We recommend a value higher than SPARK_MASTER_REQUESTS_MEMORY, for example 1 GB more.
- [SPARK_WORKER_REQUESTS_CPU]: Similar to the above, but for the CPU(s) requested for the Spark worker. This value is passed to the workers at startup via the start-slave.sh command.
- [SPARK_WORKER_REQUESTS_MEMORY]: Similar to the above, but for the memory requested for the Spark worker.
- [SPARK_WORKER_LIMITS_CPU]: Similar to the above, but for the CPU limits of the Spark worker.
- [SPARK_WORKER_LIMITS_MEMORY]: Similar to the above, but for the memory limits of the Spark worker.
- [SPARK_CORES_MAX]: The total maximum number of CPUs for the application (across the cluster, not per machine). If not set, all available cores of the cluster will be used.
- [SPARK_EXECUTOR_CORES]: Number of CPUs used by each executor process. Since executors run on the workers, it must be lower than SPARK_WORKER_LIMITS_CPU.
- [SPARK_EXECUTOR_MEMORY]: Amount of memory used by each executor process. It must be lower than SPARK_WORKER_LIMITS_MEMORY to avoid the container being killed.
- [SPARK_DRIVER_CORES]: Number of CPUs used by the driver process. Since the driver runs on the master, it must be lower than SPARK_MASTER_LIMITS_CPU.
- [SPARK_DRIVER_MEMORY]: Amount of memory used by the driver process. It must be lower than SPARK_MASTER_LIMITS_MEMORY to avoid the container being killed.
- [SPARK_DRIVER_MAXRESULTSIZE]: As stated in the Spark configuration, this limits the total size of the serialized results of all partitions for each Spark action (e.g. collect). As suggested there, setting a limit can help avoid out-of-memory errors.
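In the Kubernetes objects, the REQUESTS and LIMITS variables end up in the resources section of the container spec, roughly as follows. The values and the image name are examples:

```yaml
containers:
- name: spark-worker
  image: <dockerhub-user>/spark-worker   # hypothetical image name
  resources:
    requests:
      cpu: "2"        # SPARK_WORKER_REQUESTS_CPU
      memory: "8Gi"   # SPARK_WORKER_REQUESTS_MEMORY
    limits:
      cpu: "2"        # SPARK_WORKER_LIMITS_CPU, same as the request
      memory: "9Gi"   # SPARK_WORKER_LIMITS_MEMORY, 1 GB above the request
```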
Configuration of Zeppelin interpreters
Apache Zeppelin uses interpreters as the interface to a specific language or data-processing backend (see here for a more detailed explanation). For example, to use Scala code in Zeppelin, you need the %spark interpreter. Each interpreter is configured through settings, which can be modified per notebook. However, these settings override the configuration of the data-processing backend, which might cause problems. For example, the %spark interpreter sets the spark.executor.memory and spark.cores.max Spark variables to their default values, i.e. 1G and all available cores, respectively. Since we are interested in controlling these values, they should be modified in Zeppelin as well. This can be done manually (see here) or using environment variables (for example, exporting SPARK_HOME as here). However, the latter option cannot be used with the spark.executor.memory and spark.cores.max variables, and the former is inconvenient. To our knowledge, there is no way to modify the default values except by editing the jar files of the interpreter (if someone knows how to do it differently, please let us know), which is hard to automate. In contrast, configuring these settings through the Zeppelin REST API is easy to automate by providing JSON files with the right configuration. In this way, we have configured the following Zeppelin settings:
- spark.cores.max and spark.executor.memory from the Spark interpreter.
- HDFS URL from the file interpreter.
- Python version from the python interpreter.
Additional settings can be modified in the same way if needed.
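As an illustration of the REST-API approach, the sketch below writes the desired Spark properties to a JSON file and (commented out) pushes them to Zeppelin with curl. The host, port and setting id are placeholders for your deployment, and the exact body expected by PUT /api/interpreter/setting/{id} should be checked against the Zeppelin REST API documentation:

```shell
# Desired Spark interpreter properties (values are examples)
cat > spark-setting.json <<'EOF'
{
  "properties": {
    "spark.cores.max": "6",
    "spark.executor.memory": "2g"
  }
}
EOF

# Push the configuration through Zeppelin's REST API (placeholder host and setting id):
# curl -X PUT "http://<kubernetes-external-ip>:8080/api/interpreter/setting/<spark-setting-id>" \
#      -H "Content-Type: application/json" -d @spark-setting.json

cat spark-setting.json
```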
In this blog, a detailed guide is provided for the automated deployment of a Spark cluster, including HDFS and three functional UIs: the Spark web UI, Zeppelin and RStudio. Applications can be run interactively through Zeppelin or RStudio, or directly from the master node (see an example here). For PoC purposes, it has been deployed on an affordable cluster built from NUC mini PCs.
The cluster can easily be scaled up and down by increasing or decreasing the number of worker pods. Moreover, new workers include the external dependencies provided in the worker image, so they can be easily maintained by keeping this image up to date.
As also stated in our previous blog, we skipped the Kubernetes container readiness and liveness probes, as well as securing access to the Kubernetes cluster and the provisioned services. Take these into consideration if you use this kind of approach in a production environment.