Elasticsearch Deployment on Kubernetes

In this blog post, we provide a guide for deploying Elasticsearch on top of Kubernetes. We walk through the process of setting up a production-grade, three-node Elasticsearch cluster using a Kubernetes StatefulSet. In addition to setting up the Kubernetes objects, we extend the official Elasticsearch Docker image with an entry-point script that adjusts system settings at the container level, to allow seamless and optimized integration with Kubernetes. For convenience, we also provision the necessary Elasticsearch configuration files during the custom image build phase. Source code for both the custom Elasticsearch Docker image and the Kubernetes object definitions can be found here.

1. Preamble 

Since version 5.x, Elastic has provided official, production-grade Docker images for almost all Elastic components, supporting a wide range of usage and deployment scenarios. While Docker and its CLI can be sufficient for development and test purposes, they fall short when it comes to managing typical production deployments, which go beyond the management of one or more containers on a single host.

When moving to production, multi-container workloads are typically deployed on a cluster of machines. In addition to provisioning and managing the networking between these containerized applications, common production requirements include high availability, zero-downtime deploys, (auto)scaling, load balancing, health checks, rescheduling of failed containers, as well as service discovery and exposure. This is where container orchestration platforms fit in. Kubernetes is one of several orchestration tools for provisioning containerized infrastructure.

This blog post assumes prior knowledge of Docker, Kubernetes and Elasticsearch. Additionally, it is recommended to read the documentation of the official Elasticsearch Docker image to get acquainted with the image and its characteristics.

2. Extending the Docker Image

There are two reasons for extending the official Elasticsearch Docker image. First, to provide a custom entry-point script that applies advanced system settings and raises resource usage limits. Second, to provide Elasticsearch configuration files.

A. Advanced system settings and resource usage limits:

Performant Elasticsearch deployments require the configuration of advanced system settings and increased resource usage limits. The following points are addressed before starting the Elasticsearch process.

  1. Increasing the limit on the number of files a single process can have open at a time: Lucene and Elasticsearch use a lot of file descriptors (file handles). For example, Elasticsearch uses many sockets to communicate between nodes and with HTTP clients, each of which consumes a file descriptor. Running out of file descriptors can be disastrous and will most probably lead to data loss. For that reason, it is recommended to increase the limit on the number of open file descriptors for the user running Elasticsearch to 65,536 or higher. We set “ulimit -n 65536” as root before starting Elasticsearch. This can be verified by checking the max_file_descriptors value reported for each node by the Nodes Stats API: GET _nodes/stats/process?filter_path=**.max_file_descriptors. Additional information can be found here and here.
  2. Increasing the limit on the number of processes/threads that the Elasticsearch process can create: Elasticsearch uses several thread pools for different types of operations. It is important that it can create new threads whenever needed. On Linux, Elasticsearch enforces a maximum number of threads check at startup, which ensures that the Elasticsearch process has the rights to create enough threads under normal use. The check verifies that the Elasticsearch process (user) can create at least 2048 threads. We set “ulimit -u 2048” as root before starting Elasticsearch. Since the Elasticsearch Docker image is based on CentOS 7, passing this check confirms that the ulimit is correctly set. Additional information can be found here and here.
  3. Locking the Elasticsearch process address space into RAM: Most operating systems try to use as much memory as possible for file system caches and eagerly swap out unused application memory. This can result in parts of the JVM heap, or even its executable pages, being swapped out to disk. It is vitally important to the health of an Elasticsearch node that no part of the JVM is ever swapped out to disk. Swapping is very bad for performance and node stability, and should be avoided at all costs.

There are several approaches to prevent the JVM from being swapped out to disk, including completely disabling swapping, minimizing swappiness, or memory locking. We follow the last approach, using mlockall to lock the Elasticsearch process address space into RAM, which prevents any Elasticsearch memory from being swapped out. Note, however, that this might cause the JVM or shell session to exit if it tries to allocate more memory than is available. mlockall can be configured as follows:

  • Setting Elasticsearch configuration: bootstrap.memory_lock: true
  • Ensuring that the Elasticsearch user has permission to issue the memory lock request: this can be done by setting “ulimit -l unlimited” as root before starting Elasticsearch (this only applies to .tar.gz and/or .zip installations).

To verify that the setting was applied, we can check the value of mlockall in the output of this request: GET _nodes?filter_path=**.mlockall; “mlockall” should be true. Additional information can be found here and here.

Notes on setting the ulimits:

  • The main issue thread on this topic is still open in the Kubernetes project; however, several users’ comments provide suggestions on how to set the ulimits when using Kubernetes. We chose to set the ulimits from a custom entry-point script, as suggested here. Other related issues can be found here and here.

  • When issuing the ulimit commands for open file descriptors, max processes, or memory locking, we use the values recommended in the corresponding Elasticsearch documentation, which are very large (in some cases unlimited). For some of these ulimits, having the values in place is essential. For example, for max processes (ulimit -u), a maximum number of threads check is enforced when an Elasticsearch node starts, which ensures that the Elasticsearch process has the rights to create enough threads under normal use. The check verifies that the Elasticsearch process (user) can create at least 2048 threads.
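The three limits discussed in this subsection could be applied by an entry-point script along the following lines. This is a minimal sketch, not the script from the accompanying repository: the helper names set_limit and raise_limits are hypothetical, the script is assumed to start as root, the image is assumed to contain an elasticsearch user and group, and dropping privileges with chroot --userspec is only one possible way to hand over to that user.

```shell
#!/bin/bash
# Sketch of a custom entry point (assumption: started as root inside the container).

# Raise a single limit and report a failure, so that a misconfigured
# container fails fast instead of starting Elasticsearch with low limits.
set_limit() {
  local flag="$1" value="$2"
  if ! ulimit "$flag" "$value"; then
    echo "failed to set ulimit $flag $value" >&2
    return 1
  fi
}

# Apply the limits described in the text.
raise_limits() {
  set_limit -n 65536 || return 1      # open file descriptors
  set_limit -u 2048 || return 1       # processes/threads
  set_limit -l unlimited || return 1  # locked memory, needed for bootstrap.memory_lock
}

# Only act when a command is passed, e.g. the Elasticsearch start command.
if [ "$#" -gt 0 ]; then
  raise_limits || exit 1
  # Drop root privileges; limits set above are inherited by the child process.
  # The user and group names are assumptions about the image.
  exec chroot --userspec=elasticsearch:elasticsearch / "$@"
fi
```

Because ulimit only affects the calling shell and its children, the limits have to be raised in the same process chain that eventually starts Elasticsearch, which is why the sketch ends with exec rather than returning.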

B. Providing configuration and custom entry point script:

Elasticsearch requires two configuration files: (1) elasticsearch.yml for configuring Elasticsearch, and (2) log4j2.properties for configuring Elasticsearch logging. Together with the custom entry-point script, we provide these configuration files to override the defaults shipped with the official Elasticsearch image. Moreover, we ensure that the elasticsearch user has the proper access rights to the provisioned configuration files and entry-point script.

Note: to recap this section, it is recommended to take a look at (1) the custom entry-point script and (2) the Dockerfile to relate the textual description to the code.

3. Setting up Kubernetes Objects

The deployment of Elasticsearch on top of Kubernetes, together with the choice of the underlying Kubernetes objects, is highly dependent on the Elasticsearch cluster design and on the expected future size of data and volume of requests. Different Kubernetes objects suit different types of Elasticsearch nodes and deployments. For example, a set of Elasticsearch MASTER nodes can be deployed using a Kubernetes Deployment, together with a headless service attached to the MASTER nodes for service discovery. Elasticsearch DATA nodes, on the other hand, must preserve their data shards across failures and restarts, and therefore require a Kubernetes StatefulSet. We provide a setup for a cluster of 3 nodes, each having MASTER, DATA and CLIENT capabilities.

Since each node must preserve its data shards across failures and restarts, we use a Kubernetes StatefulSet to manage the deployment. Together with the StatefulSet we create a headless service, a ClusterIP service, a LoadBalancer service and a StorageClass. The headless service is required by the StatefulSet to control the network domain of its pods, and is used by the Elasticsearch nodes (pods) for service discovery. The ClusterIP service exposes the Elasticsearch cluster on a cluster-internal IP that is only reachable from within the Kubernetes cluster. The LoadBalancer service, which provides an external/public IP, is used to direct external traffic to one of the Elasticsearch nodes (pods), and the StorageClass is used as the basis for dynamic persistent volume provisioning. In the following sub-sections, we cover the less obvious configurations related to the StatefulSet. The remaining Kubernetes object definitions are straightforward; knowledge of these Kubernetes objects, together with the source code, is sufficient to grasp their mechanics and purpose.
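As an illustration of two of the supporting objects, the headless service and the storage class might look roughly as follows. This is a sketch under assumptions rather than the definitions from the repository: the service name es-discovery-svc is the one used later in this post, while the label app: elasticsearch, the storage class name es-storage-class and the choice of the GCE persistent disk provisioner are assumptions made for the example.

```yaml
# Headless service: clusterIP "None" makes DNS return the pod IPs directly,
# which the Elasticsearch nodes can use to discover each other.
apiVersion: v1
kind: Service
metadata:
  name: es-discovery-svc
spec:
  clusterIP: None
  selector:
    app: elasticsearch        # assumed pod label
  ports:
    - name: transport
      port: 9300              # default Elasticsearch transport port
---
# Storage class used for dynamic provisioning of the data volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: es-storage-class      # assumed name
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard           # standard (non-SSD) persistent disk
```

The ClusterIP and LoadBalancer services would follow the same shape as the first object, with type set accordingly and the HTTP port (9200 by default) exposed instead of the transport port.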

A. StatefulSet configuration - Setting access rights for mount paths:

We use a Kubernetes StatefulSet with Elasticsearch data nodes to preserve data shards across failures and restarts. This requires using the StatefulSet's volumeClaimTemplates to mount a persistent volume in each container at the data directory in use, in our case the default: /usr/share/elasticsearch/data.

Volumes are mounted into the container when it starts, with root access rights, overriding any existing permissions on the mount paths if those paths already existed in the image. To allow the elasticsearch user read/write access to the mounted volumes, we use the Kubernetes fsGroup setting in the Pod security context. More information on how to use fsGroup can be found here and here.

Note: In scenarios where more than one container needs shared access to a mounted persistent volume, the discussions here and here are quite helpful.
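The combination of a volume claim template and fsGroup described in this subsection can be sketched as the following StatefulSet fragment. It is not a complete manifest, and several values are assumptions: the group id 1000 (the id the official image is documented to use for the elasticsearch user), the volume name es-data, the storage class name es-storage-class and the requested size.

```yaml
# Fragment of a StatefulSet definition (not a complete manifest).
spec:
  template:
    spec:
      securityContext:
        fsGroup: 1000                   # assumed GID of the elasticsearch user
      containers:
        - name: elasticsearch
          volumeMounts:
            - name: es-data             # must match the claim template name below
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: es-data                   # assumed name
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: es-storage-class   # assumed name
        resources:
          requests:
            storage: 10Gi               # assumed size
```

With fsGroup set, Kubernetes makes the mounted volume group-owned by that GID, so the non-root Elasticsearch process can write to its data directory.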

B. StatefulSet configuration - Using init-containers to increase the mmap count limit

Elasticsearch uses a hybrid mmapfs / niofs directory by default to store its indices. The default operating system limit on mmap counts is likely to be too low, which may result in out-of-memory exceptions. On Linux, the limit can be increased by running “sysctl -w vm.max_map_count=262144” as root. We use Kubernetes init-containers to set that up. Additional information on mmap count limits can be found in this link.

Note: init-containers are specialized containers that run before app containers and can contain utilities or setup scripts not present in an app image. While init-containers are suitable for increasing the mmap count limit, they are not suitable for setting the ulimits, which we had to provide through the custom entry-point script. The reason is that the ulimit command affects only the process that runs the command and its subprocesses, whereas an init-container runs to completion in a completely different process before the normal app container is started. The vm.max_map_count value, by contrast, is a kernel setting of the host node, so it remains in effect after the init-container exits.
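An init-container that raises the mmap count limit could be declared roughly as below. This is a sketch under assumptions: the container name and the busybox image are illustrative choices, and the cluster is assumed to permit privileged containers, which writing this sysctl requires.

```yaml
# Fragment of a StatefulSet definition (not a complete manifest).
spec:
  template:
    spec:
      initContainers:
        - name: increase-vm-max-map-count   # assumed name
          image: busybox                    # assumed image; any image with sysctl works
          command: ["sysctl", "-w", "vm.max_map_count=262144"]
          securityContext:
            privileged: true                # needed to change a kernel setting of the node
```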

C. StatefulSet configuration - Exposing environment variables to the Elasticsearch container

In Kubernetes, when defining a container, it is often necessary to expose relevant Node, Pod, and Container information to the container as environment variables. In most situations, Pod information such as POD_IP and POD_NAME is required to configure the applications running in the Pod's containers.

From within the Statefulset object definition, we expose the POD_IP, POD_NAME and the Kubernetes NAMESPACE to the Elasticsearch container. We reference the exported environment variables from the Elasticsearch configuration file. More details about how to export environment variables to containers can be found here and here.

Note: In addition to providing relevant pod and namespace information, we provide the JVM heap size by exporting the ES_JAVA_OPTS environment variable. By default, Elasticsearch tells the JVM to use a heap with a minimum and maximum size of 2 GB; however, this should always be set in accordance with the Node specification and is highly dependent on how the cluster is used. More information on how to set the JVM heap size can be found here and here.
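The environment variables named in this subsection can be populated from pod fields in the container definition, for example as in the fragment below. The variable names come from the text above; the heap value shown for ES_JAVA_OPTS is only an illustrative assumption and should be sized for the actual nodes.

```yaml
# Fragment of the Elasticsearch container definition inside the StatefulSet.
env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
  - name: POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: ES_JAVA_OPTS
    value: "-Xms2g -Xmx2g"    # assumed value; minimum and maximum heap kept equal
```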

4. Elasticsearch Configuration

While an Elasticsearch configuration contains settings that are node specific, we export Kubernetes pod and namespace information as environment variables in order to have a single configuration file that can be applied across different Elasticsearch nodes (containers), and even across deployments, assuming different deployments are triggered from different Kubernetes namespaces. In this section, we describe the configuration parameters that should be considered for production. More information about the configuration of Elasticsearch and Docker specifics can be found here, here and here.

Configuration parameters

  • [cluster.name]: We use elasticsearch-${NAMESPACE} as the cluster name, where ${NAMESPACE} represents the Kubernetes namespace where the cluster is deployed. Assuming different clusters are deployed in different namespaces, this can be very helpful, for example to differentiate between the clusters on the monitoring side. It is important to make sure that different clusters have different names, to prevent accidents where nodes intended for one cluster join another.
  • [node.name]: By default, Elasticsearch uses the first 7 characters of the randomly generated UUID that serves as the node id. However, to be able to distinguish and relate nodes, their logs, and their DTAP environment (e.g. acceptance vs. production), we use ${POD_NAME}-${NAMESPACE} as the node.name, where ${POD_NAME} represents the name of the pod and ${NAMESPACE} represents the Kubernetes namespace (typically representing the DTAP environment).
  • [network.host]: In order to communicate and form a cluster with nodes on other servers, we need to bind the node to a non-loopback address. We use ${POD_IP} for that purpose. Once a custom setting for [network.host] is provided, Elasticsearch assumes that the current deployment is a production deployment, and upgrades a number of system startup checks from warnings to exceptions.
  • [discovery.zen.ping.unicast.hosts]: This setting enables node discovery by providing a seed list of other nodes in the cluster that are likely to be live and contactable. When a hostname resolves to multiple IP addresses, Elasticsearch tries all resolved addresses; thus, we use the name of the headless service, in our case es-discovery-svc.
  • [discovery.zen.minimum_master_nodes]: This setting tells Elasticsearch not to elect a master unless there are enough master-eligible nodes available. It should always be set to a quorum (majority) of the master-eligible nodes, i.e. (n/2) + 1, where n is the number of master-eligible nodes and the division is rounded down. Without this setting, a cluster that suffers a network failure is at risk of splitting into two independent clusters (a split brain), which will lead to data loss. More information can be found here.

Note: since we are provisioning a 3 node Elasticsearch cluster, with each node having MASTER capabilities, we set [discovery.zen.minimum_master_nodes] to (3/2)+1 = 2, using integer division. However, if we scale the StatefulSet, we need to reconfigure that value accordingly.
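The quorum arithmetic can be checked with a few lines of shell, which may help when recomputing the value after scaling. The function name quorum is a hypothetical helper introduced only for this illustration; shell arithmetic uses integer division, which matches the rounding intended by the formula.

```shell
#!/bin/bash
# Majority of n master-eligible nodes, using integer division: (n / 2) + 1.
quorum() {
  local n="$1"
  echo $(( n / 2 + 1 ))
}

# Print the value to configure for a few cluster sizes.
for nodes in 3 4 5; do
  echo "master-eligible=${nodes} minimum_master_nodes=$(quorum "$nodes")"
done
```

For 3, 4 and 5 master-eligible nodes this prints 2, 3 and 3 respectively, so growing from 3 to 4 nodes already requires raising the setting.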

  • [bootstrap.memory_lock]: Set to true to lock the Elasticsearch process address space into RAM. This is explained in more detail in “2.A.3”.
  • [path.data]: Since we use an externally mounted volume for preserving the data, the default value for [path.data] is fine. Being a sub-folder of ${ES_HOME}, the default [path.data] “/usr/share/elasticsearch/data” is not recommended and can cause data loss in normal, non-Dockerized Elasticsearch deployments, for example when re-installing or updating Elasticsearch, where the Elasticsearch home path “/usr/share/elasticsearch” might be completely removed from the system. For our Kubernetes setup, this parameter must be identical to the path where the persistent volume is mounted.

Recovery Settings: To avoid excessive, unnecessary shard swapping that can occur on cluster restarts as part of the cluster recovery process, we configure the following:

  • [gateway.recover_after_nodes: 2] Specifies how many nodes must be present before the cluster is considered functional. This prevents Elasticsearch from starting a recovery process until these nodes are available.
  • [gateway.expected_nodes: 3] Specifies how many nodes are expected in the cluster.
  • [gateway.recover_after_time: 5m] Specifies how long to wait after [gateway.recover_after_nodes] is reached before starting the recovery process.
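Put together, the parameters described in this section would yield an elasticsearch.yml along these lines. The values are the ones stated in the text; the file in the repository may contain further settings, so this should be read as a summary sketch rather than the exact configuration.

```yaml
# elasticsearch.yml assembled from the parameters described above.
# ${NAMESPACE}, ${POD_NAME} and ${POD_IP} are resolved by Elasticsearch from
# the environment variables exported in the StatefulSet definition.
cluster.name: elasticsearch-${NAMESPACE}
node.name: ${POD_NAME}-${NAMESPACE}
network.host: ${POD_IP}

# Discovery through the headless service; quorum of 2 for 3 master-eligible nodes.
discovery.zen.ping.unicast.hosts: es-discovery-svc
discovery.zen.minimum_master_nodes: 2

bootstrap.memory_lock: true
path.data: /usr/share/elasticsearch/data

# Recovery settings.
gateway.recover_after_nodes: 2
gateway.expected_nodes: 3
gateway.recover_after_time: 5m
```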

5. Conclusion

In this blog post, we provided a detailed guide for deploying a production-grade, 3 node Elasticsearch cluster using a Kubernetes StatefulSet. The provisioned setup has been tested on Google Container Engine, using Google’s standard persistent disk as the persistent storage for the Elasticsearch nodes, with the custom Elasticsearch Docker image hosted on Docker Hub.

In the provided setup, it is possible to scale the cluster up and down with minimal overhead. Any additional node that joins the cluster when the StatefulSet is scaled up will have a configuration identical to that of the existing/initial nodes.

Relevant source code, including the Kubernetes object configuration files as well as the custom-built Elasticsearch Dockerfile and configuration, is open source and can be found here. Additionally, Elasticsearch Docker and deployment guides can be found here and here, respectively.

Note: in this blog post and the provisioned setup we skipped the details of (1) resource provisioning for the defined Kubernetes pods/containers, (2) Kubernetes container readiness and liveness probes, and (3) securing access to the Kubernetes cluster and the provisioned services. These concepts need to be considered and are as important as anything mentioned in this post. Finally, it is essential to consider changing the default password of the elastic user in the configuration if X-Pack is used (it is by default).