Step through this guide to learn how to set up a highly available Elasticsearch cluster with Keepalived. Setting up a highly available Elasticsearch cluster with Keepalived is a pivotal step in ensuring the robustness and reliability of your Elasticsearch infrastructure. Elasticsearch, being a distributed search and analytics engine, thrives on seamless availability and fault tolerance. Keepalived, a powerful and flexible tool, adds an extra layer of high availability by providing IP failover and monitoring services.
Setting up Highly Available Elasticsearch Cluster with Keepalived
So, how can you set up a highly available Elasticsearch cluster with Keepalived?
Setup Elasticsearch Cluster
Ensure you have a running cluster. Check our guide below on how to set up a multinode Elasticsearch cluster.
Setup Multinode Elasticsearch 8.x Cluster
We already have a healthy three-node Elasticsearch cluster;
curl -k -XGET "https://es-node01:9200/_cat/health?v" -u elastic
Enter host password for user 'elastic':
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1700763329 18:15:29 kifarunix-demo green 3 3 2 1 0 0 0 0 - 100.0%
Configure Elasticsearch to Listen on All Interfaces
To configure the Elasticsearch cluster for high availability, you need to ensure that it can accept connections on the VIP.
As such, edit the ES configuration file on each cluster node and configure it to listen on all interfaces.
vim /etc/elasticsearch/elasticsearch.yml
In my current setup, each Elasticsearch service is configured to listen on the respective node IP;
# ---------------------------------- Network -----------------------------------
#
# By default Elasticsearch is only accessible on localhost. Set a different
# address here to expose this node on the network:
#
network.host: 192.168.122.50
To ensure that Elasticsearch is listening on all interfaces, update this line and set the address to 0.0.0.0.
# ---------------------------------- Network -----------------------------------
#
# By default Elasticsearch is only accessible on localhost. Set a different
# address here to expose this node on the network:
#
network.host: 0.0.0.0
Save and exit the file.
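If you prefer to script this change across the nodes instead of editing manually, a sed one-liner like the one below should work, assuming the network.host line is currently set to a single IP as shown above;
sed -i 's/^network.host:.*/network.host: 0.0.0.0/' /etc/elasticsearch/elasticsearch.yml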
Restart the Elasticsearch service on each node;
systemctl restart elasticsearch
Confirm the service is up and listening on all interfaces;
ss -altnp | grep :9200
LISTEN 0 4096 *:9200 *:* users:(("java",pid=1356,fd=488))
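Once all the nodes are back up, you can also re-run the cluster health check to confirm the cluster has returned to green;
curl -k -XGET "https://es-node01:9200/_cat/health?v" -u elastic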
Install Keepalived on Cluster Nodes
Keepalived is an open-source software solution that plays a pivotal role in maintaining high availability and fault tolerance in Linux-based systems. It accomplishes this critical task by actively monitoring the health of servers within a cluster and in the event of a server failure, Keepalived automatically orchestrates the seamless transition of a virtual IP address (VIP) to a healthy server in the cluster, ensuring uninterrupted service delivery.
This process is fundamental in achieving and sustaining high availability, minimizing downtime, and enhancing the overall reliability of applications and services. Keepalived is often employed alongside the Linux Virtual Server (LVS) kernel module to provide not only fault tolerance but also load balancing capabilities, distributing network traffic across multiple servers.
Install Keepalived on all your cluster nodes using your distro-specific package manager.
Ubuntu/Debian;
apt install keepalived
CentOS/RHEL distros;
yum install keepalived
Configure non-local IP binding
You need to allow Keepalived to bind to a non-local IP address, that is, the failover IP address (floating IP or VIP), even when that address is not currently assigned to the node.
echo "net.ipv4.ip_nonlocal_bind = 1" >> /etc/sysctl.conf
Reload sysctl settings;
sysctl -p
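You can confirm that the kernel parameter is now set; the command below should print net.ipv4.ip_nonlocal_bind = 1.
sysctl net.ipv4.ip_nonlocal_bind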
Configure Keepalived High Availability
Keepalived can operate in two primary modes;
- Active/Passive (Master/Backup) Mode: In this mode, one node serves as the active (or master) node, handling traffic for the virtual IP address (VIP). The other nodes remain in a passive (or backup) state, ready to take over if the active node fails based on their priorities. This mode is commonly used for scenarios where high availability is the primary goal, and only one node actively processes traffic at a time.
- Active/Active Mode: In this mode, multiple nodes actively handle traffic for the virtual IP address simultaneously. Each node has a separate IP address range. This is commonly used in scenarios where load balancing is a priority, and traffic distribution across multiple nodes is desired.
We will be doing active/passive configuration of Keepalived in this guide.
The default configuration file for Keepalived should be /etc/keepalived/keepalived.conf. However, on Ubuntu/Debian systems, only a sample of this configuration, /etc/keepalived/keepalived.conf.sample, is created by default.
Thus, you can copy the sample configuration file into place as follows.
cp /etc/keepalived/keepalived.conf{.sample,}
This is how the sample configuration file looks;
cat /etc/keepalived/keepalived.conf.sample
! Configuration File for keepalived

global_defs {
   notification_email {
     acct@firewall.loc
     failover@firewall.loc
     sysadmin@firewall.loc
   }
   notification_email_from Alexandre.Cassen@firewall.loc
   smtp_server 192.168.200.1
   smtp_connect_timeout 30
   router_id LVS_DEVEL
   vrrp_skip_check_adv_addr
   vrrp_strict
   vrrp_garp_interval 0
   vrrp_gna_interval 0
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.200.16
        192.168.200.17
        192.168.200.18
    }
}

virtual_server 192.168.200.100 443 {
    delay_loop 6
    lb_algo rr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP

    real_server 192.168.201.100 443 {
        weight 1
        SSL_GET {
            url {
              path /
              digest ff20ad2481f97b1754ef3e12ecd3a9cc
            }
            url {
              path /mrtg/
              digest 9b3a0c85a887a256d6939da88aabd8cd
            }
            connect_timeout 3
            retry 3
            delay_before_retry 3
        }
    }
}

virtual_server 10.10.10.2 1358 {
    delay_loop 6
    lb_algo rr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP

    sorry_server 192.168.200.200 1358

    real_server 192.168.200.2 1358 {
        weight 1
        HTTP_GET {
            url {
              path /testurl/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            url {
              path /testurl2/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            url {
              path /testurl3/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            connect_timeout 3
            retry 3
            delay_before_retry 3
        }
    }

    real_server 192.168.200.3 1358 {
        weight 1
        HTTP_GET {
            url {
              path /testurl/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334c
            }
            url {
              path /testurl2/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334c
            }
            connect_timeout 3
            retry 3
            delay_before_retry 3
        }
    }
}

virtual_server 10.10.10.3 1358 {
    delay_loop 3
    lb_algo rr
    lb_kind NAT
    persistence_timeout 50
    protocol TCP

    real_server 192.168.200.4 1358 {
        weight 1
        HTTP_GET {
            url {
              path /testurl/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            url {
              path /testurl2/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            url {
              path /testurl3/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            connect_timeout 3
            retry 3
            delay_before_retry 3
        }
    }

    real_server 192.168.200.5 1358 {
        weight 1
        HTTP_GET {
            url {
              path /testurl/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            url {
              path /testurl2/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            url {
              path /testurl3/test.jsp
              digest 640205b7b0fc66c1ea91c463fac6334d
            }
            connect_timeout 3
            retry 3
            delay_before_retry 3
        }
    }
}
You can then edit and update the configuration to suit your cluster setup.
vim /etc/keepalived/keepalived.conf
Below are our Keepalived configurations for each node in the cluster;
Node 01;
vrrp_script check_elasticsearch {
    script "/usr/bin/systemctl is-active elasticsearch.service"
    interval 5
    weight 10
}

vrrp_instance ES_HA {
    state MASTER
    interface enp1s0
    virtual_router_id 100
    priority 200
    advert_int 1
    unicast_src_ip 192.168.122.12
    unicast_peer {
        192.168.122.73/24
        192.168.122.50/24
    }
    virtual_ipaddress {
        192.168.122.100/24
    }
    authentication {
        auth_type PASS
        auth_pass YOUR_PASSWORD_HERE
    }
    track_script {
        check_elasticsearch
    }
}
Node 02;
vrrp_script check_elasticsearch {
    script "/usr/bin/systemctl is-active elasticsearch.service"
    interval 5
    weight 10
}

vrrp_instance ES_HA {
    state BACKUP
    interface enp1s0
    virtual_router_id 100
    priority 199
    advert_int 1
    unicast_src_ip 192.168.122.73
    unicast_peer {
        192.168.122.12/24
        192.168.122.50/24
    }
    virtual_ipaddress {
        192.168.122.100/24
    }
    authentication {
        auth_type PASS
        auth_pass YOUR_PASSWORD_HERE
    }
    track_script {
        check_elasticsearch
    }
}
Node 03;
vrrp_script check_elasticsearch {
    script "/usr/bin/systemctl is-active elasticsearch.service"
    interval 5
    weight 10
}

vrrp_instance ES_HA {
    state BACKUP
    interface enp1s0
    virtual_router_id 100
    priority 198
    advert_int 1
    unicast_src_ip 192.168.122.50
    unicast_peer {
        192.168.122.12/24
        192.168.122.73/24
    }
    virtual_ipaddress {
        192.168.122.100/24
    }
    authentication {
        auth_type PASS
        auth_pass YOUR_PASSWORD_HERE
    }
    track_script {
        check_elasticsearch
    }
}
The configuration has two sections: the VRRP script section and the VRRP instance section.
The VRRP script section:
- check_elasticsearch: This is the user-defined name for the VRRP script.
- script "/usr/bin/systemctl is-active elasticsearch.service": Specifies the script or command to be executed. In this case, it checks if the Elasticsearch service (elasticsearch.service) is active using the systemctl command.
- interval 5: Sets the interval, in seconds, at which the script is executed. In this example, it checks the status every 5 seconds.
- weight 10: The weight applied to the node's priority based on the script result. A positive number is added to the priority if the check succeeds; a negative number is subtracted from the priority if the check fails.
You can use other types of tracking as well, for example (see the sketch after this list):
- process tracking: Monitors the status of a specified process on a node. If the process is running, the node is considered healthy, and its priority is increased by the weight value.
- interface tracking: Monitors the status of a network interface. If the specified interface is up, the node's priority is increased by the weight value.
- kernel table tracking: Monitors the existence of a specified kernel routing table entry. If the entry is present, the node's priority is increased by the weight value.
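For illustration, below is a minimal sketch of what process and interface tracking could look like in keepalived.conf. This is not part of our working configuration; the process name (java) and the interface name (enp1s0) are assumptions you would adapt to your environment.
vrrp_track_process check_es_process {
    process java    # hypothetical: match the Elasticsearch JVM process
    weight 10
}

vrrp_instance ES_HA {
    # ... rest of the instance configuration as shown above ...
    track_process {
        check_es_process
    }
    track_interface {
        enp1s0 weight -20    # subtract 20 from the priority if the interface goes down
    }
}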
The VRRP instance section:
- vrrp_instance <STRING>: Defines the name of the VRRP instance.
- state MASTER: Sets the initial state of this node to master. The other possible state is BACKUP.
- interface enp1s0: Specifies the network interface associated with this VRRP instance.
- virtual_router_id 100: A numeric identifier for this VRRP instance. Nodes with the same virtual_router_id belong to the same VRRP group.
- priority 200: The priority of this node in the VRRP group. Higher-priority nodes are more likely to become the master, and the tracking script's weight dynamically adjusts this priority. Depending on the weight of the tracking script/process, ensure there is no huge gap between the cluster nodes' priority values. A huge gap might cause the node with the highest priority to retain the VIP and not release it even after the service check fails.
- advert_int 1: The advertisement interval, in seconds; determines how often the master node sends advertisements to the other nodes.
- unicast_src_ip <IP>: Specifies the source IP address for unicast communication. In this case, the IP of the respective node.
- unicast_peer: Specifies the unicast peers, that is, the rest of the cluster nodes in the VRRP group.
- virtual_ipaddress: The virtual IP address associated with this VRRP instance. Clients connect to this IP, which is hosted on the current master node.
- authentication: Configures authentication for VRRP messages. In this case, it uses a simple password. The credentials are stored in plain text, hence you need to focus on securing access to your system.
  - auth_type: This parameter specifies the authentication type. In this case, the authentication type is PASS.
  - auth_pass: This parameter specifies the authentication password. In this case, the password is YOUR_PASSWORD_HERE.
- track_script { check_elasticsearch }: Associates the check_elasticsearch script with this VRRP instance, meaning the VRRP priority will be dynamically adjusted based on the script's result.
Read more on man keepalived.conf.
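Before starting the service, you can optionally validate the configuration syntax. Recent Keepalived releases support a config-test mode (the flag may not be available on older versions);
keepalived -t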
Running Keepalived
You can now start and enable Keepalived to run on system boot on all nodes;
systemctl enable --now keepalived
If already running, restart;
systemctl restart keepalived
Check the status on the master node, which is node01 for us;
systemctl status keepalived
● keepalived.service - Keepalive Daemon (LVS and VRRP)
Loaded: loaded (/lib/systemd/system/keepalived.service; enabled; preset: enabled)
Active: active (running) since Thu 2023-11-23 16:27:02 EST; 7s ago
Docs: man:keepalived(8)
man:keepalived.conf(5)
man:genhash(1)
https://keepalived.org
Main PID: 1811 (keepalived)
Tasks: 2 (limit: 4645)
Memory: 3.0M
CPU: 26ms
CGroup: /system.slice/keepalived.service
├─1811 /usr/sbin/keepalived --dont-fork
└─1814 /usr/sbin/keepalived --dont-fork
Nov 23 16:27:02 es-node01.kifarunix-demo.com Keepalived[1811]: Starting VRRP child process, pid=1814
Nov 23 16:27:02 es-node01.kifarunix-demo.com systemd[1]: keepalived.service: Got notification message from PID 1814, but reception only permitted for main PID 1811
Nov 23 16:27:02 es-node01.kifarunix-demo.com Keepalived_vrrp[1814]: Script user 'keepalived_script' does not exist
Nov 23 16:27:02 es-node01.kifarunix-demo.com Keepalived_vrrp[1814]: SECURITY VIOLATION - scripts are being executed but script_security not enabled.
Nov 23 16:27:02 es-node01.kifarunix-demo.com Keepalived[1811]: Startup complete
Nov 23 16:27:02 es-node01.kifarunix-demo.com systemd[1]: Started keepalived.service - Keepalive Daemon (LVS and VRRP).
Nov 23 16:27:02 es-node01.kifarunix-demo.com Keepalived_vrrp[1814]: (ES_HA) Entering BACKUP STATE (init)
Nov 23 16:27:02 es-node01.kifarunix-demo.com Keepalived_vrrp[1814]: VRRP_Script(check_elasticsearch) succeeded
Nov 23 16:27:02 es-node01.kifarunix-demo.com Keepalived_vrrp[1814]: (ES_HA) Changing effective priority from 200 to 210
Nov 23 16:27:05 es-node01.kifarunix-demo.com Keepalived_vrrp[1814]: (ES_HA) Entering MASTER STATE
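You may have noticed the "Script user 'keepalived_script' does not exist" and "SECURITY VIOLATION - scripts are being executed but script_security not enabled" warnings in the logs above. One way to address them is to define the script user and enable script security in the global_defs section of /etc/keepalived/keepalived.conf on each node, then restart Keepalived. A minimal sketch, assuming you are comfortable running the check script as root;
global_defs {
    script_user root
    enable_script_security
}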
You can as well check the status on the other nodes;
The master node, which in our case is node01, should now have the VIP assigned.
ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:df:44:43 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.12/24 brd 192.168.122.255 scope global dynamic enp1s0
valid_lft 3047sec preferred_lft 3047sec
inet 192.168.122.100/24 scope global secondary enp1s0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fedf:4443/64 scope link
valid_lft forever preferred_lft forever
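With the VIP now assigned to the master node, you can confirm that Elasticsearch is reachable through it;
curl -k -XGET "https://192.168.122.100:9200/_cat/health?v" -u elastic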
Simulate High Availability
To simulate high availability, stop the Elasticsearch service on the node with the highest priority, in this case, node01. Only do this if it is safe for you to do so!
We are stopping Elasticsearch because, in our VRRP script, we are using the status of the Elasticsearch service to guide Keepalived to take appropriate actions, such as updating the node's priority, triggering a failover, and re-assigning the VIP to another node with a higher effective priority.
systemctl stop elasticsearch
At the same time, check the logs on the rest of the nodes;
Node02;
journalctl -f -u keepalived.service
Nov 24 06:33:13 es-node02.kifarunix-demo.com Keepalived_vrrp[12643]: (ES_HA) received lower priority (200) advert from 192.168.122.12 - discarding
Nov 24 06:33:14 es-node02.kifarunix-demo.com Keepalived_vrrp[12643]: (ES_HA) received lower priority (200) advert from 192.168.122.12 - discarding
Nov 24 06:33:15 es-node02.kifarunix-demo.com Keepalived_vrrp[12643]: (ES_HA) received lower priority (200) advert from 192.168.122.12 - discarding
Nov 24 06:33:16 es-node02.kifarunix-demo.com Keepalived_vrrp[12643]: (ES_HA) Entering MASTER STATE
It has entered the master state and should now have the VIP;
root@es-node02:~# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:05:b7:40 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.73/24 brd 192.168.122.255 scope global dynamic enp1s0
valid_lft 2279sec preferred_lft 2279sec
inet 192.168.122.100/24 scope global secondary enp1s0
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe05:b740/64 scope link
valid_lft forever preferred_lft forever
On node03, the priority is still lower, so it will remain in the backup state.
journalctl -f -u keepalived.service
Nov 24 06:33:13 en-node03.kifarunix-demo.com Keepalived_vrrp[12627]: (ES_HA) received lower priority (200) advert from 192.168.122.12 - discarding
Nov 24 06:33:14 en-node03.kifarunix-demo.com Keepalived_vrrp[12627]: (ES_HA) received lower priority (200) advert from 192.168.122.12 - discarding
Nov 24 06:33:15 en-node03.kifarunix-demo.com Keepalived_vrrp[12627]: (ES_HA) received lower priority (200) advert from 192.168.122.12 - discarding
If you stop Elasticsearch on both node01 and node02, then node03 will become the master and be assigned the VIP.
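Similarly, once you start Elasticsearch again on node01, its effective priority rises back above the others and, since VRRP preemption is enabled by default, node01 should reclaim the VIP;
systemctl start elasticsearch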
Send Logs to Elasticsearch Cluster VIP Address
You can now configure your agents, whatever they are, to send logs to the Elasticsearch cluster via the VIP address.
For example, since I am using Filebeat to send logs to the Elasticsearch cluster, I have to edit the config file and point the Elasticsearch output at the cluster VIP;
See example;
vim /etc/filebeat/filebeat.yml
...
# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
# Array of hosts to connect to.
hosts: ["elk.kifarunix-demo.com:9200"]
# Protocol - either `http` (default) or `https`.
protocol: "https"
ssl.certificate_authorities: "/etc/filebeat/es-ca.crt"
# Authentication credentials - either API key or username/password.
#api_key: "id:api_key"
username: "USER"
password: "PASS"
...
elk.kifarunix-demo.com is configured to resolve to the ES VIP;
ping elk.kifarunix-demo.com -c 4
PING elk.kifarunix-demo.com (192.168.122.100) 56(84) bytes of data.
64 bytes from elk.kifarunix-demo.com (192.168.122.100): icmp_seq=1 ttl=64 time=0.301 ms
64 bytes from elk.kifarunix-demo.com (192.168.122.100): icmp_seq=2 ttl=64 time=0.329 ms
64 bytes from elk.kifarunix-demo.com (192.168.122.100): icmp_seq=3 ttl=64 time=0.404 ms
64 bytes from elk.kifarunix-demo.com (192.168.122.100): icmp_seq=4 ttl=64 time=0.359 ms
--- elk.kifarunix-demo.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3098ms
rtt min/avg/max/mdev = 0.301/0.348/0.404/0.040 ms
Ensure you are using wildcard Elasticsearch SSL/TLS certificates so that you can connect to any of the ES cluster nodes without having to reconfigure the agents/clients each time to use the respective node hostname.
You can check the guide below on how to generate Wildcard SSL certs for Elasticsearch.
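To confirm which names your current certificate actually covers, you can inspect its subject alternative names from any client, assuming the VIP hostname already resolves;
echo | openssl s_client -connect elk.kifarunix-demo.com:9200 2>/dev/null | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"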
Once you have configured your clients with the right SSL/TLS certificates, then test the connection;
E.g for Filebeat;
filebeat test output
elasticsearch: https://elk.kifarunix-demo.com:9200...
parse url... OK
connection...
parse host... OK
dns lookup... OK
addresses: 192.168.122.100, 192.168.122.100
dial up... OK
TLS...
security: server's certificate chain verification is enabled
handshake... OK
TLS version: TLSv1.3
dial up... OK
talk to server... OK
version: 8.11.1
Perfect!
Next, run Filebeat in the foreground, logging to standard output, and ensure that it can establish a connection to Elasticsearch;
filebeat -e
Watch for the connection.
If you see the line below, then you are all set! Otherwise, troubleshoot the issue.
{"log.level":"info","@timestamp":"2023-11-25T08:51:30.695Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/client_worker.go","file.line":145},"message":"Connection to backoff(elasticsearch(https://elk.kifarunix-demo.com:9200)) established","service.name":"filebeat","ecs.version":"1.6.0"}
And that is how you can set up a highly available Elasticsearch cluster with Keepalived.