A Simple Leader Election Solution for Kubeadm in an AWS Auto Scaling Group

Our Journey to Kubernetes Using Kubeadm 

At Revelry, when we began our Kubernetes journey we decided to use kubeadm to build our Kubernetes clusters. As a CNCF-backed project, kubeadm will continue to have support as well as a great feature roadmap, so we thought it was a good place to start. 

At the time we began, kubeadm was fairly nascent. It wasn’t super well documented and multi-master (HA) support was still experimental. Still, kubeadm did most of what we needed and was so much simpler than other solutions we considered. It held your hand much more than trying to do Kubernetes “The Hard Way”.

The Problem: Using Kubeadm in an AWS Autoscaling Group

Kubeadm was a great solution; however, there was one major roadblock to adoption in our environment. We wanted to use AWS Auto Scaling groups (ASGs) for our Kubernetes master nodes. Our previous solution (kops) used ASGs for master nodes, which allowed us to simply terminate a master node and have it immediately replaced and joined to the cluster. We wanted this functionality, but we wanted to stick with kubeadm, so we had to figure it out for ourselves.

In kubeadm the first master does the cluster initialization work with kubeadm init, and then you never have to do that again. In fact you have to make CERTAIN that it never runs again, or you will re-init your cluster and break it. Each subsequent master node should run kubeadm join to join the cluster. To solve how that would work within the context of an ASG, we needed some form of leader election among the nodes.

Why Leader Election? 

The reason we needed leader election to replicate this feature in kubeadm is tied to how an ASG works. An ASG boots all the nodes in it at the same time with the same configuration. Our node initialization bash script needed some way to determine which of those nodes would run kubeadm init and which would run kubeadm join. There’s nothing built into ASG functionality to accomplish this. Without some way to pick a leader and make every node aware of which node that is, there is no way to guarantee kubeadm init is run once and only once. If this isn’t fully making sense, keep reading; we will walk through it in the next section.

Solving the Leader Election Problem

Leader election can be complicated. The reason that etcd (the datastore at the heart of Kubernetes) exists is that CoreOS needed to do leader election. We considered using etcd for this, but we really like simple here at Revelry and wanted to see just how simple we could make it.

A quick search for simple leader election turned up a very simple method written in Python.

This simple method was a great place to start. We really only need this leader election to happen once, so a very simple system is fine. And we didn’t want a lot of dependencies. We do all of our other cluster initialization with a single bash script that we upload via cloud-init using Terraform. So we decided to rewrite this method as a bash function within our cluster init script.

The Leader Election Bash Script Explained

Our script won’t work for you out of the box since our environment is likely somewhat different from yours, but it should give you an idea of how to build your solution and some bits and pieces to crib from if you like. I will walk through the relevant parts of our script in execution order, since that is the easiest way to explain it.

Step 1. Check to See if a Leader Has Already Been Elected and Has Initialized Kubernetes

####
# Check whether we are the first master via simple leader election, then run the appropriate function
####

# Check S3 to see if a master has already initialized; if so, run join and exit the script.
if aws s3 ls "s3://$PKI_BUCKET/" | grep -q discovery-token-ca-cert-hash
then
    join_master
    exit
fi

First, we check whether kubeadm init has already been run. This is very important because if a node reruns kubeadm init we will break the cluster. As you will see later in the leader election part of the script, whichever node is elected the leader will write out a file generated by kubeadm init into an S3 bucket. If that file exists, a leader exists and it is not us, so we don’t need to check any further. We then run kubeadm join and join the cluster as a member.
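The same guard can be wrapped in a small function so it reads as a yes/no question. This is only a sketch: `cluster_initialized` is a name made up for illustration, and the default marker name simply mirrors the file name grepped for above.

```bash
#!/usr/bin/env bash

# cluster_initialized BUCKET [MARKER]
# Succeeds (exit 0) if an object whose listing line contains MARKER
# shows up in the bucket, fails otherwise.
# Hypothetical helper; the marker default mirrors the check shown above.
cluster_initialized() {
    local bucket="$1"
    local marker="${2:-discovery-token-ca-cert-hash}"
    # The function's exit status is grep's: 0 when the marker is listed.
    aws s3 ls "s3://${bucket}/" 2>/dev/null | grep -q -- "$marker"
}

# Example wiring (join_master comes from the script above):
# if cluster_initialized "$PKI_BUCKET"; then join_master; exit 0; fi
```

One caveat with this shape: a failed `aws` call looks exactly like a missing marker, so a real script may want to retry the listing a few times before concluding that no leader exists yet.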

Step 2. Initial Environment Setup

# If there is no S3 file indicating the master has finished initializing, check whether we are the leader.
# If we are, run init; if not, run join.

# This is the endpoint where we get our EC2 instance metadata
EC2_ENDPOINT="http://169.254.169.254/latest/meta-data"

# What is this instance's ID?
MY_INSTANCE_ID=$(curl -s "$EC2_ENDPOINT/instance-id")

# Which AZ is this instance in?
MY_AZ=$(curl -s "$EC2_ENDPOINT/placement/availability-zone")

# Trim the trailing zone letter off the availability zone to get the region (us-east-1a -> us-east-1);
# export it so we don't have to append it to all the aws commands.
# The doubled $$ is the Terraform template escape for a literal ${...}, since Terraform renders this script.
export AWS_DEFAULT_REGION=$${MY_AZ::-1}

# Which ASG is this instance in?
ASG_NAME=$(aws ec2 describe-tags --filters "Name=resource-id,Values=$MY_INSTANCE_ID" "Name=key,Values=aws:autoscaling:groupName" | jq -r '.Tags[0].Value')

Above, we set some variables from the EC2 instance metadata endpoint and look up the ASG that we are a member of. We use jq to process the JSON that the AWS CLI returns.
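Two details of this step are easy to get wrong, so here is a hedged sketch of each. `imds_get` and `region_from_az` are hypothetical helper names, not part of the script above. The first uses an IMDSv2 session token, which the script above does not do; instances that enforce IMDSv2 answer plain, token-less requests with a 401. The second assumes standard zone names, where the region is the zone name minus its final letter.

```bash
#!/usr/bin/env bash

IMDS_BASE="http://169.254.169.254/latest"

# imds_get PATH
# Fetch one metadata path using an IMDSv2 session token.
# Only works on an EC2 instance; returns non-zero if either request fails.
imds_get() {
    local path="$1" token
    token="$(curl -sf -X PUT "${IMDS_BASE}/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 300")" || return 1
    curl -sf -H "X-aws-ec2-metadata-token: ${token}" \
        "${IMDS_BASE}/meta-data/${path}"
}

# region_from_az AZ
# Drop the trailing zone letter: us-east-1a -> us-east-1.
region_from_az() {
    local az="$1"
    printf '%s\n' "${az%?}"
}

# Example wiring (hypothetical):
# MY_AZ="$(imds_get placement/availability-zone)"
# export AWS_DEFAULT_REGION="$(region_from_az "$MY_AZ")"
```

This is written as plain bash. If it were dropped into a Terraform template like ours, each `${` would need to be written as `$${` so Terraform leaves it alone.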

Step 3. Ensuring the ASG is Correctly Populated 

# How many members should the ASG have?
# I have hardcoded this because there is no way to get this from EC2 tags. We can flow this through from Terraform vars if needed.
ASG_MEMBER_MAX=3

# Function for accessing the list of ASG members as a whitespace-delimited string
function get_asg_members() {
    local ASG_MEMBERS=$(aws ec2 describe-tags --filters "Name=key,Values=aws:autoscaling:groupName" "Name=value,Values=$ASG_NAME" | jq -r '.Tags[] | .ResourceId')
    echo "$ASG_MEMBERS"
}

# How many members are in this ASG?
ASG_MEMBERS_TOTAL=$(get_asg_members | wc -w)

# Is my ASG correctly populated? If not, wait.
until [ "$ASG_MEMBERS_TOTAL" -eq "$ASG_MEMBER_MAX" ]
do
    sleep 10
    ASG_MEMBERS_TOTAL=$(get_asg_members | wc -w)
done

echo "ASG is fully populated, electing leader..."

Above, we define how many members the ASG should have in total (3, in our case). Then we get a list of its current members and compare how many members should be in the group to how many are. 

This allows us to get around two problems:

  1. We want to be sure all of our nodes exist in the ASG before we elect a leader.  
  2. If there are MORE nodes in the ASG for whatever reason, we wait for them to go away before performing leader election. 

The second case is needed because a terminated instance in AWS will stick around for an indeterminate amount of time until EC2 removes it, and that can cause some issues if we don’t account for it. We loop, sleeping between checks, until there are exactly three members in the ASG. At that point, we can assume the ASG has settled down and we are ready to run leader election.
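The loop above waits forever if the ASG never reaches the expected size. If you would rather fail loudly, a bounded version might look like the sketch below. `wait_for_asg_size` is a hypothetical helper, not something from our script; it takes the counting command as arguments so it can be exercised without touching AWS.

```bash
#!/usr/bin/env bash

# wait_for_asg_size EXPECTED MAX_TRIES SLEEP_SECS COUNT_CMD [ARGS...]
# Runs COUNT_CMD until it prints EXPECTED, sleeping SLEEP_SECS between tries.
# Returns 0 once the count matches, 1 after MAX_TRIES unsuccessful checks.
wait_for_asg_size() {
    local expected="$1" max_tries="$2" sleep_secs="$3"
    shift 3
    local tries=0 current
    while (( tries < max_tries )); do
        current="$("$@")"
        # An empty result is treated as a count of zero.
        if [[ "${current:-0}" -eq "$expected" ]]; then
            return 0
        fi
        tries=$(( tries + 1 ))
        sleep "$sleep_secs"
    done
    return 1
}

# Example wiring (hypothetical), reusing names from the script above:
# count_members() { get_asg_members | wc -w; }
# wait_for_asg_size "$ASG_MEMBER_MAX" 60 10 count_members \
#     || { echo "ASG never settled" >&2; exit 1; }
```

With 60 tries and a 10 second sleep, a node would give up after roughly ten minutes instead of hanging in cloud-init indefinitely. Whether giving up is better than waiting depends on how you monitor failed boots.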

Step 4. Leader Election

# Get the list again, sort it alphabetically, then take the first member, which we call the leader
ASG_LEADER=$(echo $(get_asg_members) | tr ' ' '\n' | LC_ALL=C sort | head -n1)

# If we are the elected leader, run init; if not, run join
if [ "$MY_INSTANCE_ID" = "$ASG_LEADER" ]
then
    echo "I AM THE LEADER!"
    init_master
else
    echo "i am not the leader."
    join_master
fi

Finally, our simple leader election. What we do here is get a list of the instance IDs and sort them alphabetically. Then we take the first member of the list, and that is our leader! Because every node sorts the same list the same way, they all arrive at the same answer without talking to each other. Finally, we run init_master or join_master, depending on our status as the leader.

I told you it was simple. We just had to do a little work to get to the simple part. 
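If you want to convince yourself the election behaves, it can be pulled out into a pure function and run on any machine. A minimal sketch follows; `elect_leader` and `is_leader` are names made up for illustration. It uses a plain byte-order sort (`LC_ALL=C`) so the ordering is identical on every node regardless of locale settings.

```bash
#!/usr/bin/env bash

# elect_leader IDS...
# Prints the instance ID that sorts first. Accepts IDs as separate
# arguments or as one whitespace-delimited string.
elect_leader() {
    tr -s '[:space:]' '\n' <<<"$*" | sed '/^$/d' | LC_ALL=C sort | head -n 1
}

# is_leader MY_ID IDS...
# Succeeds only if MY_ID is the elected leader among IDS.
is_leader() {
    local my_id="$1"
    shift
    [ "$my_id" = "$(elect_leader "$@")" ]
}

# Example wiring (hypothetical; the other names come from the script above):
# if is_leader "$MY_INSTANCE_ID" $(get_asg_members); then
#     init_master
# else
#     join_master
# fi
```

Note that this rule alone would let a replacement node with a lower-sorting ID elect itself later on. The S3 check in Step 1 is what prevents that, so the two pieces need to stay together.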

After Leader Election: Making Sure They Clean Up After Themselves

Automatically Cleaning Up etcd When a Master is Replaced

Now we were able to elect a leader and bring up a three-master-node ASG, with one leader and two members. This is great. However, the first time I tested killing off a master node and having AWS replace it, I noticed the new master node did not join the cluster. Hmm.

A quick check of the logs revealed the problem. Kubeadm does some “pre-flight checks” before it will add a new node to a cluster. One of those checks the health of etcd, which is one of the services that the master nodes run.

Unfortunately, since we terminated the node in an unclean manner, that health check failed. Etcd contained an unhealthy member (the node we had just killed), so kubeadm bailed out. We needed to clean up after ourselves on shutdown.

Kubeadm Reset

The command we need to run is kubeadm reset. It has four phases, two of which we care about:

  1. remove-etcd-member – as one might expect, this removes the node from etcd
  2. update-cluster-status – this removes the node from the ClusterStatus object in the kubeadm-config configmap, which kubeadm join also uses

I tried running kubeadm reset before termination manually and that seemed to do the trick. Great! Now I just need to automate it. 

Automate With systemd

Ah, everyone’s favorite init system, systemd. For my use, I needed to run the two phases above while etcd was still running on the system (kubeadm runs etcd in docker by default). I also wanted the reset logic to happen only on a shutdown or terminate, but NOT on a reboot.

This turned out to be the most fiddly part of this solution and required a lot of trial and error. I ended up with a two-part solution: a kubeadm reset script and a kubeadm systemd unit file.

How To Configure The systemd Unit File

First, the unit file. 

[Unit]
Description=Run kubeadm reset at shutdown
After=network-online.target docker.service kubelet.service
Before=poweroff.target halt.target

[Service]
Type=oneshot
RemainAfterExit=true
ExecStop=/root/kubeadm-reset.sh
User=root
Group=root
TimeoutStopSec=120

[Install]
WantedBy=multi-user.target

Explaining systemd is beyond the scope of this post, but I’ll point out a few things about this service. This is a oneshot service, which means it runs once and exits. Combined with RemainAfterExit=true, that allows us to omit ExecStart: systemd treats the unit as active once it has started, and runs ExecStop when the unit is stopped at shutdown.

The After= line lists what kubeadm reset needs in order to run. Specifically, the network needs to be up (network-online.target), docker needs to be running (docker.service), and the kubelet (which starts etcd in our case) needs to be running. systemd stops units in the reverse of their start order, so ordering this unit after those services means its ExecStop runs before they are shut down.

The Before= line defines when we should run: in this case before power off or halt, but not reboot. I ran into a problem here. Theoretically, since I omitted reboot.target from this list, this service should NOT execute on reboot. But it did. So I had to handle that in the following shell script:

#!/bin/bash

# Ask systemd whether a reboot job is queued; if so, skip the reset.
REBOOT=$( systemctl list-jobs | grep -Eq 'reboot.target.*start' && echo "rebooting" || echo "not_rebooting" )

if [ "$REBOOT" = "not_rebooting" ]; then
    # Remove this node from etcd first, then from the kubeadm cluster status.
    /bin/kubeadm reset phase remove-etcd-member --kubeconfig=/etc/kubernetes/admin.conf --v=5 && /bin/kubeadm reset phase update-cluster-status --v=5
fi

This checks systemctl for a queued reboot.target job, and if we are not rebooting it runs the two kubeadm reset phases. For the record, I didn’t need to run the cleanup-node phase, which cleans up a bunch of data on the host, because the EC2 node was going to be terminated anyway.
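Neither file does anything until the unit is installed and, importantly, started: ExecStop only fires for a unit that is active, and enabling a unit during first boot does not start it. The sketch below shows one way to do that from an init script. `install_shutdown_unit` and the `kubeadm-reset.service` file name are assumptions made for illustration; the post does not say what we named the unit.

```bash
#!/usr/bin/env bash

# install_shutdown_unit UNIT_FILE [UNIT_DIR]
# Copies the unit file into systemd's unit directory, reloads systemd,
# then enables and starts the unit. Hypothetical helper.
install_shutdown_unit() {
    local unit_file="$1"
    local unit_dir="${2:-/etc/systemd/system}"
    local unit_name
    unit_name="$(basename "$unit_file")"

    install -m 0644 "$unit_file" "${unit_dir}/${unit_name}" || return 1
    systemctl daemon-reload || return 1
    # --now starts the unit immediately, so it is active (and its ExecStop
    # will run) even if the instance is terminated before its next boot.
    systemctl enable --now "$unit_name"
}

# Example wiring (hypothetical file name):
# chmod 0755 /root/kubeadm-reset.sh
# install_shutdown_unit /root/kubeadm-reset.service
```

After this runs, `systemctl status` for the unit should report it as active (exited), which is the state the RemainAfterExit=true setting is there to produce.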

Conclusion: A High Availability Multi-Master Kubernetes Cluster

Finally, with all of those pieces in place, we have a fully functioning high availability multi-master Kubernetes cluster that will automatically replace a dead master node and have the replacement successfully join the cluster. I hope this helps you save some time. Thanks for reading!

More Posts by Chris Tortorich: