Etcd In A High Availability Setup

What is ETCD? Distributed and reliable key-value store - this is Simple, Secure and Fast.
A key-value store has information stored in the form of documents or pages.
For example:
```
key      value
Name     John Doe
```
The above files can be in any format or structure.
Changes in one file does not affect other files.
When data becomes complex however, usually transact in JSON or YAML.
What does “Distributed” mean?
It is possible to have your data store across multiple servers.
- Have three servers running etcd. All running an identical copy of the database.
- If you lose one, you still have two copies of the data.
- You can write to any instance and read from any instance.
- etcd ensures that the same copy of data is available on all instances at the same time.
  - How does it do this?
    - Reads are no problem - all of the data is the same across all nodes and can be read from all nodes.
    - However with Writes, for example if write requests come in via two different instances? Which one goes through?
      - For example, one node gets Name John and the other node gets Name Joe. Cannot have two pieces of data on two different nodes.
      - etcd does not process the writes on each node. Only one of the instances is responsible for processing the writes.
      - Internally, the two nodes elect a leader among them.
      - In the instances, one node becomes the leader and the others become the followers.
        
        If the data comes in through the leader node, then the leader processes the write.
        
        If the data comes in through other nodes, but those nodes are not the leader, the data is forwarded onto the leader node itself and then the node processes the writes.
      - When writes a processed, the leader ensures that writes are distributed to other instances in the cluster.
A Write is only considered complete, if the leader gets consent from other members in the cluster.
How do they elect the leader among themselves? How do they ensure a write is propagated across all instances?
- ETCD does consensus via a RAFT system.
  - When the cluster is setup, we have three nodes that don’t have a leader elected.
    - The RAFT algorithm uses random timers for initiating requests.
      - A random timer is started on the three managers.
        
        The first one to finish the timer sends a request to the other nodes, requesting permission to become the leader.
        
        The other nodes receive the request and respond with their vote and the node assumes the leader role.
        
        The leader node then sends out pings at regular intervals, assuming the role of the leader.
        
        If the other nodes don’t receive an update from the leader (for example with connectivity issues), the nodes re-initiative an election process amongst themselves. A new leader is then identified.
A Write is only considered complete, when it is replicated to all other instances in the cluster.
- The ETCD cluster is highly available, so even if a node is lost, it should still function without issue.
  - If a node goes down, a Write is still considered “Complete”, if it can be written to the majority of nodes in the cluster.
    - If the node that is down comes back online, then the data is copied there as well.
The calculation for Quorum is N/2 + 1. Quorum is the minimum number of nodes available, for the cluster to function properly and make a successful write.
- For any size cluster, Quorum is the total amount of nodes, divided by 2, plus 1.
- A Quorum of 3 nodes = 3/2 + 1 = 2.5 ~=2.
  - If there is ever a .5 remaining, consider the whole number only.
- A Quorum of 5 nodes = 5/2 + 1 = 3.5 ~= 3.
- A Quorum of 1 node = 1
  - If you lose the 1 node, everything disappears.
- Quorum of 2 nodes = 2
  - Even if there are two instances in the cluster, the majority is still 2.
  - This doesn’t offer any value, as Quorum cannot be met.
    - That is why it is recommended to have at least 3 instances in an etcd cluster.
      - It offers a fault tolerance of at least 1 node.
Quorum chart:
When deciding on the number of Master Nodes, it is recommended to select an odd number. In the above table, this would be 3, 5 or 7.
- With an even number of nodes, there is the possibility of a cluster failing due to network segmentation and leave the cluster without quorum in certain scenarios.
  - No matter how the network segments, the cluster has a higher chance of staying alive.
To install etcd on a server:
- Download the latest available binary with wget.
- Extract the etcd-v<version_number>-linux-amd64.tar.gz tarball.

Move the file over:

mv etcd-v<version_number>-linux-amd64/etcd* /usr/local/bin/

Make the etcd directories:
```
mkdir -p '/etc/etcd /var/lib/etcd'
```
Copy the certificate:
```
cp ca.pem kubernetes-key
```
Then the etcd service needs to be configured.

Section of note is:

--initial-cluster peer-1=https://${PEER1_IP}:2380,peer-2=https://${PEER2_IP}:2380 \\

The above is how each etcd service knows it is part of a cluster and where its peers are.
Then once installed, use the etcdctl utility to store and retrieve data.
The etcdctl utility has two API versions, V2 and V3.
- The commands are different per version.
- Version 2 is default.
Set an environment variable to fix this the API versions and use V3:
```
export ETCDCTL_API=3 
```
Then run the etcdctl command and in this case the key-pair value is name john:
```
etcdctl put name john
```
To retrieve data, run:
```
etcdctl get name
```
This would retrieve the following:
```
etcdctl get name
```
To get all keys, run this command:
```
etcdctl get / --prefix --keys-only
```
How many nodes should an HA setup have? The answer is 3.
For the best quorum, use 5 instances and that gives you 3 quorum and then a fault tolerance of 2.
So ultimately we have 3 Master Nodes with ETCD running on them.
- Then we also have two Worker Nodes.
  - A Stacked Topology means the ETCD servers are on the Master Nodes as well.