James's Ramblings

Kubernetes: Architecture

Created: September 12, 2020 (Updated: March 28, 2021)

Components

  • Master nodes.
  • Worker nodes.
  • Controllers.
  • Services.
  • Pods.
  • Namespaces and quotas.
  • Network and policies.
  • Storage.

Namespaces

  • All pods are in a Linux namespace, including system pods.
  • In addition to Linux namespaces, there are also API namespaces.
  • Every API call includes an implicit default namespace, unless another is explicitly specified.
  • A ResourceQuota object can be used to set hard resource limits on a namespace (see the sketch after this list).
    • Resources other than CPU and memory can be managed.
    • scopeSelector is a field in the quota spec that allows for prioritisation of Pods if priorityClassName is set in the Pod’s PodSpec.
  • Each node has a corresponding Lease object in the kube-node-lease namespace.
  • All containers in a Pod share the same network namespace.
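  • A minimal sketch of a ResourceQuota, assuming a hypothetical team-a namespace (all names and values here are illustrative):
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota
      namespace: team-a
    spec:
      hard:
        requests.cpu: "4"
        requests.memory: 4Gi
        limits.cpu: "8"
        limits.memory: 8Gi
        pods: "10"
      scopeSelector:                 # only count Pods whose priorityClassName is "high"
        matchExpressions:
        - operator: In
          scopeName: PriorityClass
          values: ["high"]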

Master node components

  • The components that ensure the current state of the cluster matches the desired state are called the control plane.

  • The control plane is currently formed of etcd, kube-apiserver, kube-scheduler, kube-controller-manager, and (optionally) cloud-controller-manager.

  • Control plane components running as pods are viewable with kubectl get pods -n kube-system.

etcd

etcd is a B+ tree-based key-value store used to hold the state of the cluster, networking, and other information.

  • Values are always appended to the end rather than immediately replaced.

  • Redundant data is marked for removal and is removed by a compaction process.

  • If simultaneous requests are made to the kube-apiserver to update a value, the first request updates the value, while the second fails with a 409 error because the version number no longer matches.

  • In a cluster, there is one etcd leader; the other instances are followers.

  • All etcd instances communicate with each other constantly to determine who the leader is and who its successor will be.

  • Only the kube-apiserver communicates with the etcd instances.

  • etcdctl can query the database.

  • calicoctl allows for in-depth viewing of network configuration.

  • Both components need to be installed independently of Kubernetes.

  • The Felix daemon, Calico’s main agent, monitors and manages each interface, routing, ACLs, and state reporting.

  • Felix uses the BIRD daemon to propagate dynamic IP routing information to other nodes.

  • etcd can be backed up using the snapshot save and snapshot restore subcommands of etcdctl, as sketched below.
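
  • A minimal backup/restore sketch, assuming etcd listens on 127.0.0.1:2379 and the default kubeadm certificate paths (the snapshot and data-dir paths are illustrative):
    # Take a snapshot of the running etcd database
    ETCDCTL_API=3 etcdctl snapshot save /tmp/etcd-backup.db \
        --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key

    # Restore the snapshot into a fresh data directory
    ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-backup.db \
        --data-dir=/var/lib/etcd-restored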

kube-apiserver

kube-apiserver is the master process for the cluster and handles all calls, both internal and external.

  • kube-apiserver is the only connection to etcd.

  • The Kubernetes API provides a CRUD interface for querying and modifying the cluster state over a RESTful API.

  • When the Kubernetes API server receives a message, it goes through three phases before making changes to etcd: authentication, authorization, and admission.

  • In the authentication phase, the Kubernetes API ascertains which entity the message has come from.

  • In the authorization phase, the Kubernetes API ascertains whether the entity is allowed to perform the message action.

  • If the message specifies a CRUD action (rather than simply reading), it goes through the admission control phase.

  • In the admission control phase, the Kubernetes API iterates through plugins and allows them to manipulate the message action. Once completed, the API server validates the resultant object, stores it in etcd, and returns a response.

  • kubectl interacts with the Kubernetes API to perform actions.

  • Each Kubernetes control plane component only communicates directly with the API server.

  • YAML is converted to JSON before it is sent to the API server.
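
  • The API traffic can be observed by raising kubectl’s log verbosity, or by calling API paths directly:
    kubectl get pods -v=8                              # logs the HTTP requests kubectl makes
    kubectl get --raw /api/v1/namespaces/default/pods  # query the REST API directly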

kube-scheduler

kube-scheduler is responsible for scheduling Pods to run on nodes.

  • An algorithm is used to determine destination nodes for Pods.
  • The scheduler is responsible for binding resources, such as volumes.
  • The scheduler continuously tries to deploy a Pod until successful, unless there is no availability.
  • The algorithm is impacted by variables such as node affinity, taints, tolerations, and labels.
  • A custom scheduler can be used instead of kube-scheduler.
  • Pods can be bound to specific nodes (see the sketch after this list).
  • Scheduler source code.
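  • A minimal sketch of binding a Pod to a named node, bypassing the scheduler entirely (the node name and image are illustrative):
    apiVersion: v1
    kind: Pod
    metadata:
      name: pinned-pod
    spec:
      nodeName: worker-1               # hypothetical node; setting this skips kube-scheduler
      # schedulerName: my-scheduler    # alternatively, hand the Pod to a custom scheduler
      containers:
      - name: app
        image: nginx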

kube-controller-manager

kube-controller-manager interacts with the kube-apiserver to determine the current state of the cluster. If the current state diverges from the desired state, the manager contacts a controller to attempt to make the necessary adjustments.

  • Controllers include endpoints, namespaces, and replication.

cloud-controller-manager

Manages controllers associated with the built-in cloud providers.

  • kube-controller-manager used to handle these tasks.

  • cloud-controller-manager was created because the pace of cloud provider change was too rapid for the main Kubernetes project.

CoreDNS

CoreDNS handles DNS queries, service discovery, and other tasks.

  • CoreDNS replaces kube-dns.
  • It is easily extendable.

Components that run on all nodes

kubelet

  • The kubelet agent listens for PodSpec API calls.

  • When it receives a call, it makes changes on a worker node until the current state matches the PodSpec.

  • kubelet communicates with container engines on the local machine to achieve manipulation of Pods.

  • The agent also handles volume mounts to Pods and downloads of secrets/ConfigMaps.

  • Messages are sent back to the kube-apiserver to report status and to ensure persistence.

  • kubelet handles the creation of Pods found in /etc/kubernetes/manifests/.

  • When using kubeadm, the kubelet agent runs as a systemd service.

  • Topology Manager is an additional alpha feature that kubelet can call. Topology Manager communicates with other components to configure optimal resource (e.g. CPU) assignments for the cluster.

kube-proxy

  • kube-proxy handles network communication for Services on each node using firewall routing rules (currently iptables, with an ipvs mode available).

  • Appears as a kube-system pod.
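
  • The rules kube-proxy programs can be inspected on a node, assuming the default iptables mode:
    sudo iptables -t nat -L KUBE-SERVICES -n | head   # NAT chain kube-proxy maintains for Services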

Container runtime

  • Docker, containerd, Podman, etc.

PodSpec

  • The resources section of a PodSpec can be used to determine resource allocation for the underlying container.
    resources:
      limits:
        cpu: "2"
        memory: "2Gi"
      requests:
        cpu: "1"
        memory: "1Gi"
    

Init Containers

  • An Init container must start successfully before application containers can start.

  • If an Init container doesn’t start, the application container will not start.

  • If an Init container fails, Kubernetes restarts it until it succeeds, unless restartPolicy is Never.

  • An Init container can have a different storage and security context than application containers.

  • An Init container might have access to diagnostic tools or scripts.

  • Example:

spec:
  containers:
  - name: web-server
    image: httpd
  initContainers:
  - name: init-web-server
    image: busybox
    command: ['sh', '-c', 'for i in $(seq 1 100); do sleep 1; if nslookup databaseService; then exit 0; fi; done; exit 1']

Services

  • Services connect Pods together and connect Pods to networks outside the cluster.

  • If a Pod dies and is replaced, services will reconnect with the replacement.

  • Services also handle Pod access policies.

  • Services decouple agents and objects.

  • Each Service handles a particular subset of traffic; a LoadBalancer Service, for example, exposes Pods to external traffic.

  • kube-controller-manager monitors for the need to create, update, or delete Services and endpoints.
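
  • A minimal sketch of a ClusterIP Service selecting Pods by label (names, labels, and ports are illustrative):
    apiVersion: v1
    kind: Service
    metadata:
      name: web-service
    spec:
      selector:
        app: web                 # matches Pods labelled app=web
      ports:
      - protocol: TCP
        port: 80                 # the Service's own port
        targetPort: 8080         # the matched Pods' container port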

Endpoints

  • Endpoints are created at the same time as Services.
  • Separate endpoints are required for using both IPv4 and IPv6.
  • Endpoints use a Pod’s IP address, but also include a port.
  • Services map high-numbered ports to endpoints using kube-proxy’s iptables or ipvs rules.
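  • The endpoints behind a Service can be inspected directly (web-service is the hypothetical Service sketched above):
    kubectl get endpoints web-service        # lists the backing Pod IP:port pairs
    kubectl describe endpoints web-service   # per-subset addresses and ports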

Controllers

  • Controllers perform the task of assembling Workqueues of changes to hand to workers.
  • They are formed of an Informer (agent) and a downstream store.
  • The Informer calls the API server to request the state of an object, which is then cached.
  • A SharedInformer is an alternative that creates a shared cache.
  • A controller compares the source and downstream with a DeltaFIFO queue.
  • An object, comprising an array of deltas and the result of the previous comparison, is sent to a loop process.
  • The controller logic modifies the object until it matches specifications, unless the type is Deleted.
  • Types of controllers include endpoints, namespace, and serviceaccounts.

Pods

  • Containers in a Pod are started in parallel; startup order cannot be guaranteed.

  • Init containers allow for some manipulation of startup order.

  • It’s often necessary to have multiple containers in a single Pod, for example, to support logging.

  • A sidecar is a container that performs a secondary assistive function in a Pod, such as logging (see the sketch after this list).

  • Most network plugins only allow a Pod to have one IP address. There is a network plugin from HPE Labs that allows multiple IP addresses.

  • As such, containers that share a Pod must communicate with IPC, loopback, or a shared filesystem.

  • Kubernetes’ main focus is the container lifecycle rather than managing containers. Managing containers is left to the container engine.
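
  • A minimal sketch of a logging sidecar sharing a volume with the main container (names, images, and paths are illustrative):
    apiVersion: v1
    kind: Pod
    metadata:
      name: web-with-sidecar
    spec:
      containers:
      - name: web
        image: nginx
        volumeMounts:
        - name: logs
          mountPath: /var/log/nginx      # the web server writes its logs here
      - name: log-tailer                 # the sidecar: streams the main container's logs
        image: busybox
        command: ['sh', '-c', 'tail -F /logs/access.log']
        volumeMounts:
        - name: logs
          mountPath: /logs
      volumes:
      - name: logs
        emptyDir: {}                     # shared scratch volume that lives as long as the Pod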

Pod networking

  • IP addresses are assigned before containers start.
  • Container interfaces look like eth0@tun10.
  • IPs are persistent for a Pod’s lifetime.
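  • A Pod’s assigned IP can be checked from outside and inside the Pod (POD is a placeholder; the ip tool must exist in the image):
    kubectl get pods -o wide        # the IP column shows each Pod's address
    kubectl exec POD -- ip addr     # eth0 inside the Pod reports the same IP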

Pod-to-Pod communication

Kubernetes requires:

  • All Pods on all nodes can communicate with each other.
  • All nodes can communicate with all pods.
  • No NAT.

Nodes

  • Master nodes are created with kubeadm init (see the sketch after this list).

  • Worker nodes are created with kubeadm join.

  • If kube-apiserver and kubelet cannot communicate for 5 minutes, the cluster will schedule the node’s Pods elsewhere. Once communication resumes, the stale Pods on the node will be evicted.

  • Each node has a corresponding Lease object in the kube-node-lease namespace.

  • To delete a node:
    kubectl delete node NODE      # pods are evicted
    kubeadm reset                 # remove cluster info
    
  • iptables rules may be left over and need to be deleted manually.
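
  • A minimal bootstrap sketch, with placeholder values in angle brackets:
    # On the control plane node
    kubeadm init --pod-network-cidr=192.168.0.0/16

    # On each worker, using the token and hash printed by kubeadm init
    kubeadm join <control-plane-ip>:6443 --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash>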

Networking

Main deployment configurations

  • Single-node.
  • Single control plane node, multiple workers.
  • Multiple control plane nodes with HA, multiple workers.
  • HA etcd, HA control plane nodes, multiple workers.

High availability

  • Each control plane component can be replicated to achieve high availability. However, the scheduler and controller manager must remain on standby on non-leader control plane nodes.

  • A quorum of control plane nodes decides on a leader. The number of control plane nodes must be odd so that votes cannot be split.

  • The current leader is shown as the value of holderIdentity, in the control-plane.alpha.kubernetes.io/leader annotation, displayed using the command:
    kubectl get endpoints kube-scheduler -n kube-system -o yaml
    
  • etcd requires special configuration for HA.
  • A majority of etcd members must remain consistent in order to maintain quorum.
  • etcd can be in a stacked or external topology.

A stacked etcd topology

  • In a stacked topology, an etcd pod is created on each control plane node. Each etcd only communicates with the kube-apiserver on the same node.

  • A stacked topology is simpler to set up and simpler to manage. However, if a node goes down, both a control plane instance and etcd member are lost, increasing the chances of cluster failure.

  • The Kubernetes documentation recommends a minimum of three stacked control plane nodes for a HA cluster.

An external etcd topology

  • In an external topology, etcd members are external to the cluster. Each etcd only communicates with the kube-apiserver of each control plane node.

  • An external topology is more durable to loss of control plane instances or etcd members than a stacked topology.

  • An external etcd topology requires a minimum of six hosts; three hosts for control plane nodes and three hosts for etcd nodes.