Contents
Intro
Why Privileged Containers?
Why Complex Docker Images?
K8s-in-Docker with Sysbox
Automating things a bit with Kindbox
The K8s Node Container Image
K8s Node Inner Image Preloading
Wrapping Up
Intro
Docker containers are increasingly being used as a way to deploy Kubernetes (K8s)
clusters. In this setup, each Docker container acts as a K8s node, and the K8s
cluster is made up of a number of these containers (some acting as control-plane
nodes, others as worker nodes) connected to each other via a container
network.
Such a setup is ideal for development, local testing, and CI/CD, where the
efficiency and ease of use of containers make it particularly attractive
(e.g., it avoids the need to deploy heavier virtual machines for the same
purpose).
Tools like K8s.io KinD, initially developed to test K8s itself, are
now being used to create such K8s-in-Docker clusters for other purposes. While
this tool works well for local developer testing, it has important
limitations that make it less suitable for enterprise or shared environments,
such as:
Using insecure privileged containers.
Requiring complex container images.
Supporting only a limited set of cluster configurations.
This article describes the reasons behind these problems and how to overcome them
by using the Sysbox runtime to deploy K8s-in-Docker clusters with strong
isolation and simple docker run commands.
Cloud developers, QA engineers, DevOps, sys admins, or anyone who wishes to
deploy isolated K8s environments will find this info very useful. Note that this
article is specific to containers on Linux.
Why Privileged Containers?
As mentioned above, existing tools for deploying K8s-in-Docker use very insecure
privileged containers.
The reason for this is that the K8s components running inside the Docker
containers interact deeply with the Linux kernel. They do things like mount
filesystems, write to /proc, chroot, etc. These operations are not allowed in
a regular Docker container, but are allowed in a privileged container.
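For example, even as root inside a regular container, a simple mount is blocked
by Docker’s default seccomp and capability profile; the same command succeeds in
a privileged container:

$ docker run --rm alpine sh -c "mount -t tmpfs tmpfs /mnt"                # fails with a permission error
$ docker run --privileged --rm alpine sh -c "mount -t tmpfs tmpfs /mnt"   # succeeds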
However, privileged containers in many ways break isolation between the
container and the underlying host. For example, it’s trivial for software
running inside the container to modify system-wide kernel parameters and even
reboot the host by writing to /proc.
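To make this concrete: a single write to /proc from inside a privileged
container is enough to immediately reboot the host (needless to say, don’t try
this on a machine you care about):

$ docker run --privileged --rm alpine sh -c "echo b > /proc/sysrq-trigger"   # sysrq 'b' = immediate reboot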
Furthermore, since the container acting as a K8s node is privileged, an inner
K8s pod deployed with a privileged security policy also gets root access to the
underlying host. This forces you to trust not only the software running in the
K8s nodes, but also the software running in the pods within them.
And this lack of isolation is not just a security problem. It can also be a
functional one. For example, a privileged container has write access to
/proc/sys/kernel, where several non-namespaced kernel controls reside. There
is nothing preventing two or more privileged containers from writing conflicting
values to these kernel controls, potentially breaking functionality in subtle
ways that would be hard to debug.
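As a minimal illustration, two privileged containers can freely write
conflicting values to the same host-wide control, and whichever runs last wins:

$ docker run --privileged --rm alpine sysctl -w kernel.panic=10
$ docker run --privileged --rm alpine sysctl -w kernel.panic=0

Both writes hit the host’s (non-namespaced) kernel.panic setting, silently
undoing each other.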
If you are an individual developer deploying K8s-in-Docker on your laptop, using
privileged containers may not be a big deal (though it’s risky). But if you want
to use this in testing or CI/CD frameworks, it’s much more problematic.
Why Complex Docker Images?
Another limitation of existing K8s-in-Docker tools is that they require complex
container configurations: specialized container images with custom (and tricky)
entrypoints, as well as complex container run commands (with several host
volume mounts, etc.).
For example, the K8s.io KinD base image entrypoint
performs many clever but tricky configuration steps to set up the container’s
environment when it starts.
This ends up restricting the choices for the end-users, who must now rely on
complex base container images developed specifically for these tools, and who
are constrained to the K8s cluster configurations supported by the tools.
The reason for this complexity is that existing K8s-in-Docker tools rely on the
software stack made up of Docker / containerd and the OCI runc to create the
container. This stack was originally designed with the goal of using containers
as application packaging & deployment bundles (for which it does an excellent
job), but falls short of setting up the container properly to run software such
as K8s inside.
As a result, it’s up to the K8s-in-Docker tools to overcome these limitations,
but that results in complex Docker images and Docker run commands, which in turn
create restrictions for the end-users.
K8s-in-Docker with Sysbox
Wouldn’t it be great if a simple docker run could spawn a container inside
of which Kubernetes could run seamlessly and with proper container isolation?
The Sysbox container runtime makes this possible (for the first
time). It does so by setting up the container with strong isolation (via the
Linux user namespace) and in such a way that K8s finds all the kernel resources
it needs to run properly inside the container. That is, it fixes the problem at
the level of the container runtime (where the container abstraction is created).
This has the effect of significantly simplifying the Docker images and commands
required to deploy the containers that make up the cluster, which in turn
removes complexity and enhances flexibility for the end user.
For example, with Docker + Sysbox, deploying a K8s control-plane node can be
done with these simple commands:
1) Deploy the container that acts as the K8s control-plane:
$ docker run --runtime=sysbox-runc -d --name=k8s-control-plane --hostname=k8s-control-plane nestybox/k8s-node:v1.18.2
2) Ask the Kubeadm tool inside the container to initialize it as a K8s master node:
$ docker exec k8s-control-plane sh -c "kubeadm init --kubernetes-version=v1.18.2 --pod-network-cidr=10.244.0.0/16"
3) Configure kubectl on your host to control the cluster (assumes you’ve installed kubectl already):
$ mkdir -p $HOME/.kube && docker cp k8s-control-plane:/etc/kubernetes/admin.conf $HOME/.kube/config
4) Use kubectl to configure the desired K8s container network plugin:
$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
That’s it! With these steps you’ll have a K8s master configured in less than 1 minute.
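At this point you can verify the control-plane’s status from the host:

$ kubectl get nodes

The k8s-control-plane node should transition to the Ready state once the
flannel network plugin is up.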
Deploying a K8s worker node is even easier.
5) Deploy the container that acts as a worker node:
$ docker run --runtime=sysbox-runc -d --name=k8s-worker --hostname=k8s-worker nestybox/k8s-node:v1.18.2
6) In order for the worker to join the cluster, we need a token from the control-plane:
$ join_cmd=$(docker exec k8s-control-plane sh -c "kubeadm token create --print-join-command 2> /dev/null")
7) Ask the worker to join the cluster:
$ docker exec k8s-worker sh -c "$join_cmd"
That’s it! It takes < 15 seconds to add a worker node. To add more workers,
simply repeat steps 5 to 7.
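To sanity-check the cluster, deploy a test workload from the host and watch it
get scheduled on the worker:

$ kubectl create deployment nginx --image=nginx
$ kubectl get pods -o wide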
Notice the simplicity of the entire operation. The Docker commands to deploy the
nodes are all very simple, and the kubeadm tool makes it a breeze to set up K8s
inside the containers (with certificates and everything).
It’s also fast and efficient: it takes less than 2 minutes to deploy a 10-node
cluster on a laptop, with only 1GB of storage overhead (compared to 10GB if you
deploy this same cluster with the K8s.io KinD tool).
Moreover, the containers are strongly secured via the Linux user namespace.
You can confirm this with:
$ docker exec k8s-control-plane sh -c "cat /proc/self/uid_map"
0 165536 65536
which means that the root user (uid 0) in the K8s control-plane container maps
to unprivileged user 165536 on the host. No more privileged containers!
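Contrast this with a privileged container (assuming Docker’s default of no
userns-remap), whose uid_map shows an identity mapping, i.e., container root is
host root:

$ docker run --privileged --rm alpine cat /proc/self/uid_map
         0          0 4294967295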
And because you are using Docker commands to deploy the K8s cluster, you have
full control of the cluster’s configuration, allowing you to:
Choose the cluster’s topology to match your needs.
Choose any container image for the K8s nodes (and one which you fully control).
Choose different images for different nodes if you want.
Place the containers on the Docker network of your choice.
Deploy the cluster on a single host or across hosts using Docker overlay networks.
Resize the cluster easily.
Mount host volumes per your needs, etc.
In the near future, constrain the system resources assigned to each K8s node.
In short, you are only limited by Docker’s capabilities; the example below shows
one such option.
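For instance, here’s what placing the K8s node containers on a dedicated Docker
network might look like (the network and container names are just for
illustration):

$ docker network create k8s-net
$ docker run --runtime=sysbox-runc -d --name=k8s-control-plane --hostname=k8s-control-plane --net=k8s-net nestybox/k8s-node:v1.18.2
$ docker run --runtime=sysbox-runc -d --name=k8s-worker --hostname=k8s-worker --net=k8s-net nestybox/k8s-node:v1.18.2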
All of this simplicity comes by virtue of having the underlying container
runtime (Sysbox) take care of setting up the container properly to support
running K8s (as well as other system-level workloads) inside.
Automating things a bit with Kindbox
Though Sysbox enables you to deploy a K8s cluster with simple Docker commands,
it’s easier to have a higher-level tool do these steps for you.
There is one such tool here: https://github.com/nestybox/kindbox
It’s called Kindbox (i.e., Kubernetes-in-Docker + Sysbox), and it’s basically a
simple bash wrapper around Docker commands similar to those shown in the prior
section.
Kindbox supports commands such as creating a K8s-in-Docker cluster with a
configurable number of nodes, resizing the cluster, and deleting it. You can
deploy multiple clusters on the same host, knowing that they will be strongly
isolated by the Sysbox runtime.
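For instance, cluster lifecycle operations look roughly like this (the exact
flags are an assumption on our part; check the Kindbox README for the actual
syntax):

$ kindbox create --num-workers=4 mycluster
$ kindbox resize --num-workers=8 mycluster
$ kindbox destroy mycluster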
Check out this video to see how it works.
You may be asking: what’s the point of Kindbox if similar tools such as K8s.io
KinD exist already?
The answer is that while at a high level they do the same thing (i.e., manage
the creation of a K8s-in-Docker cluster), the way they go about it is very
different: K8s.io KinD requires complex Docker configurations and insecure
privileged containers due to limitations of the OCI runc. Kindbox, on the other
hand, uses simple Docker configurations and strongly isolated containers by
leveraging the capabilities of the Sysbox runtime.