Contents
Intro
Motivation
Setup
Why is Sysbox Useful Here?
Kubernetes Cluster Creation
Defining the Pods (with Docker inside)
Persistent Docker Cache
Deploying the Pods
Verify the Pods are Working
Exposing the Pods IP to the outside
Connecting Remotely to the Pods
Shared Docker Images across Docker Engines
Resource Limits
Scaling Pod Instances
Persistent Volume Removal
Docker Build Context
Conclusion
Resources
Intro
This post shows you how to run Docker inside a secure (rootless) Kubernetes
pod. That is, you create one or more Kubernetes pods and inside of each you
run Docker.
While running Docker inside pods is not new, what’s different here is that the
pod will not be an insecure “privileged” pod. Instead, it will be a fully
unprivileged (rootless) pod launched with Kubernetes and the
Sysbox runtime, which means you can use this setup in enterprise
settings where security is very important.
We will show you how to set this up quickly and easily with examples; afterwards
you can adjust the setup to your needs.
Motivation
There are several use cases for running Docker inside a Kubernetes pod; a
couple of useful ones are:
Creating a pool of Docker engines on the cloud. Each user is assigned one
such engine and connects remotely to it via the Docker CLI. Each Docker
engine runs inside a Kubernetes pod (instead of a VM), so operators can
leverage the power of Kubernetes to manage the pool’s resources.
Running Docker inside Kubernetes-native CI jobs. Each job is deployed inside
a pod and the job uses the Docker engine running inside the pod to build
container images (e.g., Buildkit), push them to some repo, run them, etc.
In this blog post we focus on the first use case. A future blog post will focus
on the second use case.
Setup
The diagram below shows the setup we will create:
As shown:
Kubernetes will deploy the pods with the Sysbox runtime.
Each pod will run a Docker engine and an SSH server inside it.
Each Docker engine will be assigned to a user (say a developer working from
home with a laptop).
The user will connect remotely to her assigned Docker engine using the
Docker CLI.
Why is Sysbox Useful Here?
Prior to Sysbox, the setup shown above required insecure “privileged” containers
or VM-based alternatives such as KubeVirt.
But privileged containers are too insecure, and VMs are slower, heavier, and
harder to set up (e.g., KubeVirt requires nested virtualization on the cloud).
With Sysbox, you can do this more easily and efficiently, using secure
(rootless) containers without resorting to VMs.
Kubernetes Cluster Creation
Ok, let’s get to it.
First, you need a Kubernetes cluster with Sysbox installed in it. It’s pretty easy
to set this up as Sysbox works on EKS, GKE, AKS, on-prem Kubernetes, etc.
See these instructions to install Sysbox on your cluster.
For this example, I am using a 3-node Kubernetes cluster on GKE, and I’ve
installed Sysbox on it with this single command:
kubectl apply -f https://raw.githubusercontent.com/nestybox/sysbox/master/sysbox-k8s-manifests/sysbox-install.yaml
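As a quick sanity check (assuming the install manifest registers the sysbox-runc
RuntimeClass that the pods below reference; if your Sysbox install docs differ,
follow those instead), you can verify the runtime class is now present:

$ kubectl get runtimeclass sysbox-runc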
Defining the Pods (with Docker inside)
Once Sysbox is installed on your cluster, the next step is to define the pods
that will carry the Docker engine inside them.
We need a container image that carries the Docker engine. In this example, we
use an image called nestybox/alpine-supervisord-docker:latest that carries
Alpine + Supervisord + sshd + Docker. The Dockerfile is
here.
NOTE: You can use another image if you would like. Just make sure that the
image is configured to start Docker and SSH inside the container
automatically.
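For reference, here is a rough sketch of what such an image could look like. This
is not the exact Dockerfile linked above: the package names and paths assume
Alpine, and you would still need to bake in SSH credentials (e.g., authorized
keys) for remote access.

$ cat supervisord.conf
[supervisord]
nodaemon=true

[program:dockerd]
command=/usr/bin/dockerd

[program:sshd]
command=/usr/sbin/sshd -D

$ cat Dockerfile
FROM alpine:3.16
# Docker engine + SSH server + supervisord to run both in the foreground;
# ssh-keygen -A pre-generates the SSH host keys.
RUN apk add --no-cache docker openssh supervisor && ssh-keygen -A
COPY supervisord.conf /etc/supervisord.conf
EXPOSE 22
CMD ["/usr/bin/supervisord", "-n", "-c", "/etc/supervisord.conf"]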
Next, let’s create a Kubernetes StatefulSet that will provision 6 pod instances
(i.e., 2 per node on our 3-node cluster). Each pod will allow remote access to
the Docker engine via ssh. Here is the associated yaml file:
$ cat dockerd-statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dockerd-statefulset
spec:
  selector:
    matchLabels:
      app: dockerd
  serviceName: "dockerd"
  replicas: 6
  template:
    metadata:
      labels:
        app: dockerd
      annotations:
        io.kubernetes.cri-o.userns-mode: "auto:size=65536"
    spec:
      runtimeClassName: sysbox-runc
      terminationGracePeriodSeconds: 20
      containers:
      - name: alpine-docker
        image: nestybox/alpine-supervisord-docker:latest
        ports:
        - containerPort: 22
          name: ssh
        volumeMounts:
        - name: docker-cache
          mountPath: /var/lib/docker
  volumeClaimTemplates:
  - metadata:
      name: docker-cache
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "gce-pd"
      resources:
        requests:
          storage: 2Gi
  podManagementPolicy: Parallel
Before we apply this yaml, let’s analyze a few things about it.
First, we chose a StatefulSet (instead of a Deployment) because we want each pod
to have a unique, stable network identity and persistent storage across its
life cycle. This way, if a pod goes down, we can recreate it and it will come back
with the same name and the same persistent storage assigned to it.
Second, note the following about the StatefulSet spec:
It creates 6 pods in parallel (see replicas and podManagementPolicy).
The pods are rootless by virtue of using Sysbox (see the cri-o annotation and
sysbox-runc runtimeClassName).
Each pod exposes port 22 (ssh).
Each pod has a persistent volume mounted onto the pod’s /var/lib/docker
directory (see next section).
Persistent Docker Cache
In the StatefulSet yaml shown above, we mounted a persistent volume on each
pod’s /var/lib/docker directory.
Doing this is optional, but enables us to preserve the state of the Docker engine
(aka “the Docker cache”) across the pod’s life cycle. This state includes pulled
images, Docker volumes and networks, and more. Without this, the Docker state
will be lost when the pod stops.
Note that each pod must have a dedicated volume for this. Multiple pods can’t
share the same volume because each Docker engine must have a dedicated cache
(it’s a Docker requirement).
Also, note that the persistent storage is provisioned dynamically (at pod
creation time, one volume per pod). This is done via a volumeClaimTemplate
directive, which claims a 2GiB volume of a storage class named “gce-pd”.
For this example, 2GiB is sufficient; for a production scenario, you’ll likely
need much more storage since Docker storage can add up over time when pulling
multiple images.
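As a rough way to keep an eye on this, once you are connected to a pod’s Docker
engine (see the sections below) you can inspect and reclaim the cache with the
standard Docker commands:

$ docker system df       # space used by images, containers, volumes, and build cache
$ docker system prune    # remove unused data (asks for confirmation)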
What is “gce-pd”? It’s a storage class that uses the Google Compute Engine (GCE)
storage provisioner. The resource definition is below:
$ cat gce-pd.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gce-pd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  fstype: ext4
  replication-type: none
volumeBindingMode: WaitForFirstConsumer
Since my cluster is on GKE, using the GCE storage provisioner makes
sense. Depending on your scenario, you can use any other
provisioner
supported by Kubernetes (e.g., AWS EBS, Azure Disk, etc.).
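For instance, on EKS a roughly equivalent StorageClass using the in-tree AWS EBS
provisioner could look like the sketch below (the name aws-ebs-pd is arbitrary;
you would reference it from the volumeClaimTemplate in place of gce-pd):

$ cat aws-ebs-pd.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: aws-ebs-pd
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer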
In addition, whenever we use volumeClaimTemplate, we must also define a dummy
local-storage class (as otherwise Kubernetes will fail to deploy the pod):
$ cat local-storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
Deploying the Pods
With this in place, we can now apply the yamls shown in the prior section.
$ kubectl apply -f gce-pd.yaml
$ kubectl apply -f local-storage.yaml
$ kubectl apply -f dockerd-statefulset.yaml
If all goes well, you should see the StatefulSet pods deployed within 10 to 20
seconds, as shown below:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
dockerd-statefulset-0 1/1 Running 0 9m51s
dockerd-statefulset-1 1/1 Running 0 9m51s
dockerd-statefulset-2 1/1 Running 0 9m51s
dockerd-statefulset-3 1/1 Running 0 9m51s
dockerd-statefulset-4 1/1 Running 0 9m51s
dockerd-statefulset-5 1/1 Running 0 9m51s
You should also see the persistent volumes that Kubernetes dynamically
allocated to the pods:
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-377c35d8-4075-4d40-9d26-7e4acd42cbea 2Gi RWO Delete Bound default/docker-cache-dockerd-statefulset-1 gce-pd 14m
pvc-5937a358-5111-4b91-9cce-87a8efabbb62 2Gi RWO Delete Bound default/docker-cache-dockerd-statefulset-3 gce-pd 14m
pvc-5ca2f6ba-627c-4b19-8cf0-775395868821 2Gi RWO Delete Bound default/docker-cache-dockerd-statefulset-4 gce-pd 14m
pvc-9812e3df-6d7e-439a-9702-03925af098a5 2Gi RWO Delete Bound default/docker-cache-dockerd-statefulset-0 gce-pd 14m
pvc-afd183ab-1621-44a1-aaf0-da0ccf9f96a8 2Gi RWO Delete Bound default/docker-cache-dockerd-statefulset-5 gce-pd 14m
pvc-e3f65dea-4f97-4c4b-a902-97bf67ed698b 2Gi RWO Delete Bound default/docker-cache-dockerd-statefulset-2 gce-pd 14m
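Each of these volumes is bound to a per-pod PersistentVolumeClaim generated from
the volumeClaimTemplate (named docker-cache-dockerd-statefulset-0 through -5, per
the CLAIM column above); you can list them with:

$ kubectl get pvc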
Verify the Pods are Working
Let’s exec into one of the pods to verify all is good:
$ kubectl exec dockerd-statefulset-0 -- ps
PID USER TIME COMMAND
1 root 0:00 {supervisord} /usr/bin/python3 /usr/bin/supervisord -n
14 root 0:00 /usr/bin/dockerd
15 root 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
45 root 0:02 containerd --config /var/run/docker/cont
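You can also confirm that the Docker engine inside the pod is responsive, for
example by asking it for its version (the exact output depends on the image):

$ kubectl exec dockerd-statefulset-0 -- docker version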