Introduction
One of the key properties of Nestybox system containers
is that they support running system-level software (such as Systemd and Docker)
without resorting to insecure privileged containers.
This is made possible by Nestybox’s container runtime, Sysbox,
which enables Docker to deploy system containers and sets up the
system container abstraction.
This article describes some important security features and benefits
of Nestybox system containers. These are all specific to Linux as we
don’t currently support system containers on other platforms.
Contents
Privileged Container Risks
System Container Isolation Features
Linux User Namespace and Exclusive User-ID mappings
Linux Capabilities
Restricted Device Exposure
System Container Security Benefits
Giving Unprivileged Users Access To A Docker Daemon
Inner Containers Have Two-Layers Of Isolation
More Work Remains
Conclusion
Try it for Free!
Privileged Container Risks
Since system containers provide an alternative to Docker privileged
containers for running system-level workloads, let’s recap some of the
risks of using privileged Docker containers (i.e., those running with
the Docker --privileged flag) and why it’s not a good idea to use
them in general.
Privileged Docker containers are typically used to deploy containers
that run workloads that require deep interaction with the underlying
kernel. For example, Docker requires them to run their official
Docker-in-Docker (DinD) image.
The main problem with Docker privileged containers is that they are
very insecure.
When you launch a Docker container with the --privileged flag, you
get a container whose root user is actually root on the host, has all
process capabilities enabled, has access to all host devices, and can
read or write system-wide kernel controls via procfs (/proc) and
sysfs (/sys).
In other words, a process within the container can easily gain
control of the host. For example, from within the privileged
container you can reboot the host by simply running:
$ echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger
This lack of isolation means that at a minimum, workloads that run
within the privileged container must be fully trusted. But even then,
any unintended action or bug in the container’s programs can damage
your host configuration.
Using privileged Docker containers is risky at best, and should be
avoided when possible.
System Container Isolation Features
Nestybox system containers provide a much more secure alternative to
Docker privileged containers.
They are designed to run the same workloads as privileged containers,
but with stronger isolation from the underlying host.
Below we briefly describe some of the key isolation features
currently present in Nestybox system containers.
Linux User Namespace and Exclusive User-ID mappings
Nestybox system containers always use all Linux namespaces for
enhanced isolation from the host and from other containers.
Of particular importance is the Linux user namespace, which works by
mapping privileged user-IDs (e.g., root) inside the namespace to
fully unprivileged user-IDs on the host.
This ensures that the root user inside the system container is only
privileged with respect to resources assigned to the container, but
has no privileges otherwise.
The Nestybox container runtime, Sysbox, creates a user-namespace for
each system container and configures each container with an exclusive
mapping of user-IDs (and group-IDs). This is done to isolate system
containers from the host as well as from each other.
For example, let’s launch a system container with Docker and the
Sysbox container runtime:
$ docker run --runtime=sysbox-runc -it alpine:latest
And let’s check the user namespace user-ID mapping for it:
/ # cat /proc/self/uid_map
0 296608 65536
The way to read this is that the system container’s users in the range
[0:65535] are mapped to the host user-IDs in the range [296608 : 296608+65535].
This mapping is configured by Sysbox.
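To make the mapping arithmetic concrete, here is a minimal shell sketch that translates a container user-ID into the corresponding host user-ID from one uid_map line. The helper name map_uid is hypothetical (it is not part of Sysbox or Docker), and the sketch assumes a single-line map like the one shown above.

```shell
# map_uid CONTAINER_UID "INSIDE_START OUTSIDE_START LENGTH"
# Hypothetical helper: prints the host user-ID that a container user-ID maps to.
# Assumes a single-line uid_map, like the one shown above.
map_uid() (
  uid=$1
  set -- $2                      # split the uid_map line into its three fields
  inside=$1; outside=$2; length=$3
  if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $((inside + length)) ]; then
    echo $((outside + uid - inside))
  else
    echo "uid $uid is not mapped" >&2
    exit 1                       # exits only the function's subshell
  fi
)

map_uid 0 "0 296608 65536"       # container root
map_uid 1000 "0 296608 65536"    # a regular user inside the container
```

With the map above, container root (user-ID 0) corresponds to host user-ID 296608, and container user-ID 1000 corresponds to host user-ID 297608. Both are ordinary unprivileged IDs on the host.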
Now let’s deploy another system container and check its user-ID
map:
$ docker run --runtime=sysbox-runc -it alpine:latest
/ # cat /proc/self/uid_map
0 362144 65536
Notice how Sysbox used different user-ID mappings for this new system
container. The same applies to the group-ID mappings (not shown above).
In other words, system containers deployed with Sysbox get an
exclusive user-ID range of 65536 unprivileged user-IDs on the host. We
use 65536 IDs per container for POSIX compliance.
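As a quick sanity check on the word "exclusive", the following sketch tests whether two host ID ranges share any ID. The helper name ranges_overlap is hypothetical, and the numbers are simply the ones from the two uid_map outputs above.

```shell
# ranges_overlap START_A LEN_A START_B LEN_B
# Hypothetical helper: succeeds (exit status 0) if the two host ID ranges share any ID.
ranges_overlap() {
  [ "$1" -lt $(($3 + $4)) ] && [ "$3" -lt $(($1 + $2)) ]
}

# Host ranges taken from the two uid_map outputs above.
if ranges_overlap 296608 65536 362144 65536; then
  echo "ranges overlap"
else
  echo "ranges are disjoint"
fi
```

The first range ends at host ID 362143 (296608 + 65535) and the second starts at 362144, so the two containers share no host user-IDs.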
Why does this matter? Because if a process inside a system container
somehow escapes the container’s root file system, it will find itself
without permissions to access any files on the host or in other
containers, thereby improving system security.
Linux Capabilities
By virtue of using the Linux user namespace, a root process in the
system container may be given all capabilities and the Linux kernel
ensures those capabilities only apply to resources assigned to the
system container (or more accurately, resources associated with the
Linux namespaces that combine to make up the system container).
In fact, the init process for the root user in a Nestybox system
container starts with all capabilities enabled:
$ docker run --runtime=sysbox-runc -it alpine:latest
/ # whoami
root
/ # cat /proc/self/status | grep -i cap
CapInh: 0000003fffffffff
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000003fffffffff
These capabilities only apply to resources associated with the system
container. In fact, processes in the system container have no
capabilities with respect to system-wide resources or resources
associated with other containers.
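The hexadecimal masks in that output can be decoded with a little shell arithmetic. The sketch below is only illustrative: count_caps and has_cap are hypothetical helper names, and the bit number used for CAP_SYS_ADMIN (21) comes from the kernel's linux/capability.h header rather than from this article.

```shell
# count_caps HEXMASK
# Hypothetical helper: prints how many capability bits are set in a mask
# taken from /proc/<pid>/status (e.g. the CapEff value).
count_caps() (
  mask=$((0x$1))
  count=0
  while [ "$mask" -gt 0 ]; do
    count=$((count + (mask & 1)))
    mask=$((mask >> 1))
  done
  echo "$count"
)

# has_cap HEXMASK BIT
# Succeeds if capability number BIT is set in the mask.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

count_caps 0000003fffffffff
has_cap 0000003fffffffff 21 && echo "CAP_SYS_ADMIN (bit 21) is set"
```

For the mask shown above, 38 bits are set (bits 0 through 37), so every capability that mask can represent is enabled, including CAP_SYS_ADMIN. As the text explains, those capabilities are still scoped to the container's user namespace.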
For example, below we repeat the same command shown earlier that allows a
privileged container to reboot the host (!), but this time from
within a system container:
/ # echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger
/bin/sh: can't create /proc/sys/kernel/sysrq: Permission denied
The Linux kernel prevents the access as it understands that sysrq is
a privileged system-wide resource and the root process in the system
container has no privileges to access it (even though it has full
capabilities within the container).
This ensures the system container processes are only allowed to
act on resources assigned to the system container, and can’t modify
system-wide settings.
Restricted Device Exposure
Earlier we mentioned that privileged Docker containers expose all host
devices inside the container, in essence giving the container full
control of the host’s physical and software devices.
In contrast, system containers expose far fewer devices.
For example, when deploying system containers with Docker, you
typically see only the following devices inside the system container:
/dev/null
/dev/zero
/dev/full
/dev/random
/dev/urandom
/dev/tty
/dev/console
/dev/pts
/dev/mqueue
/dev/shm
This reduced set of devices further helps isolate the system container
from the underlying host.
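One way to audit this from inside a container is to compare the contents of /dev against the list above. The following is a rough sketch under stated assumptions: unexpected_devices and EXPECTED_DEVICES are hypothetical names, and the expected list is copied from this article rather than from any Sysbox configuration.

```shell
# Device entries the article lists for a typical system container.
EXPECTED_DEVICES="null zero full random urandom tty console pts mqueue shm"

# unexpected_devices [DIR]
# Hypothetical helper: prints every entry in DIR (default /dev) that is
# not in EXPECTED_DEVICES, one per line.
unexpected_devices() (
  dir=${1:-/dev}
  for path in "$dir"/*; do
    [ -e "$path" ] || continue          # skip the unmatched glob in an empty dir
    name=${path##*/}
    case " $EXPECTED_DEVICES " in
      *" $name "*) ;;                   # expected, ignore
      *) echo "$name" ;;
    esac
  done
)

unexpected_devices /dev
```

Any name it prints is an entry outside the list above. Some of these may be harmless (for instance, symlinks such as /dev/stdout that container runtimes commonly create), so treat the output as a starting point for review rather than a verdict.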
System Container Security Benefits
The prior section described several features used by Nestybox system
containers to increase their isolation from the rest of the system.
This section describes other security benefits made possible by these
system containers.
Giving Unprivileged Users Access To A Docker Daemon
One of the security precautions used by the Docker daemon is to
prevent unprivileged users on a host from creating containers.
That is, in order to create containers on a host, the user must
either be the root user or belong to the docker group (and adding a
user to that group itself requires root privileges).
The reason unprivileged users are not allowed to create containers is
that the Docker daemon on the host runs as root (due to its deep
interactions with the Linux kernel). Allowing an unprivileged user to
create Docker containers would allow that user to easily gain root
access on the machine (e.g., by creating a privileged container).
While this restriction makes sense from a security perspective, it’s
burdensome on hosts shared by multiple users that want to use
Docker. It forces the system admin to either trust all users and give
them access to create Docker containers (which is equivalent to giving
them root access on the host), or have the sys admin create the
containers on behalf of the users.
Nestybox system containers offer an easy-to-use, efficient solution to
this problem: a sys admin can now create “docker sandboxes” using
system containers, and assign them to unprivileged users. Each sandbox
could be configured with systemd, Docker, and sshd as shown below:
Unprivileged users can then ssh into their sandbox and deploy Docker
containers within it in total isolation from the rest of the system
and without requiring root privileges on the host.
This approach solves the problem quickly and easily, and without
resorting to a heavy-weight solution such as deploying a VM.
This Nestybox blog post has more info on how
to deploy Docker sandboxes using system containers.
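As a rough illustration of the per-user sandbox idea, the sketch below prints one docker run command per user. It makes several assumptions: sandbox_cmd is a hypothetical helper, my-sandbox-image is a placeholder for an image you supply that bundles systemd, Docker, and sshd, and the sandbox-USER naming scheme is made up for this example. How users reach sshd (published ports, container IPs) is left out.

```shell
# sandbox_cmd USER IMAGE
# Hypothetical helper: prints (does not run) the docker command that would
# start one sandbox container for USER. IMAGE is a placeholder for an image
# you provide that bundles systemd, Docker, and sshd.
sandbox_cmd() {
  echo "docker run -d --runtime=sysbox-runc --name sandbox-$1 --hostname sandbox-$1 $2"
}

# Print the commands for two example users; review them before running.
for user in alice bob; do
  sandbox_cmd "$user" my-sandbox-image
done
```

Printing the commands instead of running them keeps the sketch safe to try on any machine. On a host with Sysbox installed, an admin could review the output and then execute it.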
Inner Containers Have Two-Layers Of Isolation
Another effect of running Docker inside a Nestybox system container is
that containers deployed inside the system container are under two
layers of isolation from the rest of the system (as is evident from
the figure shown above).
That is, when deploying a Docker container inside a system container,
processes inside the “inner container” are restricted by a combination
of the inner and outer container isolation mechanisms (e.g.,
namespaces, cgroups, system call whitelist, exposed devices, etc.).
This strengthens “defense-in-depth” on the host (i.e., escaping to the
host requires bypassing isolation mechanisms of the inner container
and the system container).
More Work Remains
While Nestybox sys