Source: Nestybox Blog

Nestybox Blog System Container Security Features

Introduction

One of the key properties of Nestybox system containers is that they support running system-level software (such as Systemd and Docker) without resorting to insecure privileged containers. This is made possible by Nestybox's container runtime Sysbox, which enables Docker to deploy system containers and sets up the system container abstraction.

This article describes some important security features and benefits of Nestybox system containers. These are all specific to Linux, as we don't currently support system containers on other platforms.

Contents

  Privileged Container Risks
  System Container Isolation Features
    Linux User Namespace and Exclusive User-ID mappings
    Linux Capabilities
    Restricted Device Exposure
  System Container Security Benefits
    Giving Unprivileged Users Access To A Docker Daemon
    Inner Containers Have Two-Layers Of Isolation
  More Work Remains
  Conclusion
  Try it for Free!

Privileged Container Risks

Since system containers provide an alternative to Docker privileged containers for running system-level workloads, let's recap some of the risks of using privileged Docker containers (i.e., those running with the Docker --privileged flag) and why it's generally not a good idea to use them.

Privileged Docker containers are typically used to deploy workloads that require deep interaction with the underlying kernel. For example, Docker requires them to run their official Docker-in-Docker (DinD) image.

The main problem with Docker privileged containers is that they are very insecure. When you launch a Docker container with the --privileged flag, you get a container whose root user is actually root on the host, has all process capabilities enabled, has access to all host devices, and can read or write system-wide kernel controls via procfs (/proc) and sysfs (/sys). In other words, a process within the container can easily gain control of the host.
For example, from within the privileged container you can reboot the host by simply running:

    $ echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger

This lack of isolation means that, at a minimum, workloads that run within the privileged container must be fully trusted. But even then, any unintended action or bug in the container's programs can mess up your host configuration. Using privileged Docker containers is risky at best, and should be avoided when possible.

System Container Isolation Features

Nestybox system containers provide a much more secure alternative to Docker privileged containers. They are designed to run the same workloads as privileged containers, but with stronger isolation from the underlying host. Below we briefly describe some of the key isolation features currently present in Nestybox system containers.

Linux User Namespace and Exclusive User-ID mappings

Nestybox system containers always use all Linux namespaces for enhanced isolation from the host and from other containers. Of particular importance is the Linux user namespace, which works by mapping privileged user-IDs (e.g., root) inside the namespace to fully unprivileged user-IDs on the host. This ensures that the root user inside the system container is only privileged with respect to resources assigned to the container, but has no privileges otherwise.

The Nestybox container runtime, Sysbox, creates a user namespace for each system container and configures each container with an exclusive mapping of user-IDs (and group-IDs). This is done to isolate system containers from the host as well as from each other.
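To make the idea of a user-ID mapping concrete, here is a minimal shell sketch of the arithmetic involved. It assumes the three-field entry format the Linux kernel uses in /proc/<pid>/uid_map (first ID inside the namespace, first ID on the host, number of IDs). The helper names map_uid and ranges_disjoint and the sample values are hypothetical, chosen only for illustration; they are not part of Sysbox or Docker.

```sh
#!/bin/sh
# Sketch only: helper names and sample values are illustrative assumptions.

# Translate a user-ID inside a user namespace to the matching host user-ID.
#   $1: one uid_map entry, "<first inside ID> <first host ID> <count>"
#   $2: user-ID inside the namespace
map_uid() {
    uid=$2
    set -- $1    # split the entry into $1=inside, $2=host, $3=count
    if [ "$uid" -ge "$1" ] && [ "$uid" -lt $(( $1 + $3 )) ]; then
        echo $(( $2 + uid - $1 ))
    else
        echo "unmapped"
    fi
}

# Succeed (exit status 0) when two entries occupy disjoint host ID ranges,
# which is the property an exclusive per-container mapping relies on.
ranges_disjoint() {
    set -- $1 $2    # six fields: in1 host1 n1 in2 host2 n2
    [ $(( $2 + $3 )) -le "$5" ] || [ $(( $5 + $6 )) -le "$2" ]
}

# Usage with made-up entries for two containers:
map_uid "0 100000 65536" 0       # root in the container -> host ID 100000
map_uid "0 100000 65536" 1000    # -> host ID 101000
ranges_disjoint "0 100000 65536" "0 165536 65536" && echo "exclusive ranges"
```

With mappings like these, root inside either container corresponds to an ordinary unprivileged ID on the host, and no host ID is shared between the two containers.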
For example, let's launch a system container with Docker and the Sysbox container runtime:

    $ docker run --runtime=sysbox-runc -it alpine:latest

And let's check the user namespace user-ID mapping for it:

    / # cat /proc/self/uid_map
    0 296608 65536

The way to read this is that the system container's users in the range [0:65535] are mapped to the host user-IDs in the range [296608 : 296608+65535]. This mapping is configured by Sysbox.

Now let's deploy another system container and check its user-ID map:

    $ docker run --runtime=sysbox-runc -it alpine:latest

    / # cat /proc/self/uid_map
    0 362144 65536

Notice how Sysbox used a different user-ID mapping for this new system container. The same applies to the group-ID mappings (not shown above). In other words, each system container deployed with Sysbox gets an exclusive range of 65536 unprivileged user-IDs on the host. We use 65536 IDs per container for POSIX compliance.

Why does this matter? Because if a process inside a system container somehow escapes the container's root file system, it will find itself without permissions to access files owned by host users or by other containers, thereby improving system security.

Linux Capabilities

By virtue of using the Linux user namespace, a root process in the system container may be given all capabilities, and the Linux kernel ensures those capabilities only apply to resources assigned to the system container (or more accurately, resources associated with the Linux namespaces that combine to make up the system container).

In fact, the init process for the root user in a Nestybox system container starts with all capabilities enabled:

    $ docker run --runtime=sysbox-runc -it alpine:latest

    / # whoami
    root

    / # cat /proc/self/status | grep -i cap
    CapInh: 0000003fffffffff
    CapPrm: 0000003fffffffff
    CapEff: 0000003fffffffff
    CapBnd: 0000003fffffffff
    CapAmb: 0000003fffffffff

These capabilities only apply to resources associated with the system container.
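The capability values above are hexadecimal bitmaps with one bit per capability. As a rough sketch of how to read them, the shell helpers below count the bits set in a mask and test a single bit. The names count_caps and has_cap are hypothetical; the capability number 21 corresponds to CAP_SYS_ADMIN in the kernel's linux/capability.h.

```sh
#!/bin/sh
# Sketch only: count_caps and has_cap are illustrative helper names.

# Print how many capability bits are set in a hex mask, such as a CapEff
# value taken from /proc/<pid>/status.
count_caps() {
    mask=$(( 0x$1 ))
    n=0
    while [ "$mask" -ne 0 ]; do
        n=$(( n + (mask & 1) ))
        mask=$(( mask >> 1 ))
    done
    echo "$n"
}

# Succeed (exit status 0) if capability number $2 is set in hex mask $1.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Usage with the mask shown above:
count_caps 0000003fffffffff                            # prints 38
has_cap 0000003fffffffff 21 && echo "CAP_SYS_ADMIN is set"
```

A full mask like this one says the process holds every capability within its own user namespace; on its own it says nothing about privileges over host-wide resources.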
In fact, processes in the system container have no capabilities with respect to system-wide resources or resources associated with other containers.

For example, below we repeat the same command shown earlier that allows a privileged container to reboot the host (!), but this time from within a system container:

    / # echo 1 > /proc/sys/kernel/sysrq && echo b > /proc/sysrq-trigger
    /bin/sh: can't create /proc/sys/kernel/sysrq: Permission denied

The Linux kernel prevents the access because it understands that sysrq is a privileged system-wide resource and that the root process in the system container has no privileges to access it (even though it has full capabilities within the container). This ensures the system container processes are only allowed to act on resources assigned to the system container, and can't modify system-wide settings.

Restricted Device Exposure

Earlier we mentioned that privileged Docker containers expose all host devices inside the container, in essence giving the container full control of the host's physical and software devices.

In contrast, system containers expose far fewer devices. For example, when deploying system containers with Docker, you typically see only the following devices inside the system container:

    /dev/null
    /dev/zero
    /dev/full
    /dev/random
    /dev/urandom
    /dev/tty
    /dev/console
    /dev/pts
    /dev/mqueue
    /dev/shm

This reduced set of devices further helps isolate the system container from the underlying host.

System Container Security Benefits

The prior section described several features used by Nestybox system containers to increase their isolation from the rest of the system. This section describes other security benefits made possible by these system containers.

Giving Unprivileged Users Access To A Docker Daemon

One of the security precautions used by the Docker daemon is to prevent unprivileged users on a host from creating containers.
That is, in order to create containers on a host, the user must be either the root user or belong to the docker group (and adding a user to that group requires root privileges).

The reason unprivileged users are not allowed to create containers is that the Docker daemon on the host runs as root (due to its deep interactions with the Linux kernel). Allowing an unprivileged user to create Docker containers would allow that user to easily gain root access on the machine (e.g., by creating a privileged container).

While this restriction makes sense from a security perspective, it's burdensome on hosts shared by multiple users who want to use Docker. It forces the system admin to either trust all users and give them access to create Docker containers (which is equivalent to giving them root access on the host), or create the containers on behalf of the users.

Nestybox system containers offer an easy-to-use, efficient solution to this problem: a system admin can now create "docker sandboxes" using system containers, and assign them to unprivileged users. Each sandbox could be configured with systemd, Docker, and sshd as shown below:

Unprivileged users can then ssh into their sandbox and deploy Docker containers within it in total isolation from the rest of the system and without requiring root privileges on the host.

This approach solves the problem quickly and easily, without resorting to a heavyweight solution such as deploying a VM. This Nestybox blog post has more info on how to deploy Docker sandboxes using system containers.

Inner Containers Have Two-Layers Of Isolation

Another effect of running Docker inside a Nestybox system container is that containers deployed inside the system container are under two layers of isolation from the rest of the system (as is evident from the figure shown above).
That is, when deploying a Docker container inside a system container, processes inside the "inner container" are restricted by a combination of the inner and outer container isolation mechanisms (e.g., namespaces, cgroups, system call whitelist, exposed devices, etc.).

This strengthens "defense-in-depth" on the host: escaping to the host requires bypassing the isolation mechanisms of both the inner container and the system container.

More Work Remains

While Nestybox sys
