How containers work in Linux – James Bottomley

Hypervisors are based on emulating hardware interfaces. Containers are about virtualising the OS services. I.e., containers will have only a single kernel, hypervisors and VMs have multiple separate kernels.

Containers have the advantage that a single kernel update benefits all guests. And the kernel is the most updated component. Containers are also more elastic, because they are typically smaller and the kernel has a better view of what happens in the guest so can make better scheduling and resource management decisions.

OS container (LXD) contains a full OS including an init system – it basically pretends to be a hypervisor. An application container contains just an application, without the init system and a large part of the shared libs and tools.

In a hypervisor, adding memory is easy but removing it is difficult. You have to go through a complex balloon driver to free up memory from a cooperating guest OS. In a container, scaling is instantaneous, just adapt the limits. However, this evolves because there is more and more hardware support to help hypervisors to achieve the same performance and elasticity as containers.

Containers can virtualise at different granularity levels, e.g. containment of networking could be disabled. But the orchestration systems (Docker, LXC, VZ, …) don’t expose this possibility, they always do full virtualisation. However, the granularity is the one thing where hypervisors can never be as good, so this is the thing where container evangelists (i.e. James) should focus on.

There are two kernel concepts that make containers: CGroups and namespaces. All container systems use the same API – originally there were out-of-tree patches to add different kernel APIs for each container system. In 2011 at the Kernel Summit it was agreed to converge to a single set of APIs. Thus, there is no repeat of the Xen/KVM split, where both hypervisor interfaces are supported in the kernel now. But all of this is still very very new, so it’s not going to work on many enterprise distros (RHEL6, SLES12).

CGroup systems: Block I/O, CPU, devices, memory, network, freezer. Namespaces: network, IPC, mount, PID, UTS (hostname), User (fake root). The user namespace still has lots of problems. The CGroup and namespace APIs are however very difficult to use.

cgroup is typically mounted on /sys/fs/cgroup, separately for each cgroup (with symlinks for historical interfaces). Each cgroup has a number of controls. You add a container by making a directory – the control interfaces appear magically in that directory. The tasks file contains the PIDs of the processes in that control group. Once some PIDs are in there, they can all be manipulated together by writing to the control files in the group directory. The directories are also hierarchical, so you can make subgroups (which obviously can only contain PIDs that are also in the parent).

To manipulate namespaces, there are the unshare and setns tools from util-linux. The namespaces can be found in /proc/pid/ns, which has symlinks to file descriptors for each namespace. You can see which processes share the same namespace by looking which ones point to the same descriptor. To create a namespace, use unshare (sometimes need to do this as root). Now you can bind-mount the namespace symlink to an empty file. This file can be used with nsenter from a different process to enter the same namespace. To release the namespace, the process that created it must first exit, the mount has to be unbound, and then the temporary file. The namespace will still exist until the last process that is in it has exited.

For network namespaces, ip has a subcommand to manipulate them: ip netns. To connect namespaces, you typically add a virtual ethernet device and add it to the namespace with ip link. This way you can do NFV.




Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s