Live atomic updates – Installing new software without the need for packages or a reboot (Richard Maw)

Live Atomic update:

  • Live == Online == Update while it is running.
  • Atomic == all or nothing; if the update fails, you go back to the old version.

Atomic updates are traditionally done by restarting the whole system, using either A-B partitioning or a rescue partition. Richard works on Baserock, where the atomic update is done by extracting the new bits into a cloned btrfs subvolume and then rebooting into that new subvolume.

Why atomic? It helps support, because they know exactly what you’re running, and you don’t risk rendering the system unusable if the update is interrupted – per-file atomic writes are not sufficient for that, so you need an atomic filesystem update.

In an atomic filesystem update, you first create the new version of the filesystem. This is done with a combination of btrfs subvolumes and bind-mounting. Then the mount tree is reproduced, and a pivot_root is applied to switch to the new rootfs. But of course the old processes still point to the old versions on the old rootfs.
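
To make the mechanism concrete, here is a minimal C sketch of that switch-over step – illustrative only, with made-up paths like /run/newroot, and not Baserock’s actual implementation:

```c
/* Sketch: bind-mount the prepared tree (e.g. a btrfs subvolume with the
 * updated system), reproduce the interesting mounts inside it, and
 * pivot_root into it.  Requires root and an existing directory
 * /run/newroot/mnt/oldroot; paths and error handling are illustrative. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mount.h>
#include <sys/syscall.h>

static void die(const char *msg) { perror(msg); exit(1); }

int main(void)
{
    /* 1. Make the new filesystem tree a mount point of its own. */
    if (mount("/run/newroot", "/run/newroot", NULL, MS_BIND, NULL) < 0)
        die("bind mount new root");

    /* 2. Reproduce the existing mount tree inside the new root
     *    (the same would be done for /sys, /dev, /var, ...). */
    if (mount("/proc", "/run/newroot/proc", NULL, MS_BIND | MS_REC, NULL) < 0)
        die("bind /proc");

    /* 3. Swap roots: new processes now start in the updated system,
     *    but already-running processes keep the old root. */
    if (syscall(SYS_pivot_root, "/run/newroot", "/run/newroot/mnt/oldroot") < 0)
        die("pivot_root");

    if (chdir("/") < 0)
        die("chdir");
    return 0;
}
```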

Richard’s first idea was to use ptrace to chroot the processes and to reopen all their fds. Problems: not all processes can be ptraced, not all processes are allowed to execute chroot, and some processes (e.g. journald) cache inode numbers, which will change after reopening.

An alternative to pivot_root is to use renameat, but that doesn’t solve migrating the processes.

Alternative approach: fake atomicity by making a filesystem transaction that would be rolled back on failure. That turns out not to work because the API isn’t powerful enough.

Yet another alternative: use freeze so you can restore to the old context on (power) failure.

Alternative approach 3: let init propagate the migration down, and keep the old root around until all processes using it have terminated.

Alternative approach 4: use a layer in between (e.g. aufs) so you just add the new layer on top.

Actually, it turns out that most services will have to be restarted anyway (perhaps gracefully without dropping connections, like Apache can do). So it’s mainly for handing over the shells, and there the ptrace approach can be used.

What could microkernels learn from monolithic kernels and vice versa (Martin Děcký)

Martin is an OS researcher and co-author of HelenOS since 2005, but also an active user and occasional contributor to GNU/Linux. HelenOS is a portable microkernel and multiserver OS. It runs e.g. on Beagle* boards and the RPi, with ext4, IPv6 and a compositing GUI.

First of all, we should realize that there is no iron curtain between microkernels and monolithic kernels; rather, there is a spectrum between the micro and monolithic aspects. For instance, core memory management is often considered part of the kernel, but e.g. seL4 does the memory management policy in user space. Also, it’s a multidimensional space, with e.g. isolation, componentization and dynamicity as dimensions.

A new kid on the block is OSv, which is designed specifically for VMs. It builds on the concept that Antti highlighted in his keynote speech: it removes layers of abstraction. It has the API necessary to run POSIX applications, but with a single user, a single process, a single image and a single address space. It is the hypervisor that takes care of the necessary isolation etc.

This multidimensional space exists because different people have different needs. So all these systems can exist peacefully side by side.

Monolithic kernels also learn from microkernels. For instance, they make it possible to move things to userspace: FUSE, libusb, tun devices. The performance penalty can still be handled, e.g. avoiding memory copies. On the other hand, there are also things moving in the other direction, like KMS.

Monolithic kernels could learn from microkernels to keep the componentisation that exists in the source code also in the running code, i.e. to make it clear which component owns which data.

Another thing that monolithic kernels could learn is that bootstrapping is different from the run time. So it is not a good idea to design separate code paths for bootstrap and shutdown, and design the kernel itself as if it is running forever.

Microkernels learn from monolithic kernels to use smart algorithms and data structures. These are usually first developed in monolithic kernels, so in this respect microkernels are actually less innovative, which is surprising.

A second thing that microkernels can learn is scalability. Linux scales to 1000s of CPUs, while microkernels struggle with 16 CPUs. Same goes for scaling to embedded systems.

And finally, monolithic kernels are surprisingly portable, while microkernels often target just a single or maybe two architectures. If you design the OS without portability in mind, it is hard to add later.

Microkernel people sometimes have goals which are just counterproductive.

  • Restarting services as a way to achieve dependability. However, just restarting is pretty bad because you don’t have the internal state anymore. In addition, the logical state is usually spread over several services, so restarting risks bringing down other services as well. And just restarting may simply make it crash again. So it’s a nice idea, but it can’t be a dependability goal on its own.
  • Micro == small. But that doesn’t mean that the last bit of reduction in code size is still worth it: code size or binary size has at best a weak correlation with the actual number of bugs.
  • Trivial algorithms protect against bugs. This is not true: you can have trivial bugs in trivial code, and you can audit or formally verify complex algorithms.

What’s new inside the Linux IEEE 802.15.4 subsystem? (Alexander Aring, Pengutronix)

lowpan devices are IPv6 virtual interfaces on top of a 6LoWPAN (wpan) device. The lowpan interface takes care of fragmentation, protocol handling, and so on.

The project is called linux-wpan (the name zigbee is avoided for trademark reasons, and because it is about more than ZigBee). A rework is in progress to give it a netlink framework (nl802154). The crypto layer is ongoing, and frame parsing and creation are only partially implemented (only data frames). The new framework cherry-picks the good things from the WLAN stack:

  • default interface naming: wpan#;
  • interface type registration: node, monitor, coordinator (todo);
  • an iwpan userspace tool with the same commands as iw, which uses the netlink framework nl802154;
  • a soft-MAC implementation, cfg802154.

Copying ideas from 802.11 is still ongoing.

The nl802154 framework was introduced because of familiarity for WLAN users; before, there was a custom netlink interface. The old interface still needs to be removed.

Next Header Compression (NHC) feature: allows removing information from headers when it stays the same anyway, e.g. IPv6 extension headers or UDP ports. The NHC framework is simple: a kernel module per compression function, with callbacks for compression and decompression. Only UDP exists at the moment; the others are already registered, so a Not Implemented warning is printed. A netlink interface for configuring NHC still has to be created (e.g. for UDP you can choose a shorter CRC).
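
As a rough illustration of what such a per-header module looks like, here is a hypothetical sketch – the struct and callback names are made up and do not match the actual linux-wpan code, but the shape (one module per next header, with compress/uncompress callbacks) is the same:

```c
/* Hypothetical NHC module for UDP: one compression "function" per IPv6
 * next-header value, with callbacks for compression and decompression.
 * Names are illustrative, not the real in-kernel API. */
#include <linux/types.h>
#include <linux/skbuff.h>

struct nhc_ops {
	const char *name;
	u8 nexthdr;	/* IPv6 next-header value this module handles */
	int (*compress)(struct sk_buff *skb, u8 **hc_ptr);
	int (*uncompress)(struct sk_buff *skb, size_t needed);
};

static int udp_compress(struct sk_buff *skb, u8 **hc_ptr)
{
	/* elide the UDP ports/checksum where the 6LoWPAN spec allows it */
	return 0;
}

static int udp_uncompress(struct sk_buff *skb, size_t needed)
{
	/* rebuild a full UDP header from the compressed form */
	return 0;
}

static struct nhc_ops nhc_udp = {
	.name		= "UDP",
	.nexthdr	= 17,	/* IPPROTO_UDP */
	.compress	= udp_compress,
	.uncompress	= udp_uncompress,
};

/* A module init/exit pair would register/unregister nhc_udp with the
 * 6LoWPAN core; headers that are registered but not implemented just
 * trigger the Not Implemented warning mentioned above. */
```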

picoTCP for Linux Kernel tinification (Maxime Vincent, Altran (former TASS))

Linux runs on cell phones, but not (yet) on really small devices. In addition, the minimum kernel size (make allnoconfig) is steadily increasing. As an example of a tiny device, take the STM32F4 Cortex-M4 board from Emcraft – it has 2MB SDRAM and 16MB flash. You need uClinux (no MMU), and you want XIP to save RAM and because flash can be accessed directly. In 3.17, the ‘make tinyconfig’ option was introduced, which e.g. turns on -Os. But to get it really small, you have to disable the TCP/IP stack, because that is large.

So replacing the TCP/IP stack is an important step for tinification. picoTCP was already portable to many platforms; now you can also use it within Linux.

The TCP/IP stack is removed from the kernel, but the socket interface and netdevices remain, so NET=y and INET=n. That saves 164KB (10% of the kernel size); picoTCP adds 43KB again. It also saves 216KB in runtime RAM usage. A new PICOTCP module is added in the tree. The picoTCP stack is standalone, so glue logic is needed: proc files, ioctls, and registering the protocol family. The netdevice code also has to be modified to call into the picoTCP stack. The rtnetlink interface (needed for e.g. iproute2) still has to be implemented; currently only the ioctl interface is supported. IPv6 (which picoTCP supports) isn’t handled yet either.
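
The “register the protocol family” part of that glue can be sketched with the normal kernel socket API: with INET disabled, nothing else claims PF_INET, so a module can take over AF_INET socket creation and route it into picoTCP. The picoTCP side of this sketch is hypothetical; only sock_register()/sock_unregister() and struct net_proto_family are the real kernel interfaces:

```c
/* Sketch of glue logic: let picoTCP provide PF_INET sockets while the
 * kernel socket layer and netdevices stay in place.  The pico_create()
 * body is a placeholder. */
#include <linux/module.h>
#include <linux/net.h>
#include <net/sock.h>

static int pico_create(struct net *net, struct socket *sock,
		       int protocol, int kern)
{
	/* here: allocate picoTCP socket state and install proto_ops that
	 * translate socket calls into picoTCP API calls */
	return -EAFNOSUPPORT;	/* placeholder */
}

static const struct net_proto_family pico_family_ops = {
	.family	= PF_INET,
	.create	= pico_create,
	.owner	= THIS_MODULE,
};

static int __init picotcp_glue_init(void)
{
	/* works because CONFIG_INET=n leaves the PF_INET slot free */
	return sock_register(&pico_family_ops);
}

static void __exit picotcp_glue_exit(void)
{
	sock_unregister(PF_INET);
}

module_init(picotcp_glue_init);
module_exit(picotcp_glue_exit);
MODULE_LICENSE("GPL");
```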

In this project, the stack is put in the kernel instead of running it in userspace. If you put it in userspace, you still need a tun or tap device to make it available to other processes; in-kernel, it becomes available to any application.

Wikimedia adopts Phabricator, deprecates 7 infrastructure tools (Andre Klapper, Quim Gil)

Within Wikimedia, there is quite some diversity in the needs for workflows and tools. One year ago, the main tool was Bugzilla, but there were also RT (Request Tracker), Mingle (scrum board), Trello (scrum board), Gerrit (code review), and then stuff to make these different tools interact, e.g. a hook to add Gerrit comments to a Bugzilla ticket. This diversity in tools made collaboration between teams sometimes difficult and created more maintenance work. So the infrastructure team wanted to find something better – while avoiding inventing yet another new tool.

Teams and developers were asked to describe their needs, which were consolidated into a list of must-haves and nice-to-haves, and people could also propose tools. Many options were pushed into a funnel to decrease the number of proposals, and some test instances were set up. It quickly came down to the question: shall we move to Phabricator? This was discussed in a broader Request for Comments in the entire community.

Phabricator comes from Facebook, has been split off into a separate company (Phacility), and is built with PHP. It has most of the tools that you want integrated: bug management, project management, reviews, scrum views, … Unfortunately, no migration scripts from Bugzilla existed. Since Wikimedia was doing new things (migration scripts), they also immediately worked with upstream. Upstream is really helpful, but they have their own priorities.

Deployment steps: set up a production instance, let users pre-register in Phabricator, migrate bugs from the test instance (linking them to the already registered users), and migrate the Bugzilla data (including dropping some of the data). All of this was planned within Phabricator itself.

The migration itself was pretty complex. Fetching the tickets needed to work around known XML-RPC bugs and took 5 hours; creating the tasks etc. in Phabricator took 25 hours. Note that the IDs have to be remapped; they just added 2000 to the Bugzilla number. After the migration, the old Bugzilla URLs were redirected to the Phabricator URLs. There were lots of problems during the migration, but it worked very well with upstream. They wrote a blog report about all the migration problems.

Gains in Phabricator:

  • Unified login: log in on the wiki and you get access to Phabricator.
  • Nicer layout, configurable.
  • Workflow is simplified (no separate state and resolution fields).
  • Scrum boards, with columns not just on status but also other tags.
  • Burndown charts (custom extension).
  • Tasks can be assigned to several projects, no distinction between e.g. milestones and products, which gives a lot more flexibility (but you have to think about it a little).

Still todo: migrate code browsing and code review. That will be difficult. Also todo: migrate Jenkins.

The Phabricator API is considered unstable, so customisation is costly, and they don’t announce breakage. Contributing requires a CLA.

See https://www.mediawiki.org/wiki/Phabricator

QtCreator for uController development (Tim Sander)

QtCreator was originally an IDE targeted at desktop development, but it has evolved to desktop, mobile, embedded and bare metal. Since last year, it has support for Android, C99, and a clang-based code model. It only supports gdb built --with-python as the debugger, though.

Bare metal development is for the small chips with no external memory: Cortex-M/R, with a hardware debugger (OpenOCD) and no OS or a very small one (FreeRTOS recommended). Otherwise, if Linux is running, you can use the remote Linux debugging plugin.

The bare metal plugin uses a hardware debugger through gdb-server. QtCreator organises the targets by Kits – a Kit is a compiler+debugger combination for a specific target. You have to assemble your toolchain into a BareMetal-type Kit. The build system can be qbs (the declarative build language used by QtCreator), cmake, or qmake (but that needs a fake Qt and is not recommended).

New features in 2014: a Fast Restart option, which avoids having to reflash by just resetting the device. Recently added: gdb provider support, which allows you to configure things from QtCreator itself – targeted for the 3.4 release. To be added later: generic make support (which will make it possible to use this for Linux kernel debugging), and a device view (a structured view of the hardware, giving names to memory-mapped addresses).

SDCC – Small Device C Compiler (Philipp Klaus Krause)

SDCC is a standard C compiler (C89 to C11) for 8-bit architectures, mostly targeting a freestanding implementation (a hosted implementation requires more work). It includes supporting binutils and runs on many hosts. It has some unusual optimisations that make sense for 8-bit processors, particularly in the register allocator.

C standards give some leeway to the compiler that allows it to optimise better, and this is used by SDCC. C89 support is complete except for doubles and struct assignment (you have to memcpy structs explicitly). From C99, declarations mixed into the code and VLAs are missing, as well as compound literals, but the stdint and bool types are supported. From C11, generics and Unicode support are missing, but noreturn and static assert are supported and used for optimisation.
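
To show what that dialect looks like in practice, here is a small made-up example that sticks to what SDCC accepts: stdint/bool and a static assert are fine, but struct assignment is replaced by an explicit memcpy and all declarations stay at the top of the block:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

struct sample {
	uint8_t  channel;
	uint16_t value;
};

/* C11 static assertions are supported and cost nothing at run time. */
_Static_assert(sizeof(struct sample) <= 4, "keep it small on 8-bit targets");

bool copy_sample(struct sample *dst, const struct sample *src)
{
	/* all declarations at the start of the block: no C99-style
	 * declarations mixed into the statements */
	bool ok = (dst != 0) && (src != 0);

	if (ok) {
		/* no struct assignment (*dst = *src); memcpy it explicitly */
		memcpy(dst, src, sizeof(*dst));
	}
	return ok;
}
```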

Many target architectures are supported, but support is usually not very complete because it is very difficult to generate code for them, e.g. because it is difficult to access the stack. On some architectures it is possible to use complex instructions to do many operations at once, e.g. memcpy. Commercial compilers still perform better (in code size), but only by a margin of about 30%.

Optimisations are done first target-independently on the iCode, then target-specifically, and then register allocation is performed. After assembly generation, an additional peephole optimisation step is performed. The register allocation step is very important, because spilling to memory means additional data memory and additional code to do the spill – even when doing direct-memory operations, the instructions that use registers are usually shorter than the instructions that use memory.

For these 8-bit targets, register allocation is complicated because there are different register types, and not all operations access all registers equally. Sometimes it is cheaper to recalculate than to store in memory. Also often a 16-bit register is aliased with two 8-bit registers.

The SDCC register allocator is based on graph-structure theory. It can do theoretically optimal register allocation in polynomial time, but that means it is slow for targets with many registers. Therefore, there is a speed/code quality trade-off option. Bytewise register allocation decides for every byte of a variable whether it goes to memory or register.

The project does regression testing of daily snapshots, running 7000 tests taken from GCC and from fixed bugs. This is done for various targets and various hosts.