Author Archives: Arnout Vandecappelle

Reports from Open Source Summit / Embedded Linux Conference Europe 2018

As usual, I’m writing reports of the sessions I’m attending at Open Source Summit / Embedded Linux Conference / IoT Summit Europe 2018. This time, however, I’m doing it on the Mind website:

You can also find other reports there, e.g. of Kernel Recipes in Paris.

What’s New with ftrace? – Steven Rostedt, VMware

This is an update of what changed in ftrace since Steven’s presentation about its state in v3.18.

What ftrace already did: function tracing, function graph tracing, snapshots, trace events, triggers, and debugging. It uses the hooks installed by gcc’s profiling support. Since gcc 4.6, gcc has the -mfentry option to call fentry() (instead of mcount()) before the stack frame is set up, so that you get access to the parameters. This is used by live kernel patching to replace a function. Left in place, the mcount/fentry calls add about 13% of overhead, so they are replaced by NOPs, which makes the overhead unmeasurable. When tracing is enabled for a function (set_ftrace_filter), its NOP is replaced by the original call.

In addition to filtering on function name, you can filter on a PID, optionally including its children. Triggers define additional things that get done when a trace point is hit: saving a stack trace, saving the trace buffer (= snapshot), turning tracing off or on. Profiling info can also be gathered: hit counts, stack usage. The max depth setting limits how deep you trace (starting from the switch from userspace, i.e. syscall, page fault or interrupt).
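
To make the set_ftrace_filter mechanism a bit more concrete, here is a minimal sketch of how the function tracer is typically switched on through the tracefs control files. The helper names are my own, not something from the talk. On a real system the directory is usually /sys/kernel/tracing and you need root; the demo below runs against a scratch directory so that it works anywhere.

```python
import tempfile
from pathlib import Path


def start_function_trace(tracefs, functions):
    """Limit tracing to `functions`, then enable the function tracer.

    `tracefs` is the tracing directory (usually /sys/kernel/tracing).
    The helper itself is illustrative; only the file names are ftrace's.
    """
    tracefs = Path(tracefs)
    # Function names (or globs) are separated by whitespace.
    (tracefs / "set_ftrace_filter").write_text(" ".join(functions) + "\n")
    (tracefs / "current_tracer").write_text("function\n")
    (tracefs / "tracing_on").write_text("1\n")


def stop_trace(tracefs):
    """Stop recording and go back to the no-op tracer."""
    tracefs = Path(tracefs)
    (tracefs / "tracing_on").write_text("0\n")
    (tracefs / "current_tracer").write_text("nop\n")


# Demo against a scratch directory instead of the real tracefs.
with tempfile.TemporaryDirectory() as scratch:
    start_function_trace(scratch, ["vfs_read", "vfs_write"])
    demo_filter = (Path(scratch) / "set_ftrace_filter").read_text()
    demo_tracer = (Path(scratch) / "current_tracer").read_text()
    stop_trace(scratch)
    demo_tracer_after_stop = (Path(scratch) / "current_tracer").read_text()
```

Only the functions named in the filter get their NOP patched back into a call, which is why tracing a handful of functions stays cheap.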

Instead of tracing on functions, you can also trace on trace events (trace events are added by ftrace to each trace point; the trace point is defined by the source code and gets used by ftrace, perf, …).

For debugging: trace_printk prints to the trace buffer instead of the log buffer, requires no locking etc. A sysctl allows dumping the trace buffer on panic.

Since 4.0: when two different tracers are attached to different functions, a separate trampoline is instantiated for each. This avoids iterating over the tracers, which is quite expensive. Before, the trace function would be called directly only if a single tracer was in use (globally). Now, it can still be called directly as long as it’s on a different function/tracepoint. To be able to do this, the ftrace infra needs to allocate space for the new trampoline and make it executable, so it’s a bit complicated.

Also added in 4.0 is NOT in the trace event filter logic. It existed for functions but not for events.

Also added in 4.0, the very dangerous tp_printk kernel command line option. It will use printk() instead of tracing to a buffer.

[missed some things that went too fast]

Since 4.4 it is possible to filter triggers on CPU# or PID.

Since 4.4 it is possible to set tracing options before the tracer is enabled, so you can do e.g. filtering immediately.

Since 4.4 it is possible to filter functions by module (or not module) with :mod:modulename. This includes *:mod:!* to trace all functions not in modules. Since 4.14 it is also possible to do this before the module is loaded.

Since 4.4 there is a separate filter file for PIDs, which overcomes the limit on the string size on the generic expression filter.

Since 4.9 the event PID filter is extended to also trace children of that PID, so you don’t need to add each child as it appears. Since 4.14 this also works for functions.

Since 4.14, the filters have full glob support, also * in the middle and ? and [].
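
To get a feel for what full glob support buys you, the sketch below uses Python’s fnmatch module as a stand-in. This is an approximation for illustration only, not the kernel’s matcher, and the function list is just a hand-picked sample.

```python
from fnmatch import fnmatchcase


def select(functions, pattern):
    """Return the function names matching a shell-style glob pattern."""
    return [name for name in functions if fnmatchcase(name, pattern)]


# Hand-picked sample of kernel function names.
names = ["vfs_read", "vfs_write", "vfs_readv",
         "ksys_read", "tcp_v4_rcv", "tcp_v6_rcv"]

prefix = select(names, "vfs_*")            # * at the end
middle = select(names, "v*_read")          # * in the middle
single = select(names, "tcp_v?_rcv")       # ? matches one character
bracket = select(names, "tcp_v[4]_rcv")    # [] matches a character set
```

Before 4.14 only patterns like the first one (a * at the start and/or end) were accepted.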

Since 4.14, a hardware latency detector is added to detect SMIs/NMIs, by running loops with interrupts disabled. You give it a width and an interval (called window).

What’s coming: module init functions, …

Steven ran out of time so not everything is covered. Check the slides.

Measuring the Impacts of the Preempt-RT Patch – Maxime Chevallier, Smile

Maxime worked on several projects involving Preempt-RT:

  • Simulation on PC of a real-time system, needed to do real-time response on a network interface.
  • Test bench interfacing with real-time software that needs to react within 1 second but has a lot to do in that time.
  • Embedded telematic board: must never lose an incoming message. Since the customer could add CPU load, the RT patch was needed to make sure message handling has priority.
  • Medical image processing: need to process each frame before the next one comes.

Real-time = deterministic behaviour: bounded latencies, absolute priorities for tasks (SCHED_FIFO, _RR and _DEADLINE), handle complex cases like priority inversion (rt-mutex with priority inheritance), starvation, …. Most of this is already in upstream Linux. What Preempt-RT still adds: full kernel preemption; various optimisations for worst-case scenario instead of common-case scenario.

Full kernel preemption consists of forcing threaded interrupts (so we get priorities for interrupts as well), making locks sleepable (spinlock normally doesn’t allow anything else on the same CPU; sleepable lock will yield when it doesn’t get the lock). Nothing else changes, so all the normal Linux OS is still there. Only the non-RT tasks will have to live with what is left over by the RT tasks.

To analyse the effect of the Preempt-RT patch, use tools like vmstat, mpstat and pidstat. E.g. mpstat shows how many interrupts each core handles. However, take care because they show results differently. For example, without threaded interrupts, interrupts are not counted as context switches in these tools, while with threaded interrupts each interrupt gives 2 (non-voluntary) context switches (one to the interrupt and one back).
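
If you want to check the context switch numbers without going through one of these tools, the per-task counters are also exported in /proc/<pid>/status. The sketch below parses them and computes the difference between two samples. The function names and the sample values are made up for illustration.

```python
def parse_ctxt_switches(status_text):
    """Extract the context switch counters from /proc/<pid>/status text."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            counts[key] = int(value)
    return counts


def ctxt_switch_delta(before, after):
    """Number of context switches that happened between two samples."""
    return {key: after[key] - before[key] for key in after}


# Made-up samples; on a real system read /proc/<pid>/status twice.
sample_before = ("Name:\tcyclictest\n"
                 "voluntary_ctxt_switches:\t150\n"
                 "nonvoluntary_ctxt_switches:\t20\n")
sample_after = ("Name:\tcyclictest\n"
                "voluntary_ctxt_switches:\t400\n"
                "nonvoluntary_ctxt_switches:\t35\n")

delta = ctxt_switch_delta(parse_ctxt_switches(sample_before),
                          parse_ctxt_switches(sample_after))
```

Comparing such deltas between a kernel with and without threaded interrupts shows the accounting difference mentioned above.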

As a benchmark, use stress-ng with a fixed number of operations and measure the execution time. A pure CPU workload shows no difference. The “fault” stressor (which triggers page faults) is significantly slower with Preempt-RT. So you need to test with your own kind of workload. Note that stress-ng contains cyclictest as well.

In addition to applying preempt-RT, you need to do more things to improve predictability:

  • Disable deep-sleep CPU idle states (this increases power consumption). Tweak with cpuidle in /sys/devices/system/cpu/cpuX/cpuidle/stateX or in BIOS.
  • DVFS: use a fixed frequency
  • Disable hyperthreading
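
The cpuidle tweak from the first bullet could be scripted roughly as follows. This is only a sketch: the helper and the latency threshold are my own choices, and it assumes the usual per-state latency (exit latency in microseconds) and disable files. The demo builds a fake sysfs tree in a scratch directory; on a real system the root would be /sys/devices/system/cpu and you need root permissions.

```python
import tempfile
from pathlib import Path


def disable_deep_idle_states(cpu_root, max_latency_us):
    """Disable every idle state whose exit latency exceeds the threshold.

    Returns the list of "cpuN/stateM" entries that were disabled.
    """
    disabled = []
    pattern = "cpu[0-9]*/cpuidle/state[0-9]*"
    for state in sorted(Path(cpu_root).glob(pattern)):
        latency_us = int((state / "latency").read_text())
        if latency_us > max_latency_us:
            (state / "disable").write_text("1\n")
            cpu_name = state.parent.parent.name
            disabled.append(f"{cpu_name}/{state.name}")
    return disabled


# Demo on a fake tree with three idle states for cpu0 (made-up latencies).
with tempfile.TemporaryDirectory() as scratch:
    for index, latency in enumerate([0, 2, 200]):
        state_dir = Path(scratch) / "cpu0" / "cpuidle" / f"state{index}"
        state_dir.mkdir(parents=True)
        (state_dir / "latency").write_text(f"{latency}\n")
        (state_dir / "disable").write_text("0\n")
    demo_disabled = disable_deep_idle_states(scratch, max_latency_us=10)
    demo_state0 = (Path(scratch) / "cpu0/cpuidle/state0/disable").read_text()
    demo_state2 = (Path(scratch) / "cpu0/cpuidle/state2/disable").read_text()
```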

Clearly, you need to know the system. E.g. DMA can give latencies on the SoC bus. SMI is not maskable (it does thermal management…) so measure how long it takes. Hardware resource sharing (e.g. a SIMD unit shared between different cores) can also add latency.

Linux Storage System Bottleneck for eMMC/UFS – Bean Huo & Zoltan Szubbocsev, Micron

Bean and Zoltan (the speaker) work at Micron in the embedded business unit, in storage software, often in automotive. As part of this they have quantified the storage system overhead in embedded systems for access to eMMC, UFS and NVMe, i.e. how much of the speed provided by these storage technologies can actually be achieved by userspace. They also quantified the overall performance improvement of NVMe over UFS. But comparing is difficult since you can’t have an NVMe device that is fully equivalent to a UFS device.

All three technologies are a NAND chip with a controller and firmware. eMMC can get 400MB/s at its interface, may go up to 566 in the next generation. UFS Gear3 can have two lanes of each up to 728MB/s. NVMe Gen3 1000MB/s per lane.

For testing they use Fio in single and multi-threaded mode, always using DirectIO or sync IO, together with the function_graph tracer and blktrace. The trace points allow measuring latency in four stages: from user space submission to BIO submission; from BIO submission to storage device submission; the data transfer itself, up to the completion interrupt; from completion interrupt to block layer completion.

They did experiments on two boards: a somewhat older 2xCortex-A9 Zedboard, and a newer 4xCortex-A57 Nvidia board. On the Zedboard, eMMC performance is completely dominated by software overhead, ranging from 63 to 92% of the total latency spent in software. To some extent this is caused by the cache invalidation which takes a long time in Cortex-A9. On the Nvidia board, performance is a lot better for large sizes (12% to 39%), but still significant for small 4KB request sizes (up to 72%).
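
As I understood it, the software overhead percentages are simply the share of the total latency that is not spent in the data transfer stage. A minimal sketch of that calculation, with stage names and numbers that are entirely made up (they are not from the slides):

```python
def software_overhead(stages_us, hardware_stage="transfer"):
    """Fraction of the total request latency spent outside the device.

    `stages_us` maps a stage name to its latency in microseconds. Everything
    except `hardware_stage` is counted as software overhead.
    """
    total = sum(stages_us.values())
    software = total - stages_us[hardware_stage]
    return software / total


# Made-up latencies for one request, in microseconds.
sample = {
    "submit": 40.0,        # user space submission -> BIO submission
    "block_layer": 60.0,   # BIO submission -> device submission
    "transfer": 100.0,     # data transfer until completion interrupt
    "completion": 50.0,    # completion interrupt -> block layer completion
}
share = software_overhead(sample)   # 150 of 250 microseconds are software
```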

Experiments for UFS and NVMe have to be done on different boards. For 4K write there is still significant (60-74%) overhead. The graphs showed that with 8 threads the latency is significantly decreased; someone in the audience suggested that this is due to interrupt coalescing, which amortizes the interrupt time over all threads. Even for 128K accesses the overhead is non-negligible. The results show that the overhead of NVMe is indeed significantly lower, e.g. for 4K random write the overhead of NVMe is only 66% that of UFS. To estimate the system-level performance, they used some formulas that I didn’t understand but that result in 8-25% speed difference between UFS and NVMe. This difference is not so much because of faster NVMe, but mostly because the Linux stack is better. However, the hardware queue size is also a factor: NVMe can support a lot more outstanding tasks. For very high thread counts, UFS performance starts to drop while NVMe sustains its performance.

printk() – It’s Old, What Can We Do to Make It Young Again? – Steven Rostedt, VMware & Sergey Senozhatsky, Samsung Electronics

Sergey made the patches, Steven is the reviewer, so Sergey did the presentation.

printk() is complicated. It takes a number of locks, which ones exactly depends on your .config. In addition, you can do printk() from an NMI that interrupted an NMI. It is easy to deadlock, e.g. printk() may take the scheduler spinlock, so you can’t use it in the scheduler when that lock is taken already. The same goes for lockdep: reporting a deadlock will call printk() again. Therefore, for a long time printk() did lockdep_off() and disabled the RCU validator.

So printk_safe was created. This allows printk() to be reentrant, enables lockdep again in printk(), and generally makes printk() less deadlock prone. But it’s not quite reentrant yet. printk_safe() can’t be called from sleeping context.

The fundamental problem is that printk depends on two different types of locks: locks internal to printk, and locks that come from somewhere else (e.g. locks in the serial driver). A solution would be to do printk_deferred() everywhere, which means just one internal lock. But the actual printing has to be done somewhere, and in bad lockup scenarios there is no guarantee that e.g. IRQ context will ever arrive again. Alternatively, we could fall back on early_printk which doesn’t do locking but breaks e.g. dmesg. For example, add a write_on_panic() callback to the console struct that is completely lockless and can’t be called from any context except panic().

There used to be zap_locks() in printk that would drop locks when it detects recursion, but that only looked at the internal locks while the external ones are actually the tricky ones. Since there is now a better way of handling recursion, the function is removed. However, the zap_locks() approach could be used in console drivers. Add a zap_locks() member and call it from panic context so that you can re-enter the write() function.

It could also be possible to remove locking from the console drivers, i.e. the console functions don’t do any locking themselves, the callers do, by calling a lock()/unlock() member.

console_sem is used for a lot more than printk(). printk() uses it to make sure the print happens only on a single CPU. The console also uses it to handle line wrapping, UTF8 encoding, avoid mixing with TTY processing, cursor blinking, avoiding race between printk() from user context and printk() from IRQ, … Also for non-printk() related things: power management, adding/removing consoles. Some of these can schedule() with console_sem held. This means that in a livelock situation, printk() won’t come out again because it won’t acquire the console_sem.

The problem is that printk() is a mix of different subsystems: framebuffers, serial ports, TTY, sched, timekeeping, …. Maybe it’s time for a new printk() API. Change into a polling API: printk just goes into a buffer, and the consoles poll the log buffer. This allows removing console_sem. This is in fact how serial drivers work already: they transmit xmit characters out of their buffer. This might even work with not all consoles working in polling mode. The problem is that there is no immediate flush, so when printk() returns the message is not printed yet.

printk() makes sure there is just one CPU printing the buffer, the other CPUs just append to the buffer. The printing CPU continues printing until the buffer is empty. However, if the printing CPU is in atomic context (e.g. IRQ), you’re adding unbounded latency, which is some kind of lockup. So you’d like some preemption points in there. But that doesn’t solve all problems, and it also slows printk() down, which is an issue for e.g. the OOM print, which slows down the OOM killer. So for some people it’s actually making things worse. Again, polling could be a solution, this time from a printk_thread, but that didn’t work. People really want to have direct printk(), because you want it when your system dies.
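
To illustrate the polling idea (and its drawback), here is a toy model in Python. It is purely conceptual: the class and method names are mine and have nothing to do with actual kernel code. printk() only appends to a shared buffer, and each console keeps its own cursor and catches up when it is polled.

```python
class LogBuffer:
    """Shared log buffer: printk() only appends, it never touches a console."""

    def __init__(self):
        self.records = []

    def printk(self, message):
        self.records.append(message)
        return len(self.records) - 1   # sequence number of the new record


class PollingConsole:
    """A console that pulls records from the log buffer at its own pace."""

    def __init__(self, log):
        self.log = log
        self.next_seq = 0      # first record this console has not printed yet
        self.output = []

    def poll(self, budget):
        """Print at most `budget` pending records, return how many were printed."""
        end = min(len(self.log.records), self.next_seq + budget)
        self.output.extend(self.log.records[self.next_seq:end])
        emitted = end - self.next_seq
        self.next_seq = end
        return emitted


log = LogBuffer()
serial = PollingConsole(log)       # slow console, small budget per poll
netconsole = PollingConsole(log)   # fast console

for i in range(3):
    log.printk(f"message {i}")

printed_right_after_printk = list(serial.output)   # still empty: no flush
serial.poll(budget=2)
netconsole.poll(budget=10)
remaining_on_serial = serial.poll(budget=2)
```

No lock is shared between the consoles, but nothing is on the wire when printk() returns, which is exactly the problem when the system is dying.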

syscall_intercept – A User Space Library for Intercepting System Calls – Krzysztof Czurylo, Intel

Krzysztof is in a team that mostly works on persistent memory programming. syscall_intercept is a satellite project. Source on

libpmemfile is a fully userspace filesystem (with persistent memory as backend), so not FUSE-based, nothing goes to the kernel. syscall_intercept is part of libpmemfile. It patches all the system calls and replaces them with jumps to a hook function. So it’s like LD_PRELOAD but then for syscalls instead of libc functions.

syscall_intercept patches the code. To be able to do that, it first disassembles the code to find the syscalls using libcapstone, then finds their context (not always trivial/possible), and hotpatches the code with a jump. It only patches libc – in most cases that’s the only one doing syscalls, but it’s also possible to patch the entire .text in the binary (except libsyscall_intercept itself and libcapstone). There is a single syscall hook function that checks the syscall number argument to decide what to do.

Capstone is an open-source disassembly framework. It is used to iterate through all instructions and evaluate if it is a syscall. Also the next instruction has to be evaluated to see if it is relocatable, and if it depends on the instruction pointer. The call can’t be replaced with a direct call or jump to a C function due to the argument and stack prologue, so there is a wrapper routine to set that up. For each syscall instance, a wrapper is instantiated that jumps back directly to the original address, which avoids problems with stack etc. Since on x86_64 a syscall is 2 bytes while a long jump is 5 bytes, you need to make space. If the subsequent instructions can be relocated, then they are put in the wrapper instance. Else, they look for a nearby hole of 5 bytes and issue a short jump to it. And there are a few other solutions too.
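
The “make space” step can be sketched in a few lines. This is my own illustration of the principle, not code from syscall_intercept: it assumes the disassembler has already told us how many bytes of relocatable instructions follow the syscall, and all addresses are made up.

```python
import struct

SYSCALL = b"\x0f\x05"   # x86_64 syscall instruction, 2 bytes
JMP_REL32_LEN = 5       # opcode 0xE9 plus a signed 32-bit displacement
NOP = b"\x90"


def encode_jmp_rel32(src, dst):
    """Encode a jmp from address `src` to address `dst`.

    The displacement is relative to the end of the 5-byte jump instruction.
    """
    displacement = dst - (src + JMP_REL32_LEN)
    return b"\xe9" + struct.pack("<i", displacement)


def patch_syscall(code, offset, relocated_len, base, wrapper):
    """Overwrite the syscall at `offset` with a jump to `wrapper`.

    `relocated_len` is the number of bytes of following instructions that are
    moved into the wrapper; `base` is the load address of `code`.
    """
    if bytes(code[offset:offset + 2]) != SYSCALL:
        raise ValueError("no syscall instruction at this offset")
    span = len(SYSCALL) + relocated_len
    if span < JMP_REL32_LEN:
        raise ValueError("not enough room for a 5-byte jump")
    jump = encode_jmp_rel32(base + offset, wrapper)
    # Pad with NOPs if more bytes were relocated than the jump needs.
    code[offset:offset + span] = jump + NOP * (span - JMP_REL32_LEN)


# mov eax, 39 ; syscall ; mov rbx, rax ; ret
code = bytearray.fromhex("b827000000" "0f05" "4889c3" "c3")
patch_syscall(code, offset=5, relocated_len=3, base=0x1000, wrapper=0x2000)
```

After patching, the syscall and the 3-byte mov are replaced by one 5-byte jump. The wrapper is then responsible for calling the hook, executing the relocated mov and jumping back to the ret.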

SYS_clone is a special case: because you end up with two processes with different stack pointers, simply restoring registers doesn’t work. So there is a complicated workaround. There are also problems with rt_sigreturn and ptrace, for which there is no workaround.

syscall_intercept is just an SDK. You need to write a library that is loaded with LD_PRELOAD and that does something useful in the syscall hooks. To avoid loops, any actual syscall made by the library has to use syscall_no_intercept() instead of syscall().

This can be used for example to make a replacement of strace that doesn’t make any extra syscalls, just logs every syscall. This is one of the examples in the repo.

Problem when running the program under GDB: you don’t want to instrument gdb.

Code is patched only once, so generated code or dynamically loaded code is not hooked. Also handwritten assembly that uses some tricks or non-standard ways of issuing a syscall could be problematic.

Other things you can do with this library: Error injection, a faster strace, userspace device emulation (which is basically the libpmemfile use case). Also, the same approach could be applied to other instructions than syscalls, as long as they are recognisable in assembly.

syscall_intercept is currently x86_64 only. It could be extended with other arches supported by libcapstone, but that would require supporting their syscall interface.

Interesting question from the audience: could the vDSO approach have been used instead of hotpatching? Neither the speaker nor the audience knew an answer to this.

Buildroot: Making Embedded Linux Easy? A Real-Life Example – Yann Morin, Orange

Yann works for Orange and develops set-top boxes in three teams in two locations. Most are application developers, not Linux or embedded experts. The main part of the firmware comes from third parties. To put this together, they need a generic build system that is not dependent on the target and middleware. It should be easy to use and not take too much build time. A home-grown build system was tried before but without success. The build system provided by the provider of the middleware is very specific, for a specific target, not generic enough.

Evaluated build systems:

  • OpenEmbedded: distro generator, steep learning curve and no in-house knowledge.
  • Buildroot: firmware generator (= what they were looking for), moderate learning curve and in-house knowledge (Yann), extendable (BR2_EXTERNAL)
  • Others: no community (except OpenWRT).

Buildroot is simple (package in a few lines) and efficient (doesn’t take longer than absolutely necessary), is entirely community-driven (no companies behind it), and has community resources like a website and a manual.

Build process: make .config file, build toolchain, build packages, run finalize step that cleans up unnecessary cruft, generate filesystem images (with hooks at various steps). Package build process: download, extract, patch, configure, build, install (with hooks before and after each step).
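
The package build process with its hooks can be modelled as a simple loop. The sketch below is a Python model for illustration only (Buildroot itself implements this in make) and the hook contents are invented.

```python
STEPS = ["download", "extract", "patch", "configure", "build", "install"]


def run_package(name, actions, hooks=None):
    """Run every step in order, with its pre and post hooks around it.

    `hooks` maps ("pre" | "post", step) to a list of callables.
    Returns a log of everything that was executed, in order.
    """
    hooks = hooks or {}
    log = []
    for step in STEPS:
        for hook in hooks.get(("pre", step), []):
            log.append(hook(name))
        log.append(actions[step](name))
        for hook in hooks.get(("post", step), []):
            log.append(hook(name))
    return log


# Trivial actions that only report what they would do.
actions = {step: (lambda pkg, step=step: f"{pkg}: {step}") for step in STEPS}

# Invented hooks, e.g. to fix up sources or clean up after installation.
hooks = {
    ("post", "patch"): [lambda pkg: f"{pkg}: post-patch hook"],
    ("pre", "install"): [lambda pkg: f"{pkg}: pre-install hook"],
}

build_log = run_package("libfoo", actions, hooks)
```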

The external tree (BR2_EXTERNAL) is the place for customisations. Use Buildroot as a git submodule.

Config files are saved as defconfigs under the configs/ directory. Some people use defconfig snippets and stack them together to make different variants of the board. However, that way you can’t use the Kconfig UI: it can only save a complete defconfig.

New packages go in packages/, new filesystems go in fs/, all use the same Buildroot syntax.
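
Stacking defconfig snippets boils down to “later fragments override earlier ones”. A minimal sketch of that merge, assuming plain Kconfig-style lines; the helper names and the fragment contents are hypothetical:

```python
import re

_NOT_SET = re.compile(r"^# (\w+) is not set$")


def parse_fragment(text):
    """Parse Kconfig-style lines into {symbol: value}, None meaning 'not set'."""
    symbols = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        unset = _NOT_SET.match(line)
        if unset:
            symbols[unset.group(1)] = None
        elif line and not line.startswith("#"):
            name, _, value = line.partition("=")
            symbols[name] = value
    return symbols


def stack_fragments(*fragments):
    """Merge fragments in order; a later fragment overrides an earlier one."""
    merged = {}
    for text in fragments:
        merged.update(parse_fragment(text))
    return merged


def render(symbols):
    """Turn the merged symbols back into defconfig text."""
    lines = [f"# {name} is not set" if value is None else f"{name}={value}"
             for name, value in symbols.items()]
    return "\n".join(lines) + "\n"


# Hypothetical board fragment plus a variant that swaps the SSH server.
base = "BR2_arm=y\nBR2_PACKAGE_BUSYBOX=y\nBR2_PACKAGE_DROPBEAR=y\n"
variant = "# BR2_PACKAGE_DROPBEAR is not set\nBR2_PACKAGE_OPENSSH=y\n"

merged_defconfig = render(stack_fragments(base, variant))
```

The merged result is a complete defconfig again, but the knowledge of which line came from which snippet is gone, which is why saving from the Kconfig UI doesn’t fit this workflow.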

For fine-tuning, you can use a custom skeleton and overlays. However, don’t use overlays too heavily. Preferably create a package for them, even if it just copies stuff to the target. That allows using all of the possibilities of Buildroot, e.g. taking into account size.

In the external tree, it is also possible to add extra logic, e.g. a custom make rule to check that there are no circular dependencies. Or add a hook that is run after every package build or just before creating the filesystem. Preferably move things to helper scripts instead of doing them directly in make, since that is much easier to maintain (e.g. syntax highlighting). It is even possible to add infrastructure. Orange added an orange-package infrastructure that adds some features, like installation of documentation.

To avoid creating too many make variables (which makes ‘make’ slower), avoid defining new variables in the infrastructure. Instead, use as much as possible generic variables. In addition, it is more readable since you don’t have to double-dollar to escape the call to eval. Adding such internal infra makes it much easier for developers to add packages.

UIDs can be created automatically by Buildroot. However, that means that adding a package could change the UID chosen. To avoid that, Orange packages must declare the UID explicitly. The orange infra checks that it is explicit.
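
I don’t know how the Orange infrastructure implements this check, but the idea could look like the sketch below. It assumes a Buildroot-style users table (username, uid, group, gid, … separated by whitespace) in which a negative UID asks Buildroot to pick one automatically. The helper name and the sample entries are hypothetical.

```python
def users_without_explicit_uid(users_table):
    """Return the usernames whose UID is left for Buildroot to choose.

    Assumes one user per line, with the username in the first field and the
    UID in the second; a negative UID means "assign automatically".
    """
    offenders = []
    for line in users_table.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue   # skip blank lines and comments
        username, uid = fields[0], int(fields[1])
        if uid < 0:
            offenders.append(username)
    return offenders


# Hypothetical users table of two packages.
table = (
    "appsvc 1500 appsvc 1500 * - - - Application service\n"
    "logger -1 logger -1 * - - - Log daemon\n"
)
offenders = users_without_explicit_uid(table)
```

A package infrastructure could then fail the build as soon as the list of offenders is not empty.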

For D-Bus, there have to be authorisation files that allow a specific application to access specific objects on the bus. That is tedious to declare. So Orange has a script that scans the source code and looks which access each package needs, and creates the authorisation file automatically. There are some exceptions that have to be handled, that is done with extra variables in the package .mk file. AppArmor uses a similar approach. This is also a reason to not put things in an overlay but in a package: that way, all this magic can work.

Orange has added a lot of infrastructure to automate things without need for the developer to take (much) action. It makes sure that these things are done systematically, reproducibly and maintainably.