Impressions from the Embedded Linux Conference – Europe 2010

Here is a brief report of the sessions I attended at the ELC-E 2010. There were 280 attendees this year – 100 more than last year! The GStreamer effect?

These were the most interesting talks:

I also learned something from the talks about flash filesystems, but just the following: if you use squashfs, put it on top of a UBI layer because you need wear levelling even if you don’t write.

Linux in mobile – Ari Rauch, TI + demo of Pandaboard

TI believes in openness (well, starts to 🙂) because you need many brains to drive innovation. TI has made serious mistakes with respect to openness in the past (e.g. releasing to open source without upstreaming), but is committed to improving this. That requires transforming the company culture.

TI no longer tries to support Linux, WinCE and Symbian at the same time. Instead, focus on Linux. Still allow several distros (Android, MeeGo, Ubuntu).

No open source driver for the graphics core 🙂 Because this is not just TI, it’s developed externally and they’re still scared of opening up.

Pandaboard = OMAP4 community board. Android, Ubuntu, Gentoo, Angstrom, MeeGo, … JTAG supports open source debugging. LCD connector, DVI/HDMI out, Ethernet, WLAN/Bluetooth, 1GByte(?) DDR2. Ubuntu compiles natively on the board; can be used as a desktop environment. Media acceleration available through TI PPAs. Pandroid environment available from TI.

Supporting Maintainers – Wolfram Sang, Pengutronix

Wolfram is speaking from personal experience.

Basic assumption: mainlining is good. However, after sending a patch (that is checkpatched, compiletested and tested on HW), nothing happens. Why?

  • Subsystem is orphaned, official maintainer either has gone or doesn’t have the hardware.
  • Maintainer is too busy.

Methods to improve acceptance.

  • Complain. But that doesn’t help. Instead, think of what you can do.
  • Become a (co-)maintainer yourself. This isn’t often possible, because you don’t have time or expertise. Alternatively, you can offer to maintain an L2 branch for a sub-subsystem. Ask beforehand if this is wanted. Real maintainer can pull this branch after minimal review.
  • Resend. Check if the CC list can be improved. Check if you have responded to all feedback. Look at how other patches have been handled in the past; perhaps the subsystem has a specific procedure (e.g. arm-linux has a patch tracking system, some subsystems have specific rules for formatting of Subject).
  • Be a member of the community, so people know you: test, review, and ack other patches, and answer other people’s questions.
  • Keep Acked-by, Reviewed-by and Tested-by tags in later versions and resends. Keep a changelog in later versions.
  • Add a cover letter to your patch set which mentions your public git branch and on which state you were based. Ask for review explicitly in the cover letter.
  • Rebase your patch on latest dev-tree of the maintainer (sometimes difficult to find it; if you find it, update MAINTAINERS; when in doubt, use Linus tree). Publish a git branch (for instance on infradead or pengutronix). Rerun your tests.
  • Give advance warning with an RFC patch. This helps to avoid problems in cases of parallel development.
  • Subscribe to the mailing list before you start developing.

Looking at statistics, only a small percentage of patches has Acked-by, Reviewed-by and Tested-by tags. This really should improve. Developers should get time from their employers to do this – especially if more SoC vendors start mainlining, that means more reviews are needed.
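
To make the statistic concrete, here is a minimal sketch of how one could measure tag coverage over a set of commit messages. The sample messages are made up for illustration, and this is not the method used in the talk; on a real tree you would feed in the messages from the git log.

```python
import re
from collections import Counter

# The three tags named above; matched case-insensitively at the start of a line.
TAGS = ("Acked-by", "Reviewed-by", "Tested-by")
TAG_RE = re.compile(r"^(%s):" % "|".join(TAGS), re.IGNORECASE | re.MULTILINE)


def tag_coverage(messages):
    """Return, per tag, the fraction of commit messages that carry it."""
    counts = Counter()
    for msg in messages:
        # A tag counts at most once per commit, however often it appears.
        found = {m.group(1).lower() for m in TAG_RE.finditer(msg)}
        for tag in TAGS:
            if tag.lower() in found:
                counts[tag] += 1
    total = len(messages) or 1  # avoid division by zero on empty input
    return {tag: counts[tag] / total for tag in TAGS}


if __name__ == "__main__":
    # Made-up commit messages, for illustration only.
    sample = [
        "i2c: fix timeout\n\nSigned-off-by: A <a@example.org>\nAcked-by: B <b@example.org>\n",
        "spi: add driver\n\nSigned-off-by: C <c@example.org>\n",
        "mmc: cleanup\n\nSigned-off-by: D <d@example.org>\nTested-by: E <e@example.org>\n",
        "net: typo\n\nSigned-off-by: F <f@example.org>\n",
    ]
    for tag, share in tag_coverage(sample).items():
        print(f"{tag}: {share:.0%}")
```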

V4L on SoCs – Hans Verkuil – Cisco/Tandberg

Originally, video devices were PCI, so single pipe of data. With SoCs, there are more complex dataflows, e.g. on OMAP3: separate blocks for input, scaling, colour balance, … which can be connected in different ways. For an embedded system, you don’t want to hide all this complexity in the driver (like you would on a PC). V4L is also for output, but on SoCs the overlap between video output and graphics output is larger.

Unlike other drivers, V4L doesn’t address a single chip but a lot of supporting stuff. For instance I2C and preprocessing blocks in addition to camera. So there is a main driver and a constellation of smaller drivers around it.

Since V4L didn’t support the SoC model, vendors started building their own – and didn’t upstream. The situation is improving (for TI). V4L was lacking a core framework, so every driver had to reinvent the wheel and copied over buggy code from other drivers. 2009 (2.6.30) added a real V4L framework. Brainstorming session about memory management in March 2010. Many companies participated in these (ranging from Intel to Qualcomm).

These improvements were added (as of 2.6.36).

  • Timings API for HDTV.
  • Event passing API over the same file descriptor. You can select() on it.
  • Improved API for controls (e.g. lighting frequency). Core control framework makes it easy to implement this complex API, using kind of virtual methods in the core framework.
  • Buffering is work in progress. We want to be able to e.g. capture directly into an OpenGL texture. Avoid unnecessary cache invalidates and memory locking. Contiguous (physical) memory allocation – requires support from mm to free up the contiguous space.
  • Multi-planar framebuffers with different memory layouts are work in progress.
  • videobuf is being replaced by something completely new that actually works and is documented.

Infrared devices are part of v4l because they’re usually part of TV cards. lirc is out of tree so it is now broken. Most lirc drivers are moving into staging.

Core V4L framework (take a look at the slides for nice drawings).

  • v4l2_devnode: port to user space.
  • v4l2_device: top-level device, central control.
  • v4l2_subdev: specific block of the video subsystem, e.g. scaler, … This allows reuse between different SoCs that use the same blocks. All the work is done by the subdevs.
  • v4l2_ctrl_handler: both devices and subdevs have these. Device-specific handling of control API.
  • Subdevices can create their own device nodes, which allow complete control of the subdevice. This gives full flexibility while having a simple, standard v4l2 interface at the same time. Subdevices can add their own ioctls.

Media controller: gives internal topology of SoCs, including video devices, framebuffer, alsa, i2c, … It essentially exports the block diagram of the hardware to userspace. If all goes well, this topology could be exported directly to GStreamer to configure hardware pipelines.
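
To illustrate what “exporting the block diagram” could buy userspace, here is a small sketch that models a topology as a graph and searches it for a pipeline. The entity names and links are hypothetical, and this is plain Python, not the media controller API itself.

```python
from collections import deque

# Hypothetical topology: each entity maps to the entities its output can feed.
LINKS = {
    "sensor": ["preview", "resizer"],
    "preview": ["resizer", "video-out-0"],
    "resizer": ["video-out-1"],
    "video-out-0": [],
    "video-out-1": [],
}


def find_pipeline(links, source, sink):
    """Breadth-first search for the shortest chain of entities from source to sink."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == sink:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the hardware offers no route between these two entities


if __name__ == "__main__":
    print(find_pipeline(LINKS, "sensor", "video-out-1"))
```

An application (or a GStreamer element) that has the topology can pick a route like this and then configure each block along it, instead of relying on the driver to hide the choice.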

State of Multimedia in 2010’s Embedded Linux Devices – Benjamin Zores, Alcatel-Lucent

This was in parallel with my own presentation, but I looked at the slides and it’s really useful! It covers two major things: how to select a SoC for an embedded multimedia device (and it mentions a huge number of SoCs), and the state of open source libraries to run on these chips (ranging from DirectFB to ffmpeg).

Lightweight Prelinker for Kernel Modules – Carmelo Amoroso / Rosario Contarino, STMicroelectronics

Module loading takes long (extending boot time) because resolving symbols takes long. Idea: resolve symbols ahead of time, during build (as much as possible; if symbol is in another module, it can’t be pre-resolved). Relocation of symbols still has to be done at load time.

This is possible because in vmlinux, exported symbol addresses are absolute. Kbuild guarantees there are no duplicates. So module’s symbols are looked up in vmlinux ELF header, and if defined there they are resolved in the module and marked absolute. Also an empty .preresolved ELF section is added to the module, so module loader can skip the resolution step, or look only in the already loaded modules, not in the kernel symbol table itself. Obviously, this means that modules have to be recompiled every time the kernel is.
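
The build-time step can be sketched as follows. This only models the bookkeeping (which symbols can be fixed ahead of time and which cannot); the symbol names and addresses are made up, and the real tool works on ELF files rather than dictionaries.

```python
def preresolve(module_undefined, kernel_exports):
    """Split a module's undefined symbols into pre-resolved and pending ones.

    module_undefined: iterable of symbol names the module imports.
    kernel_exports:   dict of exported vmlinux symbol name -> absolute address.
    """
    resolved = {}
    pending = []
    for name in module_undefined:
        if name in kernel_exports:
            # Address is absolute in vmlinux, so it can be fixed at build time.
            resolved[name] = kernel_exports[name]
        else:
            # Probably exported by another module: must wait until load time.
            pending.append(name)
    return resolved, pending


if __name__ == "__main__":
    # Made-up addresses, for illustration only.
    exports = {"printk": 0xC0012340, "kmalloc": 0xC0045678}
    done, todo = preresolve(["printk", "kmalloc", "other_module_helper"], exports)
    print({k: hex(v) for k, v in done.items()}, todo)
```

At load time only the `pending` list still needs a lookup, and only among the already loaded modules.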

Compared to GNU hash (an optimized module loader), there is an overall gain of roughly 10% in the total time spent on loading 100 modules. Compared to modprobe (which linearly looks up symbols), there is about a factor 3 gain in the total time spent on loading 80 modules.

Currently trying to upstream this.

Further optimization: strip the symbol table from the kernel if all modules are preresolved anyway. Another option: tell the loader in which module the symbol has to be looked up.

GNU/Linux/Open Source on ARC – Mischa Jonker / Ruud Derwig, Virage Logic/Synopsys

[I was a bit distracted during this presentation so I may have missed some important points.]

ARC traces its origins to the Super FX coprocessor used in SNES game cartridges. It used to be just behind ARM as the top embedded 32-bit processor. Now a Synopsys DesignWare component. You can extend the processor with e.g. DSP instructions and multiport scratch-pads.

First step: porting GCC to ARC: write machine description, RTL generator, code generator. Proprietary assembler is still used. Some problems converting RTL to assembler, because some RTL instructions have to be split, but that means the branch delay slots change.

Branch delay slots are also tricky for context switching assembly: more registers to save.

Currently a sourceforge project, working on upstreaming. Convincing management is a hurdle.

Board Bring-up – Dave Anders AKA prpplague, TI (worked on Beagleboard and Pandaboard)

Collaboration between hardware developers and software developers. Board bringup is a big subject, you could fill an entire conference with board bringup issues. This presentation: list things that can go wrong at the hardware side.

Board bringup is like MasterMind: make guesses, for which you get partial information. The science is in making the right guesses.

What can go wrong in schematic and PCB:

  • Datasheet errors and omissions end up in schematics that are wrongly designed.
  • Cut and paste schematics from a previous design: it’s easy to forget part of the functionality of a block, e.g. a level shifter that is on a different page of the schematic.
  • Mode errors: e.g. I/O expander that supports both SPI and I2C, need to put a pull-up or pull-down to select the correct mode.
  • When new parts are used, sometimes the pins are not laid out in the correct place. Similarly, connector pins are sometimes swapped.
  • Sometimes the PCB layout people can still change something in the schematic, e.g. position of ground pins or polarity of USB lines. If these don’t get back-annotated in the schematic, you have wrong information.
  • Components are sometimes swapped or rotated because labeling in layout is too complex to understand.
  • Wrong parts are used, e.g. for wrong voltage.
  • “Do not populate” parts: if some of them have to be populated after all, things will almost certainly go wrong.

For debugging the board, you have a lot of hardware tools: oscilloscopes and stuff, but also e.g. loopback for serial, USB hub, I2C eeprom, to be able to test if the buses are working at all.

Debugging software: Sigrok (logic analyzer, see project page for hardware, including USB analyzers), OpenOCD (JTAG debugger), Gerbv (Gerber viewer), Devmem2 (access physical memory addresses from user space), fb-test (framebuffer test pattern), evtest (trace input events from keyboards, touchscreens, …), i2c utils (scan i2c bus).
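
To show what a devmem2-style tool does under the hood, here is a minimal sketch that maps the page containing an address and reads one 32-bit word from it. It is not devmem2 itself: the function name is mine, it is read-only, and it does not guarantee that the read is issued as a single aligned 32-bit bus access, which can matter for real device registers.

```python
import mmap
import os
import struct


def read_word(path, addr):
    """Read one native-endian 32-bit word at byte offset `addr` of `path`."""
    page = mmap.ALLOCATIONGRANULARITY
    base = addr & ~(page - 1)   # mmap offsets must be page-aligned
    offset = addr - base        # position of the word inside the mapping
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, offset + 4, access=mmap.ACCESS_READ, offset=base) as m:
            return struct.unpack_from("=I", m, offset)[0]
    finally:
        os.close(fd)
```

On a board you would call it as root with `/dev/mem` as the path and the physical address of the register you are interested in; for trying it out on a desktop, any regular file that is large enough works as well.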

Debugging procedure: Observe problem, formulate theory, formulate test, analyze results. Work on one problem at a time and break it up.

Time spent on board bringup is somewhat unpredictable, but if the design is based on an existing reference board, it shouldn’t be too bad. The condition is that you get all the information.

Linux as the Bootloader – John Ogness, Linutronix

Linux has drivers for everything, which are then ported to the bootloader.

Bootloader responsibilities: set up RAM, set up MTD, load and run kernel, provide failsafe boot. But other stuff is creeping into bootloaders: show splash screen, play music, user interaction, external devices, … Is it really necessary to have all this in the bootloader? Only advantage is that you have early access to the hardware – e.g. splash screen makes sense. But this is outweighed by the disadvantages.

Linux (minimal) as a boot loader. It has all the extra features that people want to put in the bootloader. Its drivers are much better tested and more complete. Use kexec to jump into the final kernel. E.g. load the kernel from USB when a service technician comes by.

A stage one bootloader is still needed to load the linux bootloader. This can be less than 16K. It must also initialize RAM and MTD, and provide failsafe boot. However, the stage one bootloader can’t simply read from MTD, because of NAND peculiarities: ECC, bad blocks, … So stage one bootloader must already support (read-only) UBI (Unsorted Block Images) access (but only for static volumes, no need to access a file system (ubifs)). To make this simple, just look at the checksum of the volume; if it is not valid, look for a more recent version of the module.

A UBI-aware stage one bootloader fits into 12K.

Example setup:

  • 1st partition: stage one bootloader, copied 4 times (in case one of the copies goes bad).
  • 2nd partition: UBI managed with 4 static volumes: stage one configuration, bootloader kernel, production kernel, failsafe kernel.
  • 3rd partition: UBIFS rootfs
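
The checksum-based selection in the stage one bootloader could look roughly like this. It is a sketch under my own assumptions: the fallback order (bootloader kernel first, then failsafe kernel) and the data layout are invented for illustration, and a real stage one loader would be a few kilobytes of C reading the CRC that UBI stores for static volumes.

```python
import zlib


def pick_kernel(volumes, order=("bootloader kernel", "failsafe kernel")):
    """Return (name, data) of the first volume whose stored CRC32 matches.

    volumes: dict of volume name -> (data_bytes, stored_crc32)
    order:   assumed preference order; not taken from the talk.
    """
    for name in order:
        if name not in volumes:
            continue
        data, stored_crc = volumes[name]
        if (zlib.crc32(data) & 0xFFFFFFFF) == stored_crc:
            return name, data
    raise RuntimeError("no bootable volume found")


if __name__ == "__main__":
    good = b"failsafe image"
    volumes = {
        # Stored CRC deliberately wrong, simulating a corrupted volume.
        "bootloader kernel": (b"damaged image", 0),
        "failsafe kernel": (good, zlib.crc32(good) & 0xFFFFFFFF),
    }
    print(pick_kernel(volumes)[0])
```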

Exploiting on-chip mem – Will Newton, Imagination Technologies

Linux isn’t very good at making on-chip memory available to userland. Time-sensitive apps want to use on-chip memory (instead of cache/SDRAM combination) because it has guaranteed access time (cache miss can cost 100 cycles…). Also bus contention can affect other hardware (e.g. framebuffer). Finally, memory bus and SDRAM consume a lot more power.

On-chip memory is sometimes not even MMU-mappable. So it resides at a fixed address.

First solution: add sections to ELF executable, and modify elf_map in kernel to load these sections at the correct address.

For shared libraries, you can’t have fixed addresses. It’s easy to put the whole object in on-chip memory, but not parts. So only possible for small libraries. So if you want to put the critical parts of libavcodec on-chip, you need to split it into two libraries…

Dynamic allocation: add API to dynamically allocate on-chip memory area.

Branch offsets are an issue, because the on-chip memory is mapped far away from normal memory.

Internal memory (behind MMU) is simpler: can use same approach as NUMA architectures. You add a cpu-less NUMA node and use set_mempolicy. Allows to dynamically define policy for future page allocations (of this process). mbind is a similar system call, but that allows moving pages.

Upstreaming: problems with management…

ARM Flattened Device Tree Status Report – Grant Likely, Secret Lab Technologies

FDT is laying the foundations for future ARM distributions. FDT is a binary format for representing the OpenFirmware Device Tree. It is passed to kernel at boot time, instead of hard-coding BSP. Easier to manage different versions of the board. Also good for supporting a large number of boards in the mainline. Proof that this works on x86. It also allows passing data from boot loader or other firmware to the kernel.

DT = tree with key-value properties in the nodes. Also secondary links in addition to tree links – called phandles. Values can be strings, uint32’s, and lists of uint32’s. “compatible” properties identify the devices. Details on

It doesn’t replace board-specific code in the kernel, e.g. specific device drivers. It’s not a boot architecture (BIOS), but can be used to have a generic kernel that gets its board support from the boot loader.
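
To make the data model concrete, here is a small sketch of a device tree as nested nodes with key-value properties, a phandle link, and a lookup by “compatible”. The board, the `acme,…` names and the addresses are hypothetical, and “compatible” is simplified to a single string; only the big-endian 32-bit cell encoding mirrors what is stored in a .dtb.

```python
import struct


def cells(*values):
    """Encode uint32 values as big-endian 32-bit cells."""
    return struct.pack(">%dI" % len(values), *values)


def decode_cells(blob):
    """Decode a property value back into a list of uint32s."""
    return list(struct.unpack(">%dI" % (len(blob) // 4), blob))


# Hypothetical board description, for illustration only.
tree = {
    "name": "/",
    "props": {"compatible": "acme,demo-board"},
    "children": [
        {"name": "interrupt-controller",
         "props": {"compatible": "acme,intc", "phandle": cells(1)},
         "children": []},
        {"name": "serial",
         "props": {"compatible": "acme,uart",
                   "interrupt-parent": cells(1),       # phandle link to the node above
                   "reg": cells(0x10000000, 0x100)},   # base address, size
         "children": []},
    ],
}


def walk(node):
    """Yield a node and all its descendants, depth first."""
    yield node
    for child in node["children"]:
        yield from walk(child)


def find_compatible(root, compat):
    """All nodes whose 'compatible' property equals `compat`."""
    return [n for n in walk(root) if n["props"].get("compatible") == compat]


def find_by_phandle(root, handle):
    """Follow a secondary link: return the node that owns this phandle."""
    for n in walk(root):
        raw = n["props"].get("phandle")
        if raw is not None and decode_cells(raw) == [handle]:
            return n
    return None


if __name__ == "__main__":
    uart = find_compatible(tree, "acme,uart")[0]
    parent = decode_cells(uart["props"]["interrupt-parent"])[0]
    print(uart["name"], "->", find_by_phandle(tree, parent)["name"])
```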

.dtb file can be passed like an initrd. The boot loader can get this from flash, or modify it, or generate it (but avoid that, because it means you may have to update your bootloader in the future).

Device tree does not correspond to the linux bus/device tree. This is intentional, because the linux internals may evolve more frequently.

There were already three copies of FDT in the kernel: powerpc, sparc and microblaze. This has now been factored out and cleaned up. MIPS support was added in .37. ARM support has not been mainlined yet, expect it in .39 or .40. Devtree in git:// branch test-devicetree.

For PowerPC, there is a convention about the properties (EPAPR). ARM adopts most of that.

Barebox boot loader – Robert Schwebel / Sascha Hauer, Pengutronix

In industrial and automotive environments, there are really fast boot requirements. E.g. response to CAN message within 200ms after power-on.

Typical embedded PC (800MHz x86) with pre-installed Linux: screen starts with a lot of flickering, after 30s the user interface comes up.

Barebox: fork of u-boot, maintained by Pengutronix. Monthly releases. git master branch, next branch; next merges into master just after release, then master just accumulates fixes until next release. Hardware: ARM (at91, imx, omap, …), blackfin, m68k, ppc (mpc5xxxx), x86 (bios based).

Tries to use Unix concepts. E.g. real scripting. Uses Kbuild and Kconfig. Boots from NAND, UBI, SD. Framebuffer support for splash screen. Instead of environment, it has a RAM filesystem that can also be written back. Support for modules.

Booting fast.

  1. First there is ROM boot, which reads boot block from NAND or SD. This can already take half a second… So important to choose components that don’t delay things. Select a CPU that is optimized for fastboot. To test, monitor reset line and a GPIO with an oscilloscope, toggle the GPIO in the boot code and measure the delay.
  2. Boot loader initializes hardware and fetches kernel. Initialize only the hardware you need. Tune clocks and timings to have fast access to e.g. SD. For fetching the kernel, do read/decompress asynchronously (not in barebox mainline yet; needs futures in the flash driver…).
  3. Kernel extracts itself, initializes hardware and starts userland. Booting an uncompressed image may be faster if the CPU is slow. For speeding up kernel boot, see … Sometimes initramfs is faster (or even have the entire rootfs in initramfs), sometimes a real rootfs is faster.
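
The asynchronous read/decompress idea from step 2 can be sketched as a two-stage pipeline: one thread keeps reading chunks while the other decompresses the chunks already received. This is only an illustration of the overlap in Python with zlib; the chunk size and queue depth are arbitrary, and barebox itself would do this in C inside the flash driver.

```python
import io
import queue
import threading
import zlib

CHUNK = 64 * 1024  # arbitrary read size


def load_image(stream, chunk=CHUNK):
    """Decompress a zlib stream while a second thread keeps reading it."""
    chunks = queue.Queue(maxsize=4)  # bounded, so the reader cannot run far ahead

    def reader():
        while True:
            block = stream.read(chunk)
            chunks.put(block)        # an empty block marks end of input
            if not block:
                return

    t = threading.Thread(target=reader, daemon=True)
    t.start()

    inflater = zlib.decompressobj()
    out = []
    while True:
        block = chunks.get()         # waits only if the reader is behind
        if not block:
            break
        out.append(inflater.decompress(block))
    out.append(inflater.flush())
    t.join()
    return b"".join(out)


if __name__ == "__main__":
    image = bytes(range(256)) * 4096
    packed = io.BytesIO(zlib.compress(image))
    print(load_image(packed) == image)
```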

Tuning for these speed-ups kills maintainability. You can demo a fast boot really quickly, but then making a real application turns out to be a wholly different matter. Indeed, the reason to use Linux in the first place is to use all the libraries available, so you end up needing a full system anyway.

Benchmark on imx35: 200ms from reset to start of bootloader; 330ms from start of bootloader to start of init. This is with mainline kernel and barebox configured down to a minimum, so not using too many of the tricks above.

Tips for making it fancy:

  • Tell the hardware people to leave the backlight off. In software, first display the splash screen and then turn backlight back on.
  • Make sure that framebuffer has a fixed address between bootloader and kernel, and don’t let kernel re-initialize the framebuffer.
  • Let bootsplash use overlay framebuffer. Then let the Qt application start in the background (on the primary framebuffer). When it is ready, crossfade the overlay framebuffer away.

If this still isn’t fast enough yet: Boot Time Critical Services. Set aside some memory for the service and register a poller in barebox, which allows the service to run. Hand over the memory to Linux and the Linux driver takes over the functionality in its ISR. Of course, it means you have a bare metal stack both in barebox and in Linux.

Running your own GSM+GPRS network – Harald Welte, OpenBSC

Although GSM is a publicly available standard, there is almost no research on GSM security, while TCP/IP gets a lot of scrutiny. This is because the industry is extremely closed, only 4 closed-source implementations exist. Even the companies building the phones don’t get the documentation of the baseband chips: they just get the operating system kernel from the baseband chip provider. Also only very few companies sell GSM network equipment (only femtocell has more vendors). Only operators buy equipment from them. To just set up a mini-GSM network for experimentation, you need to spend 400K minimum. Operators are like banks: they outsource everything. Very few people understand what’s really happening on the protocol level, they just know how to operate the proprietary devices.

GSM security is not just about listening to phone calls. E.g. BMW opens your car door via GSM; European Train Control System is based on GSM.

Open source implementations allow more people to learn about the protocols and experiment with them. Difficult to get started, because of the closed hardware. Also there is no literature, you just have the >1000 PDF documents that define the standard.

Look at slides for a very quick intro to the GSM protocols.

OpenBSC implements everything in the GSM network that is behind the base station itself. So you just need a base station (BTS) (starting at about 3500 euros) and a normal GSM handset to start testing. OpenBSC is implemented in C, a bit Linux-like but in userspace. There is now an operator that uses OpenBSC in 25 base stations. Suitable for small networks only. Integration with Linux Call Router (lcr) to use it as a PBX system and route calls over landlines into the POTS.

In the handset, you have a baseband processor that implements the GSM. It’s typically an ARM7 with a small RTOS or no OS. No security (like NX or even MMU), and it’s exposed over the air interface. To test, you either need to reimplement the baseband system yourself, or hack into an existing baseband system. The latter is less effort, you just need to do reverse engineering or use leaked documentation. OsmocomBB reimplements the baseband software from scratch on a TI Calypso. Reuses part of OpenBSC. It’s distributed between the actual phone (only layer 1) and the PC (does most of the work). Serial lines are multiplexed over audio jack. Currently, you can do voice calls with it. Don’t have Tx power control, cell hand-over, GPRS.

PulseAudio in the Embedded World – Arun Raghavan, Collabora

ALSA is a low-level sound API, not very application-friendly. PulseAudio adds more features. It’s a sound server. Simple API: stream object and you write audio to it (blocking). Async API: stream object, prepare data and run; callbacks that ask for more data.

PulseAudio allows per-application volume controls, and flat volume control (controlling all levels simultaneously). Allows moving streams between output devices while they play. Metadata based intelligent output selection, e.g. route VOIP to headset instead of speakers. Backend can be ALSA (or others on Mac/Win), Bluetooth, network, …. Plugins (modules) for things ranging from echo cancellation to udev hotplug.

Scheduling. ALSA is based on a “period”, which is fixed for the whole of ALSA: when a period has passed, ALSA wakes up the program to supply more data. Period can’t be made too large because it adds latency. So there will be many wake-ups. PulseAudio turns this around and wakes up on a timer to fill the buffer. The advantage is that you can configure the latency dynamically, very fine-grained.
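
The trade-off is easy to see with a bit of arithmetic. The sketch below computes latency and wake-up rate for a fixed-period scheme, and the sleep time for a simplified timer-based scheme; the sample rate, period sizes and safety margin are illustrative values of mine, and the timer model is far simpler than what PulseAudio really does.

```python
def period_schedule(rate_hz, period_frames):
    """Latency in ms and wake-ups per second for a fixed-period scheme."""
    latency_ms = 1000.0 * period_frames / rate_hz
    wakeups_per_s = rate_hz / period_frames
    return latency_ms, wakeups_per_s


def sleep_before_refill(buffered_frames, rate_hz, margin_s):
    """Timer-based scheme: how long we may sleep before the buffer needs data.

    margin_s is a safety margin so that the refill happens before underrun.
    """
    return max(0.0, buffered_frames / rate_hz - margin_s)


if __name__ == "__main__":
    for frames in (256, 1024, 4800):
        latency, wakeups = period_schedule(48000, frames)
        print(f"period {frames:5d} frames: {latency:6.2f} ms latency, "
              f"{wakeups:6.1f} wake-ups/s")
    # With one second of audio buffered, a music player can sleep for a long time.
    print(sleep_before_refill(48000, 48000, 0.25), "s until next wake-up")
```

Small periods give low latency but many wake-ups; with the timer approach the amount buffered (and therefore the sleep time) can be changed per stream at run time.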

Flash File System Benchmarks – Michael Opdenacker, Free Electrons

Many flash file systems, which one to choose?

  • jffs2: today’s standard. Long mount time, but can be reduced with an index. Compresses.
  • yaffs2: fully featured, but not in mainline yet. Fast mount time.
  • ubi/ubifs: separate management of erase blocks and wear levelling from filesystem.
  • logfs: didn’t manage to get it to work (unmount).
  • squashfs: doesn’t support bad blocks, so it breaks down even if you don’t write. So put it on top of UBI – only then you need the mtdblock layer in-between, which is bad for performance. There seem to be some patches out there to make a ubiblock device.
  • nftl: direct block device access to NAND. Don’t use.

Since the presentation at ELC-E 2008, free-electrons got a CELF sponsorship to make automated, regular tests of the different filesystems and make the tests more reliable. Debian squeeze armel rootfs. Scripts to send commands to the boards over serial. Timings use ‘time’ around the script.

Results: see the slides. Differences in boot time are not huge, except that jffs2 mount time is much larger. Read and write time for jffs2 is very bad for large partitions. Remove time (directory manipulations) is surprisingly high for yaffs2.

Conclusions: jffs2 is good for small partitions (compression!). Between ubifs and yaffs2, the jury is still out. Right now it looks like ubi is better for large partitions, but this may change.

YAFFS – Wookey / Charles Manning

YAFFS was the first NAND flash filesystem, log structured. With larger devices, the original design became inefficient, so in came YAFFS2. Checkpointing was added later to avoid long mount times. Exists in many operating systems because the code is split into a generic layer with an OS-specific layer on top. This allows the development to happen in userspace, without actual flash.

Since some serious bugs were discovered in 2008, a lot of testing was added (on simulated flash). Includes simulation of firmware update with power failure. Since then stability has improved a lot.

Recent improvements:

  • Garbage collection now works in the background instead of on every write.
  • Tuning garbage collection has become easier, by keeping track of free space (unused pages) and erased space. Only erased pages can be used immediately. Based on these, have fast GC if there is a big difference and slow GC when the difference becomes smaller.
  • Block refreshing: NAND bits degrade over time if not rewritten. Therefore, also old blocks are refreshed (as a side-effect of GC).
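
The garbage collection pacing described in the list above can be sketched as a simple decision on the gap between free and erased space. The threshold, the mode names and the assumption that “free” includes the erased pages are mine; the actual YAFFS heuristics are more elaborate.

```python
def gc_mode(free_pages, erased_pages, total_pages, fast_threshold=0.25):
    """Pick a garbage collection pace from the gap between free and erased space.

    free_pages:   unused pages, including those in blocks not yet erased
                  (assumption: erased pages are a subset of free pages).
    erased_pages: pages that can be written immediately.
    fast_threshold: hypothetical cut-off, as a fraction of the device size.
    """
    if erased_pages > free_pages:
        raise ValueError("erased pages cannot exceed free pages")
    gap = (free_pages - erased_pages) / total_pages
    if gap == 0:
        return "idle"   # everything that is free is already erased
    return "fast" if gap >= fast_threshold else "slow"


if __name__ == "__main__":
    print(gc_mode(500, 100, 1000))  # large gap: reclaim aggressively
    print(gc_mode(500, 450, 1000))  # small gap: take it easy
```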

YAFFS is 9 years old but still not mainlined. There is now sponsorship to get it mainlined.

