Writing the code is only half the story. To make something that actually works for the user, it has to be tested, debugged and probably optimized. That is the focus of this part of our series of articles about embedded software development using FOSS. The previous part discussed the software development process support tools suitable for embedded system development with FOSS: version control systems, issue tracking systems, documentation systems, managing the build process, and managing releases. The next and final part will discuss FOSS libraries that are relevant for many embedded devices.
- 1 Testing
- 2 Debugging
- 3 Optimization
Testing is essential, but its importance is often difficult to get your employer to understand. Start thinking about tests (including how much testing is required) before the project starts. Once coding has started, the threshold for adding automated tests is much higher.
Unfortunately, most open source software doesn’t have good testing. Even in software that prides itself on paying attention to testing (e.g. GStreamer), unit test coverage is very low and there are almost no integration tests or regression tests. Most software has no tests at all, including the Linux kernel. If you add features to such software (e.g. a driver to the kernel), the easiest approach is usually to focus on integration tests from user space. For a device driver, for instance, you can write a user space program or script that triggers various features of the device. You can also sometimes use stubbing: for a USB gadget function there is the dummy HCD driver, for an I2C device there is i2c-stub. You can also add self-tests to the driver and enter testing mode through a module parameter.
A unit test checks that a specific function or set of functions works as expected. Unit tests are a very good way to write a specification before developing: up front, you write down the expected behaviour, and afterwards you have a test that checks whether the implementation behaves as expected.
Write unit tests before writing code. It can’t be stressed enough.
A problem with unit tests, especially in embedded code, is that they should exercise a single module in isolation. You therefore often need to write stubs that fake its inputs. This ties in with the simulation issues discussed below.
Don’t try to write unit tests in a black box fashion. It is perfectly acceptable to add support for the tests directly in the code (though of course it must not be compiled into production releases). Assertions are one example, but you can also insert print statements to generate output for the test. Kprobes are another example of breaking the black box. The unit tests should also have access to the internal data structures of the code itself. There are two ways of achieving this: either export the internal data structures in the headers, or put the unit test directly in the source file.
Unit tests are inherently unstable because they are tied so tightly to the function they test. When the function is modified, it’s very likely that its output changes too, so the test must be modified as well. A small change in the function (e.g. formatting, the order in which things happen, …) can have a big effect on its actual output. If the test simply compares actual output with expected output, this small change leads to a big difference in the test result. You should then check whether all the new output is still correct, but in practice that is too much work, so you’ll tend to just copy the new output to the expected output after a partial manual verification. To avoid this problem, it is better to check properties of the output instead of its exact value. These property checks can still be expected values that depend on the input, but they are not simply expected output in a particular format. For instance, after some inputs there should be exactly that many records in the database (but you don’t check in which order), and a record with these specific fields should be present (but you don’t check all fields). Clearly, direct access to the module’s internal structures makes these property checks a lot easier. Finally, it is convenient to combine the two approaches, because the difference between actual output and previously expected output usually points you directly to where you introduced a bug.
There can be bugs in the test code itself, so it also must be tested. A minimal test is that the test should fail when you haven’t implemented the function it tests yet, or when you implemented it in the wrong way. Usually, it isn’t necessary to include the testing of test code in the automated tests, but it can be useful for very complex test code to also have some tests that are expected to fail.
Try to plan an extra week of testing for every 1-2 man-months. That allows you to add the unit tests that you didn’t add while developing because you were in a hurry to get some feature implemented.
Some people say you should never make a commit with failing unit tests, but that’s simply wrong: it demotivates committing and writing tests. Instead, you should never try to go to integration with modules whose unit tests fail.
Code coverage is a way to evaluate the quality of the unit test suite. As an absolute figure, it doesn’t say much: it tells you which lines of code have never been tested, but not how well the other lines are tested. One rule of thumb, however, is that when you add a feature, at least some of the lines you added should be covered by a test. gcov is usually sufficient for your code coverage needs. (K)CacheGrind can also be helpful.
Integration testing (also known as functional testing or system testing, although these terms have slightly different connotations that we gloss over here) validates that all the modules combined exhibit the expected behaviour. The goal is to detect problems in the interaction of modules or functions. Integration testing is usually done in a big bang approach: everything is thrown together and tested as a whole.
Although integration testing is inherently black box (because you want to test the whole thing), it is usually necessary to break the abstraction to facilitate automatic testing. For instance, you may want to start directly in a state that is only reached after a year of operation. Or you may need to jump forward in time, or to generate an error condition that doesn’t easily occur naturally (fault injection). And of course, for integration tests you’ll still need some stubbing too.
Some of the integration tests will always be manual. Indeed, because the automatic tests are error-prone as well, you still need some human brains to interpret the system’s behaviour. For instance, you need to watch some TV to check that the quality is OK. Also, creating automatic integration tests can be so complicated that it just doesn’t pay off. Finally, you need a human to invent new strange things that can be done to the system outside of its normal specification. Still, for all of these manual tests, it is useful to formalize them to some extent: which kinds of things should be tried (and which shouldn’t, because they’re already tested automatically), what you expect the system to do, and also how much time should be spent on the testing.
Some specific types of integration tests:
- Error handling test: what happens in error conditions? E.g. out-of-memory, disk full, disk/flash failure, network unreachable, power failure. Also think of combinations of errors.
- Stress testing: what happens when the system has to process too much? E.g. many parallel requests on the web interface.
- Deployment testing: do upgrades work properly? Is the image written properly to flash?
- Soak test (= duration test): what happens when the system stays in operation (under normal load) for a long time? This is particularly relevant for embedded systems, which typically have long uptimes. Such a test inherently takes a long time, but you can detect problems earlier by:
  - putting the system under a high load;
  - putting the system under a varying load, since many problems are triggered by starting/stopping components;
  - monitoring the system closely, to detect changes in memory or disk usage or performance and to detect unusual messages in the log files.
- Monkey test: randomly generate inputs and see if the system survives.
Integration tests are not written up front like unit tests. The expected functionality can be specified in natural language, but it is usually too vague to write down as expected output and will probably still evolve during development anyway. Still, integration tests can be developed in parallel with the software (e.g. by a test team). This facilitates continuous integration.
Regression tests make sure that once you solve a bug, it doesn’t return later. The idea is that for each bug, you create a test (or a set of tests) that fails while the bug is there, and no longer fails when the bug is fixed. The regression tests can be at the unit level or at the integration level – often for a single bug, you’ll have one regression test at integration level, plus a unit test for each module involved.
Since there are one or more regression tests for every bug reported (and often also a few added by the developer himself because he discovered problems during development), the regression test suite can be very large. Therefore, it is not convenient to include the regression tests in every build during development. Instead, the regression tests are run automatically every day or week, and/or manually during preparation of a release.
In practice, the regression test suite can usually be created after the first release, by splitting the existing test cases into a basic build test and the regression tests.
Since regression tests are created based on bug reports, and since they are not executed all the time, it is usually more convenient to make them black box even if they are unit tests.
It’s usually a good idea to include non-functional tests in the regression test suite. Examples: size of the executable or image, performance of some critical parts, memory usage (which can help to detect memory leaks). These don’t result in a direct pass or fail, but in a set of numbers whose evolution you can track during development.
Regression tests don’t necessarily need to succeed for a release to be made. There can be bugs which are not fixed yet, or where the fundamental fix would take too much work. There can also be mistakes in the test itself (which are not immediately discovered because the regression tests aren’t run all the time). Also, there can be a regression in the non-functional tests which is acceptable (e.g. code size grows).
Static analysis tools help avoid bugs while writing the code. The compiler itself is your first static analysis tool: it detects suspicious constructs, e.g. calling a function with the wrong type of parameter, or an assignment inside an if statement. lint is a textual analysis tool that has somehow lost traction. The kernel has scripts/checkpatch.pl, which (like lint) checks whether the coding guidelines are satisfied. Use checkpatch.pl. Coccinelle (coccinelle is French for ladybug, an insect that eats bugs) is a promising static analysis tool that detects common bug patterns (but it currently only works on the kernel).
The acceptance test is a specific type of integration test. The acceptance test specification should not only contain the inputs and the expected behaviour, but also the environment. For instance, when developing a driver for a wireless chip, the acceptance test could start with no active base station in the vicinity, and continue with the base station being turned on, at which point the device should automatically connect to it.
Built-in self test
Built-in self tests verify that the hardware is still working properly. They are used to detect defects due to ageing or damage. The tests can be run at boot time, at regular intervals, and/or when specifically requested. When a test fails, this is reported in some way and usually part of the system’s functionality is turned off.
Built-in self tests are rarely used to test the software itself, since that is expected not to change (so the BIST would always give the same result). Still, there can be tests of data integrity. Failure of such a test can indicate a bug in the software, or that the data is corrupt due to some hardware error in storage or communication, or that there is a problem with another device on which this one relies.
In embedded systems, the development environment is usually not the same as the target system. Even if the target system is an embedded PC, it will usually have specific peripherals or attached devices. That means the software can really only be tested on the target system. That’s not very practical, because the target system is usually slow and resource-constrained, because it has fewer debugging tools available, and because cross-compiling and copying to the target system take more time.
Therefore, for unit testing and for some integration tests, you should use some form of simulation. Three forms of simulation exist. An instruction-set simulator simulates each instruction of the compiled software. A high-level simulator or emulator (e.g. VirtualBox or QEMU) executes x86 instructions but simulates the environment (peripherals, memory model, …). A stubbed simulator is custom-built for the application and replaces all target-platform-specific parts with stub code. Even the operating system can be stubbed out in this way. Of these, only the emulator and the stubbed simulator are really useful for unit tests. The emulator, however, is only available for x86 and ARM, and only a limited number of peripherals are available without additional stubbing.
In a stubbed simulator, you replace all the platform-specific APIs with a PC implementation. Some APIs can be stubbed generically, but very often you need to implement something specific for your test. For instance, when you stub a camera, you have to replace it with something that reads raw data from a file in the same format as the camera would provide. A stubbed simulator allows you to test your code before you even have the target platform available. It also gives you access to profiling and debugging tools that may not be available on the target platform.
As always with stubbing, the stubbed simulator replaces some code with stubs, and the replaced code is therefore not tested. To work around this, there should always be some formal/automated testing on the target platform too, and the code that is not covered should be evaluated carefully with code inspection.
Debugging an embedded system is much more difficult than debugging PC software. Therefore, as much of the debugging work as possible should be done on the PC, with the stubbed simulator or the high-level simulator.
Your basic tool is gdb. It has a gdbserver component, which allows you to run most of the debugger on a PC and set breakpoints, read registers and suchlike on the target platform via JTAG or serial. The target platform just needs some gdb stubs and a breakpoint interrupt handler, which is less than 1K of assembly. gdbserver also allows you to keep the debugging information on the PC (in an unstripped binary file) while the target platform has a stripped binary file. If you must have a graphical user interface to the debugger, you have the choice of ddd, eclipse, kdbg, kdevelop, insight, qdevelop, xxgdb, monodevelop and nemiver. All of these use gdb as a backend, so they also support gdbserver.
For debugging memory problems, valgrind is your friend. It only works on the PC (though there is some work on ARM support too). It performs fine-grained tracing of a program by instrumenting the binary and all shared libraries. The most often used tool is memcheck, which detects memory access violations (out-of-bounds references, references after free, double frees, …) and memory leaks. Another useful tool is massif, which tracks the evolution of dynamically allocated memory over time. This is useful to detect memory that is not really leaked, but is still referenced while no longer needed. With valgrind it is also relatively easy to implement new tools based on binary instrumentation.
strace is a very good debugging tool for exploring the interface between userspace and kernel space. It allows you to see where a process is stuck (e.g. waiting for input on a certain file descriptor), which files it opens, in which order it does locking (though that’s obfuscated by futexes). strace doesn’t work on all platforms.
For debugging network problems, you can use tcpdump or wireshark to get detailed information about the network traffic. Just attach a PC and the target platform to a hub (not a switch), and you can see all traffic. Hubs are limited to 10 Mbit/s; if you need more, you’ll need a PC with two NICs configured as a bridge, or a direct connection between PC and target, or you can run tcpdump on the target itself. You can log to a file using ‘tcpdump -w’ and analyze it on your PC with wireshark. At a slightly higher level, a simple ifconfig already tells you how much traffic has passed over a NIC, and cat /proc/interrupts tells you how busy the NIC has been. This can give you a good indication of where to look further.
The best debugger around, however, is called printf. Tracing information shows you what the program is doing over time. You can also dump the contents of data structures in a much easier-to-parse format than what comes out of a debugger. You can add timestamps to the messages, so you have a better idea of the order in which things happen. Also, in multithreaded programs it is less intrusive than inserting a breakpoint in one of the threads. And finally, you can of course combine it very well with the debugger. Therefore, it is a good idea to insert tracing statements immediately while coding. On the target platform, you can send the output to the serial port or even toggle some LEDs to perform tracing.
For debugging the kernel, consult Documentation/BUG-HUNTING. There are patches to attach a debugger to a running kernel, but you really don’t want to go down that path.
Before you start optimizing, you should first measure what needs to be optimized. Otherwise, you end up optimizing the wrong thing. Because of this, it also doesn’t make sense to start optimizing too early: the code is still going to change, so what seems important now may not be important in the whole application.
For speed optimizations, you need profiling tools. The first profiling tool is time, to measure how long an execution takes. strace -tt tells you how much time is spent in system calls. oprofile uses hardware counters to report the performance of the whole system. iotop can help to evaluate the performance of I/O to disks and such. wireshark or tcpdump can help evaluate network bottlenecks. stap (SystemTap) gives you control over what you want to trace in one or more programs. gprof gives simple information about how often functions are called, but KCacheGrind (the user interface to valgrind’s callgrind and cachegrind tools) is a much more interesting fine-grained profiling tool. And finally, timed printfs in your source code are also a nice way to see how much time is spent on what. In the kernel, you can use CONFIG_PRINTK_TIME to attach timing information to the printks, and of course you can use kprobes. Note that for the low-hanging fruit, it is sufficient to do profiling on the PC: the relative amount of time spent in different parts of the code does not depend that much on the platform.
For data memory optimizations, the tool of choice is valgrind’s massif. It traces malloc’ed memory over time and gives backtraces of the largest blocks. It can also trace stack depth, but that slows the program down to a crawl. For stack depth, a hack you can use is to limit the stack size using ulimit or pthread_attr_setstacksize: a stack overflow will then cause a segmentation fault, at which point you can use the debugger to find the place that caused it.
For code size optimizations, you can obviously just look at the size of the program. To get more detail, look at the size of the object files, or extract a memory map by adding ‘-Wl,-Map,&lt;filename&gt;’ to the link flags. The memory map gives you the size of each function and global data object. nm on large object files also works. To reduce code size, there is the -Os option of gcc (but sometimes -O3 results in smaller code), and of course strip (removes debug symbols) or sstrip (removes a bit more). Big gains can be made with ‘--combine -fwhole-program’, but this is only applicable to small programs and not to shared libraries. There are also platform-specific compiler switches that can reduce code size (e.g. -mthumb on ARM). For C++, finally, you can use -fno-rtti if you don’t need dynamic casts, and -fno-exceptions if you don’t need exception handling. In C++ you should also be careful with templates: although templates are normally beneficial for code size because the code is generic, they may lead to several instantiations of the templated function. Qt, for instance, takes special care by explicitly instantiating some functions and avoiding templates altogether for others.
The size of the image can be reduced by reducing the size of the individual packages. Compiler options and such are not really the way to go here (except maybe a global -Os), because you’d have to evaluate them for each individual package. It makes a lot more sense to strip unneeded components out of the individual packages. They usually have configuration options to select which features are enabled. Disabling features may also remove dependencies, which means the package management information should be updated to remove those dependencies too. Sometimes the default packaging keeps debug symbols, documentation, etc. lying around, so it may be worthwhile to strip these as well.