Testing Kernel GFX Drivers – Daniel Vetter

Daniel works for Intel and has worked on the i915 driver for a couple of years. i915 was the number 1 contributor to kernel regressions over the last dozen releases. This had to be improved, so tests were needed. Even the worst possible hacked-together test suite is better than no test suite at all. They had to reinvent the wheel to some extent, because with graphics you usually have to look at the screen to see whether it works OK.

Daniel uses a rolling -next branch. Every two weeks a tag is sent off to QA; if that looks OK, it is passed upstream for integration in DRM -next.

Tests are now integral to the process. There are extensive userspace interface tests, and test cases are also added during patch review. In addition, tests are added in response to bugs, so it is not test-driven development but exactly the reverse: code something, see where it goes wrong, and write tests to cover those cases.

The test infrastructure uses piglit as a test runner. piglit has a lot of GL tests (coming from Mesa) and wraps a lot of other tests (like X). An important feature of piglit is that it can skip tests, which matters because some tests cannot possibly succeed on some hardware. To find errors, the dmesg output is used extensively, so the source needs to be cleaned up to make sure the important messages are WARN and the rest are INFO or DEBUG.

The test framework itself is homegrown (igt). The test cases are mostly binaries that are run and that exit with SUCCESS, SKIP or FAIL. It supports subtests: the program is first run once to enumerate the list of subtests, and is then executed separately for each subtest. The test statements (igt_require, igt_skip_on, igt_assert) use setjmp/longjmp to manage control flow, which means that stack variables can get clobbered, so debugging is a bit tricky. It also has fork helpers to create a bunch of children, capturing their failures and propagating them upward. igt_fork looks like an OpenMP loop: a block of code is executed in parallel in N instances. There are also exit handlers to do cleanup after every test, which makes the core test code easier to read because it is not cluttered by cleanup code. Similarly, there is setup boilerplate for things like option parsing, sysfs handling, and so on.
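As a rough illustration of how require/assert statements can steer control flow with setjmp/longjmp, here is a minimal sketch. It is not igt's implementation: the names `sketch_require`, `sketch_assert`, `run_test` and the result codes are hypothetical and were chosen only for this example.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical result codes; igt's real exit codes may differ. */
enum test_result { TEST_SUCCESS = 0, TEST_SKIP = 1, TEST_FAIL = 2 };

/* Jump target armed by run_test() before the test body starts. */
static jmp_buf test_env;

/* Leave the current test as "skipped" unless the condition holds. */
#define sketch_require(cond) \
    do { if (!(cond)) longjmp(test_env, TEST_SKIP); } while (0)

/* Leave the current test as "failed" unless the condition holds. */
#define sketch_assert(cond) \
    do { if (!(cond)) longjmp(test_env, TEST_FAIL); } while (0)

/* Run one test body and turn a longjmp back into a result code. */
static enum test_result run_test(void (*body)(void))
{
    switch (setjmp(test_env)) {
    case 0:                 /* direct return from setjmp: run the body */
        body();
        return TEST_SUCCESS;
    case TEST_SKIP:         /* reached through sketch_require() */
        return TEST_SKIP;
    default:                /* reached through sketch_assert() */
        return TEST_FAIL;
    }
}

/* Three toy test bodies, one per outcome. */
static void passing_test(void)
{
    sketch_require(1);
    sketch_assert(2 + 2 == 4);
}

static void skipping_test(void)
{
    sketch_require(0);      /* e.g. the hardware lacks a feature */
    sketch_assert(0);       /* never reached */
}

static void failing_test(void)
{
    sketch_require(1);
    sketch_assert(1 == 2);
}
```

Calling `run_test(skipping_test)` returns `TEST_SKIP` without ever evaluating the assertion after the failed requirement. The debugging caveat follows from the C rules for setjmp: non-volatile locals that are modified between the setjmp call and the longjmp have indeterminate values afterwards.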

Infrastructure was also added inside the kernel to facilitate testing. For instance, the modeset state checker (or output routing state query) checks against the hardware whether what is really configured corresponds to what the software thinks is configured. The modeset state checker has uncovered a bunch of bugs, and every time more state is added to the checker, new things pop up.
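The idea of a state checker can be sketched as a field-by-field comparison between the state the software tracks and the state read back from the hardware. Everything below is hypothetical: `struct pipe_state`, its fields and `check_pipe_state` are made up for illustration and do not correspond to the i915 code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical subset of the state of one display pipe. */
struct pipe_state {
    int enabled;
    int width;
    int height;
};

/* Compare one field, report a mismatch, return 1 if it differs. */
static int check_field(const char *name, int sw, int hw)
{
    if (sw == hw)
        return 0;
    fprintf(stderr, "state mismatch in %s: software %d, hardware %d\n",
            name, sw, hw);
    return 1;
}

/*
 * Compare the software view (sw) with the values read back from the
 * hardware (hw). Returns the number of mismatching fields.
 */
static int check_pipe_state(const struct pipe_state *sw,
                            const struct pipe_state *hw)
{
    int mismatches = 0;

    mismatches += check_field("enabled", sw->enabled, hw->enabled);
    mismatches += check_field("width", sw->width, hw->width);
    mismatches += check_field("height", sw->height, hw->height);
    return mismatches;
}
```

Each field added to the comparison widens what the checker can catch, which fits the observation that extending the checker keeps turning up new problems.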

For testing the graphics stack, some special techniques are needed. For instance, signals are used quite heavily to continuously interrupt the test thread, which exposes scheduler hangs. For testing slowpaths, the test will supply its own buffer that is known not to be paged in, which will expose the slowpath; in addition, prefaulting has to be disabled.
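The signal technique can be sketched in userspace as follows. This is only an illustration under assumptions: `nanosleep` stands in for a blocking driver call, and the helper names `start_interrupts`, `stop_interrupts` and `interrupted_sleep` are hypothetical. A periodic SIGALRM is installed without SA_RESTART, so the blocking call keeps returning EINTR and has to be restarted by the caller.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* Number of signals delivered so far. */
static volatile sig_atomic_t signals_seen;

static void on_alarm(int sig)
{
    (void)sig;
    signals_seen++;
}

/* Deliver SIGALRM every interval_us microseconds (must be < 1000000). */
static int start_interrupts(long interval_us)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;   /* no SA_RESTART: calls fail with EINTR */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return -1;

    memset(&it, 0, sizeof(it));
    it.it_interval.tv_usec = interval_us;
    it.it_value.tv_usec = interval_us;
    return setitimer(ITIMER_REAL, &it, NULL);
}

/* Disarm the periodic timer. */
static int stop_interrupts(void)
{
    struct itimerval it;

    memset(&it, 0, sizeof(it));
    return setitimer(ITIMER_REAL, &it, NULL);
}

/*
 * Sleep for total_ms milliseconds, restarting the call after every
 * interruption. Returns how many times the call was interrupted.
 */
static int interrupted_sleep(long total_ms)
{
    struct timespec req, rem;
    int restarts = 0;

    req.tv_sec = total_ms / 1000;
    req.tv_nsec = (total_ms % 1000) * 1000000L;
    while (nanosleep(&req, &rem) == -1 && errno == EINTR) {
        req = rem;              /* continue with the remaining time */
        restarts++;
    }
    return restarts;
}
```

With a 10 ms timer and a 100 ms sleep, the call is interrupted several times before it completes, so the restart path is exercised on every run instead of only by chance.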

To test the output, the hardware has CRCs of the frame data which can be verified in the tests. This is exposed through sysfs, but that is actually quite complex because, depending on the pipe setup, the CRC has to be read from a different place. The problem with CRCs is that they cover the entire frame, so you need a reference frame to compute a reference CRC. Alternatively, it is possible to create a reference frame and then re-create it in a slightly different way, e.g. with the mouse cursor moved one pixel. If the CRCs are still the same, then there is an off-by-one error in the kernel. With the CRCs, a number of tests have been added for things that are known to be broken, so that they can be proven to be fixed once the fixes land.
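The cursor comparison can be illustrated with a software stand-in. The sketch below assumes an 8-bit, 64x48 frame and uses a plain CRC-32 in place of the hardware CRC, whose actual algorithm is not described here; `crc32_buf` and `frame_crc_with_cursor` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_W 64
#define FRAME_H 48
#define CURSOR_SIZE 4

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), no lookup table. */
static uint32_t crc32_buf(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/*
 * Render a black frame with a white square cursor whose top-left
 * corner is at (x, y), then return the CRC of the whole frame.
 * Cursor pixels that fall outside the frame are clipped.
 */
static uint32_t frame_crc_with_cursor(int x, int y)
{
    static uint8_t frame[FRAME_W * FRAME_H];

    memset(frame, 0, sizeof(frame));
    for (int dy = 0; dy < CURSOR_SIZE; dy++) {
        for (int dx = 0; dx < CURSOR_SIZE; dx++) {
            int px = x + dx, py = y + dy;

            if (px >= 0 && px < FRAME_W && py >= 0 && py < FRAME_H)
                frame[py * FRAME_W + px] = 0xFF;
        }
    }
    return crc32_buf(frame, sizeof(frame));
}
```

A test in this style renders the frame twice with the cursor one pixel apart and expects the two CRCs to differ; identical CRCs would point at the move not reaching the output.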

GPU hangman fakes GPU hangs by no longer submitting commands. The kernel should then detect that the GPU has hung and reset it. The tests that were added exposed a lot of deadlocks, which are now fixed. Currently, the driver detects hangs pretty well, and unless you check the dmesg you probably won't notice anything. GPU hangs are sometimes caused by kernel bugs but often by userspace bugs; the dmesg explains to the user how to file a bug report at freedesktop.

For the future, Daniel would like to inject things like EDIDs so that support for different types of displays can be tested without having the actual screen. Currently, a lot of code paths are completely untested because of a lack of hardware. The same goes for multipipe configurations, where injection would avoid the need to attach 3-4 displays.

All the testing is pretty much Intel-only. Ideally, the other DRM drivers would reuse and extend the test suite. In particular, all the igt infrastructure is certainly reusable.

Writing test cases shows that you really understand the bug. It has happened to Daniel that he thought he understood a bug and how to fix it, then wrote a test, and it turned out to work differently from what he thought.
