Software crash analysis (Yves Martens, Wim Decroix)

Goal: make sure crashes don’t happen. That means analysing crashes, but spending less time per crash.

Split application architecture (SPACE): applications are isolated in dedicated processes. Resources are managed explicitly.

Software size and the number of crashes go hand in hand, so growing software size means more time spent on crash analysis. The main problem: when there is a crash (in QA testing or in the field), it takes a long time before you even find out what is causing it and can start fixing it. The desired flow is that after a crash you can immediately identify the root cause and its owner, without spending time on reproducing it.

  • Detect crashes in software
  • Dump as much info as possible (no reproducing necessary)
  • Analyse and visualise
  • Improve tooling for this process

Software crashes

  • SIGSEGV: install a signal handler that makes a dump: the stacktrace of the crashing thread, written to flash. An analysis tool translates the backtrace into user-readable data. TPV has a proprietary solution because nothing suitable was available a long time ago; it takes the backtrace in the kernel, which allows viewing the kernel stack as well and even backtracing other processes. Based on the backtrace, a QA person can file the bug with the right person for the affected subsystem. (A minimal user-space sketch of such a handler follows this list.)
  • Watchdog: detects unresponsive worker threads. In SPACE all tasks must be finished within a certain amount of time (this is part of the resource allocation). If a task takes too much time, this counts as a crash and the TV is restarted: a signal is raised to that thread, which is then handled just like SIGSEGV. (See the watchdog sketch after this list.)
  • The watchdog can also detect dead- or livelocks in worker threads. In this case all threads are backtraced from kernel space, because global information is needed to find out who the culprit is. Traces are kept in the kernel ring buffer. Visualization is done with timedoctor (an sf.net project).
  • CPU overload: a task times out because other tasks are eating all the CPU; cfr. the deadlock case. Visualization is essential to identify such patterns.
  • Out of memory: the OOM killer would also kill the forensic evidence, but for embedded systems the OOM condition would be fatal anyway. Therefore, heuristically poll the free memory and crash before going OOM. Dump kpagemap to get an idea of who is consuming (physical) memory; a proprietary tool parses it. Shared libraries are accounted separately. (A sketch of the free-memory poll follows this list.)
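
A rough idea of the SIGSEGV handling can be sketched in user space with glibc’s backtrace(). This is only an illustration of the general technique, not TPV’s proprietary kernel-side backtracer; the dump file path and the choice of signals are made up for the example.

    #include <execinfo.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void crash_handler(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);

        /* Write the raw backtrace to persistent storage ("flash" in the
         * talk); an offline tool can translate it to symbols later.
         * backtrace_symbols_fd() does not call malloc(), which matters
         * inside a signal handler. Path is hypothetical. */
        int fd = open("/tmp/backtrace.dump", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd >= 0) {
            backtrace_symbols_fd(frames, n, fd);
            close(fd);
        }

        /* Restore the default action and re-raise so the process still dies. */
        signal(sig, SIG_DFL);
        raise(sig);
    }

    int main(void)
    {
        /* backtrace() may allocate on first use (it loads libgcc), so warm
         * it up once before any crash can happen. */
        void *warmup[1];
        backtrace(warmup, 1);

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = crash_handler;
        sigaction(SIGSEGV, &sa, NULL);
        sigaction(SIGABRT, &sa, NULL);

        /* ... application code ... */
        return 0;
    }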
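The watchdog itself can be as simple as a monitor thread checking per-worker heartbeats. The sketch below assumes each worker stamps a heartbeat after finishing a task; the struct names, the number of workers and the 5-second budget are invented for the example, since the real budgets come from SPACE’s resource allocation.

    #include <pthread.h>
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    #define TASK_BUDGET_SEC 5   /* invented budget; SPACE defines the real one */
    #define NUM_WORKERS     4

    struct worker {
        pthread_t tid;
        volatile time_t last_heartbeat;   /* updated by the worker after each task */
    };

    static struct worker workers[NUM_WORKERS];

    static void *watchdog_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            time_t now = time(NULL);
            for (int i = 0; i < NUM_WORKERS; i++) {
                if (now - workers[i].last_heartbeat > TASK_BUDGET_SEC) {
                    /* Treat the stuck worker like a crash: the signal lands
                     * in the same handler as SIGSEGV, a backtrace is dumped,
                     * and the system restarts. */
                    pthread_kill(workers[i].tid, SIGABRT);
                }
            }
            sleep(1);
        }
        return NULL;
    }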
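The free-memory poll can be approximated by reading /proc/meminfo, as sketched below. The 16 MB threshold and the one-second interval are arbitrary values for the example, and the kpagemap dump itself is not shown since the parsing tool is proprietary.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define MEMFREE_THRESHOLD_KB (16 * 1024)   /* invented threshold */

    /* Read the MemFree value (in kB) from /proc/meminfo. */
    static long read_memfree_kb(void)
    {
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128];
        long kb = -1;

        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            if (strncmp(line, "MemFree:", 8) == 0) {
                sscanf(line + 8, "%ld", &kb);
                break;
            }
        }
        fclose(f);
        return kb;
    }

    /* Run this in a background thread: crash deliberately while there is
     * still enough memory left to write the forensic dump (e.g. kpagemap). */
    static void *oom_monitor(void *arg)
    {
        (void)arg;
        for (;;) {
            long free_kb = read_memfree_kb();
            if (free_kb >= 0 && free_kb < MEMFREE_THRESHOLD_KB)
                abort();
            sleep(1);
        }
        return NULL;
    }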

The crash dumps are also collected in the field, and serious crashes are reported back to the developers.

 
