Tools exist for debugging/profiling individual parts of an application, but this talk is about debugging/profiling the entire application, which may consist of multiple executables or multiple machines.
Breeze (by Ellexus) is a proprietary tool for debugging scripted flows. It shows which programs are called, which files are read or written, and which network connections are opened. This is shown as a graph (dot). Typical use: compare two runs of an application, e.g. on a system where it works and on a system where it doesn’t, which is why it has a diff viewer. This takes the guesswork out of figuring out what is wrong.
Breeze also includes file I/O profiling, which helps to match the physical resources to what the application needs.
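To make the "diff of two runs" idea concrete, here is a minimal sketch in Python. It is not how Breeze works internally; it only assumes that each run has been reduced to a mapping from file path to access mode. The paths and the function name `diff_traces` are made up for illustration.

```python
def diff_traces(good, bad):
    """Compare two {path: access_mode} maps from a working and a failing run.

    Returns three sorted lists: paths touched only in the working run,
    paths touched only in the failing run, and paths touched in both
    but with a different access mode.
    """
    good_paths, bad_paths = set(good), set(bad)
    only_good = sorted(good_paths - bad_paths)
    only_bad = sorted(bad_paths - good_paths)
    changed = sorted(p for p in good_paths & bad_paths if good[p] != bad[p])
    return only_good, only_bad, changed


# Hypothetical trace summaries, invented for this example.
GOOD_RUN = {"/etc/app.conf": "r", "/home/u/.keys": "r", "/tmp/out": "w"}
BAD_RUN = {"/etc/app.conf": "r", "/home/u/.keys.bak": "r", "/tmp/out": "r"}

if __name__ == "__main__":
    only_good, only_bad, changed = diff_traces(GOOD_RUN, BAD_RUN)
    print("only in working run:", only_good)
    print("only in failing run:", only_bad)
    print("different access mode:", changed)
```

Even on this toy data the difference in which files are touched points straight at the config file that differs between the two systems.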
gdb and valgrind are powerful debugging tools, but they work at the level of individual instructions and don’t give an overview. They’re only really usable during development, not with applications running in the wild. Their presence also influences the behaviour of the program being debugged.
UNDO is a tool that takes snapshots, so you can go back in time.
Allinea DDT can debug parallel code across many machines.
ptrace (strace/ltrace): trace system calls and library calls. These tools generate lots of output, which can be filtered. They allow debugging at application level, but it is difficult to wade through all the output.
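As an illustration of filtering that output, the following Python sketch tallies system calls and picks out the failing ones from strace-style text. It assumes the common single-line form `name(args) = result ERRNO (message)` and ignores unfinished/resumed lines. The sample lines are invented for the example, not taken from a real trace.

```python
import re
from collections import Counter

# Optional leading PID, then: name(args) = result [ERRNO ...]
LINE_RE = re.compile(
    r'^(?:\d+\s+)?(?P<name>\w+)\((?P<args>.*)\)\s+=\s+'
    r'(?P<ret>-?\d+|\?)(?:\s+(?P<errno>E[A-Z0-9]+))?'
)


def summarize(lines):
    """Return (call counts, list of failed calls) for strace-style lines."""
    calls = Counter()
    failures = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # exit markers, signals, unfinished calls, ...
        calls[m['name']] += 1
        if m['errno']:
            failures.append((m['name'], m['errno'], m['args']))
    return calls, failures


# Invented sample in strace's usual output format.
SAMPLE = [
    'execve("/bin/cat", ["cat", "/opt/app/keys.conf"], 0x7ffd5c0 /* 24 vars */) = 0',
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'read(3, "\\177ELF"..., 832) = 832',
    'openat(AT_FDCWD, "/opt/app/keys.conf", O_RDONLY) = -1 ENOENT (No such file or directory)',
    '+++ exited with 1 +++',
]

if __name__ == "__main__":
    calls, failures = summarize(SAMPLE)
    print(calls.most_common())
    for name, errno, args in failures:
        print(f"FAILED {name}: {errno} ({args})")
```

Reducing the trace to "which calls failed, and on what" is often enough to turn thousands of lines into a handful worth reading.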
Breeze uses LD_PRELOAD instead of ptrace. This is more powerful than ptrace because you can actually change the behaviour of the intercepted functions. The same technique is used in e.g. fakeroot, fakechroot, faketime and pseudo (the Yocto version of fakeroot). Breeze must come before these other libraries in the preload order, but e.g. pseudo should not notice that it is not the first one.
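The ordering requirement can be sketched from the launching side. The helper below is an illustration only, not Breeze's launcher: it builds an environment in which a tracing library is placed first in `LD_PRELOAD`, ahead of whatever (e.g. a fakeroot library) was already listed. The dynamic linker accepts spaces or colons as separators in that variable. The library path and the name `preload_env` are hypothetical.

```python
import os


def preload_env(tracer_lib, base_env=None):
    """Return a copy of the environment with tracer_lib first in LD_PRELOAD.

    Libraries already listed in LD_PRELOAD are kept, in their original
    order, after the tracer. Duplicates of the tracer are dropped.
    """
    env = dict(os.environ if base_env is None else base_env)
    # ld.so allows both ':' and ' ' as separators; normalize before splitting.
    existing = env.get("LD_PRELOAD", "").replace(":", " ").split()
    chain = [tracer_lib] + [lib for lib in existing if lib != tracer_lib]
    env["LD_PRELOAD"] = ":".join(chain)
    return env


if __name__ == "__main__":
    # Hypothetical library names, for illustration only.
    env = preload_env("/opt/tracer/libtrace.so",
                      {"LD_PRELOAD": "libfakeroot.so libfaketime.so"})
    print(env["LD_PRELOAD"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` to start the flow under trace. Making the later libraries behave as if they were first is the harder part and is not covered by this sketch.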
Cloud computing and virtualization are very nice, but when something goes wrong it’s much harder to pinpoint where it goes wrong. E.g. in a cluster, the hardware will not be homogeneous because you have some older machines in parallel with some newer machines, and this may lead to the application behaving differently on different servers (e.g. exposing a race condition).
Case studies of what can go wrong when there is too much abstraction.
- ARM CAD filesystem: very heterogeneous, e.g. some directories are heavily backed up, others are extremely low-latency. The CAD flow uses third-party applications that are driven by a scripted flow developed by ARM. I/O profiling shows that the right directories are not used. E.g. /scratch is used quite heavily, while only the results that should be persistent should be stored there. This shows that abstraction doesn’t always work, because it hides the performance bottlenecks.
- Network IO: very short-lived runs of the server application. Half of the connections to the license server turn out to fail; delaying it gives a large performance gain.
- Terminal server: user couldn’t type anything. Breeze trace shows that the wrong keybindings were being loaded (due to a mistake in the user’s config files).
- Distributed applications run on different systems which may not have a lot of common factors, e.g. dedicated database server and dedicated JVM. How can you trace the behaviour of the database back to what the user is doing, or vice versa?
- What should we measure? Hard to combine different datasets, e.g. memory, IO, scheduler, … We don’t want to measure everything because that’s going to kill performance.
Solution: design for profiling. When building code, build in observability. This means the profiling framework should be chosen at the start. You can create a custom profiler for your application if you architect it in.
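What "architecting it in" could look like, as a minimal Python sketch: the application wraps its own phases in named spans, and a small in-process profiler accumulates time and call counts per span. The class name `Profiler`, the span names and the injectable clock are choices made for this example, not something presented in the talk.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Profiler:
    """Accumulates total elapsed time and call count per named span."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the profiler can be tested
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def span(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even if the wrapped code raises.
            self.totals[name] += self._clock() - start
            self.counts[name] += 1

    def report(self):
        """Return (name, count, total_seconds) rows, slowest first."""
        rows = [(n, self.counts[n], self.totals[n]) for n in self.totals]
        return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    prof = Profiler()
    with prof.span("load_input"):
        data = list(range(100_000))
    with prof.span("process"):
        total = sum(x * x for x in data)
    for name, count, seconds in prof.report():
        print(f"{name}: {count} call(s), {seconds:.6f} s")
```

Because the spans are named by the application itself, the measurements map directly onto what the code was doing, without having to reconstruct that from low-level traces afterwards.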
Big data: try to monitor everything and correlate. This would allow you to find out when drops in performance take place, and also predict resource needs.
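The simplest form of "correlate" is a correlation coefficient between two metrics sampled at the same instants. The sketch below computes the Pearson coefficient in plain Python. The metric names and sample values are invented, and a real monitoring setup would also need to align timestamps and handle missing samples.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented samples: I/O wait (%) and request latency (ms) per interval.
IO_WAIT = [2.0, 3.0, 2.5, 20.0, 22.0, 3.0]
LATENCY = [11.0, 12.0, 11.5, 48.0, 55.0, 12.5]

if __name__ == "__main__":
    print(f"correlation: {pearson(IO_WAIT, LATENCY):.3f}")
```

A coefficient close to 1 would suggest that latency spikes coincide with I/O wait, which narrows down where to look. It does not by itself establish the cause.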
Question from the audience (actually not a question): The problems mentioned here are covered already by LTTng, at least when it comes to extraction. What can certainly be added on top of LTTng is analysis.
Q: Don’t the security mechanisms prevent LD_PRELOAD from working? A: Yes, but they’ve found ways to circumvent that, and they’re not going to tell how.
Q: How can you map the things that are traced back to the code? A: Visualization is an equally important subject.