The Linux Trace Toolkit is a long-established tool; this talk shows how to use it for monitoring rather than for debugging/profiling.
With tracing, events can be enabled and disabled at runtime, which keeps the overhead very low. Many tracing facilities exist in Linux: ptrace, SystemTap, …
What changed in LTTng 2.x: the tracers were unified, a userspace tracer was added (which requires adding tracepoints to the application), trace output was put in a unified format (the Common Trace Format, CTF), overhead was reduced, and it now ships in multiple distros.
Infrastructure:
- Two tracers: lttng-modules (kernel tracing) and lttng-ust (userspace tracing).
- lttng-tools: the CLI for control, sessiond for registering tracing sessions, consumerd for capturing trace data, and relayd for sending it over the network.
- Viewers: babeltrace, …
LTTng-UST: instrument the application with static tracepoints and dynamically link it with the lttng-ust library. This sets up a socket to sessiond, and you can start a session from the command line. When no sessiond is found, the tracepoints do nothing. Trace data is communicated via shared memory, not the socket, so if you kill the application you don’t lose the last traces. sessiond has to run as root to be able to trace the kernel; the CLI has to be in the lttng group to be allowed to control sessiond. LTTng uses URCU (userspace RCU) for synchronization, so the supported architectures are limited to those supported by URCU; LTTng itself doesn’t have anything architecture-specific.
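As a rough illustration of static instrumentation with lttng-ust, a minimal tracepoint provider plus application might look like the sketch below. The provider name `hello_app`, event name `request_done`, and file name `hello_tp.h` are made up for this example; the macro layout follows the lttng-ust documentation.

```c
/* hello_tp.h — hypothetical tracepoint provider header */
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER hello_app

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./hello_tp.h"

#if !defined(HELLO_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define HELLO_TP_H

#include <lttng/tracepoint.h>

TRACEPOINT_EVENT(
    hello_app,                 /* provider name */
    request_done,              /* event name */
    TP_ARGS(int, status, const char *, url),
    TP_FIELDS(
        ctf_integer(int, status, status)
        ctf_string(url, url)
    )
)

#endif /* HELLO_TP_H */

#include <lttng/tracepoint-event.h>

/* app.c — instrumented application */
#define TRACEPOINT_DEFINE
#include "hello_tp.h"

int main(void)
{
    /* A no-op unless a session has enabled hello_app:request_done. */
    tracepoint(hello_app, request_done, 200, "/index.html");
    return 0;
}
```

Building would link against the tracer, e.g. `cc app.c -o app -llttng-ust -ldl`; without a running sessiond the tracepoint call costs almost nothing.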
Snapshots store the current trace buffers to disk or to the network, which allows you to observe the evolution over time. The buffers are ring buffers of configurable size, so even within a single snapshot you have a historical view.
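On the command line, the snapshot workflow might look like this (a sketch; the session name `demo` is illustrative):

```shell
lttng create demo --snapshot   # ring buffers only; nothing hits disk yet
lttng enable-event -u -a       # enable all userspace events
lttng start
# ... later, when something interesting happens:
lttng snapshot record          # dump the current ring-buffer contents
```

Until `snapshot record` is issued, tracing only overwrites the in-memory ring buffers, which is what keeps the overhead and data volume low.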
You can register a core dump handler that takes a snapshot instead of dumping core. This gives you a lot more information, and it can be saved over the network: you get information not just about the application but also about the kernel, and you get history rather than a single point in time.
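The talk didn’t detail the mechanism. One way to sketch it is the kernel’s `core_pattern` pipe support, where a helper records a snapshot whenever a process crashes; the helper path and session name here are hypothetical:

```shell
# Pipe core dumps to a helper instead of writing a core file:
echo '|/usr/local/bin/snapshot-on-crash %p' > /proc/sys/kernel/core_pattern

# /usr/local/bin/snapshot-on-crash (hypothetical helper script):
#!/bin/sh
lttng snapshot record -s demo   # grab the ring buffers of the session
cat > /dev/null                 # drain the core image fed on stdin
```

Since UST buffers live in shared memory, the crashed process’s last events are still available to the snapshot.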
You can trigger a snapshot on an alert from Nagios or Splunk, so that you can observe what’s really happening.
Live mode: analyse data without writing it to disk. Viewers can attach to relayd instead of reading a generated file, which allows you to take action immediately. Because traces can be streamed over the network, you can do analysis at the cluster level. This can be used e.g. for load balancing, or to create alerts from a log manager, …
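A live-streaming setup could be sketched as follows; the hostnames are placeholders, and the lttng-live URL follows babeltrace’s `net://HOST/host/TARGET/SESSION` convention:

```shell
# On the collection host:
lttng-relayd --daemonize

# On the traced host:
lttng create live-demo --live --set-url net://collector.example.com
lttng enable-event -u -a
lttng start

# Attach a viewer to relayd instead of reading generated files:
babeltrace --input-format=lttng-live \
    net://collector.example.com/host/tracedhost/live-demo
```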
Create a session in live mode that saves data every second. Trace a set of specific syscalls, and record the relevant context, e.g. PID and perf counters. You get a top-like interface, but one where you can go back in time and look at the states a process has gone through.
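Setting up such a session might look like this; the syscall selection and perf context name are illustrative, and context names vary between LTTng versions (the live timer is in microseconds, so 1000000 ≈ every second):

```shell
lttng create top-demo --live 1000000 --set-url net://collector.example.com
lttng enable-event -k --syscall open,close,read,write
lttng add-context -k -t pid
lttng add-context -k -t perf:cpu:cycles
lttng start
```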
Work is ongoing on Python bindings for extracting trace events, so you can easily write your own analyses.
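An analysis script could then look roughly like this sketch, based on the babeltrace Python API; the trace path is a placeholder:

```python
import babeltrace

col = babeltrace.TraceCollection()
col.add_trace('/path/to/trace/kernel', 'ctf')  # placeholder path

# Count occurrences of each syscall event
counts = {}
for event in col.events:  # iterates events in timestamp order
    if event.name.startswith('sys_'):
        counts[event.name] = counts.get(event.name, 0) + 1

for name, n in sorted(counts.items(), key=lambda kv: -kv[1])[:10]:
    print(name, n)
```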
No details were given.
With 100 snapshots taken every 30 seconds, 700 MB were collected; the equivalent strace output was 6 GB. Performance with and without tracing was compared by running database requests and measuring how many requests could still be served while tracing. Up to 64 threads, there was no visible effect on performance. When tracing writes to the same disk as the database, there is a serious overhead, of the same order of magnitude as strace. So it’s better to use live recording, so that tracing needs no disk access.
- Hardware tracing
- Trace triggers: take custom action
- Android port of UST – bionic is the limiting factor here
- Dynamic instrumentation: add tracepoints to an existing application, so you don’t need to add them to the source code.
Can you start tracing from boot? Not really, because you need a sessiond. For kernel tracing, ftrace is usually good enough, but you could write a kernel module that replaces sessiond.
[Other answers are included in the text above.]