This talk has a lot of information in the slides, so refer to them.
The ftrace infrastructure really thinks outside the box, and when you do that, you have to think very carefully about how it affects corner cases.
ftrace works by compiling the kernel with -pg, which adds a call to mcount at the beginning of each function. mcount is a weak symbol in libgcc, which the kernel overrides. The call is trampoline-like: no register spilling, just a jump. By default, mcount just returns, but even that already causes about 13% overhead. So the mcount calls are replaced by NOPs: recordmcount reads the object files at build time, finds the mcount call sites in the relocation tables, and adds a new section (mcount_loc) with those addresses. The linker links all those sections together, and at boot the recorded locations are patched to NOPs. The mcount_loc section is then thrown away.
To enable tracing you need those addresses again, but the mcount_loc section alone doesn't carry enough information. So instead there is a struct dyn_ftrace holding all of it: some generic info plus an architecture-specific substructure. Before mcount_loc is deleted, its contents are copied, sorted by address, filled into dyn_ftrace records, and saved in the ftrace_pages variable. This is what backs the control files in /sys/kernel/debug/tracing.
When tracing is turned on, the functions that match the filter get their NOP replaced by a call to ftrace_caller. Live changing of code has to be done carefully, especially on SMP systems: a single instruction may cross a cache-line boundary, so the other processors' caches can't be updated atomically. Therefore it must be done with a breakpoint, which can be set by modifying a single byte. The breakpoint handler knows that this address belongs to ftrace, so it simply jumps over the instruction. Then the rest of the instruction is replaced with the call to ftrace_caller while the breakpoint is still set, and finally the breakpoint byte is removed.
The ftrace callbacks are registered with register_ftrace_function(); some are static, some are added dynamically. The ftrace_caller function is itself patched live to jump to the selected trace function; when tracing finishes, it is pointed back at ftrace_stub. If there is more than one trace function to execute, the callback is list_func, which iterates over all of them. But the different trace functions may have different filters, so each one pays the filtering cost on every call. Therefore 3.18 (or later) will introduce dynamic trampolines, which are used instead of the normal ftrace_caller so that list_func no longer has to do the filtering. However, freeing a dynamic trampoline is a problem, because you don't know when there are no more users of it. RCU comes to the rescue, because it solves exactly the same problem. [Check out the non-existent video for the real explanation – you'll have to make do with the slides :-)]
In the mcount function we don't have access to the function's parameters, because the call comes too late, after the prologue. gcc 4.6 added the -mfentry option, which emits a call to __fentry__ instead of mcount and places it at the very beginning of the function, so there we can look at the arguments. This is a powerful tool, because now you can do things like changing the return address in your tracer.