CERN generates 5 PB/y of data from the LHC. It is processed by CMSSW, a large and diverse application set with a Python configuration language, 600 libraries, 5M SLOC, and 2 GB RSS in a typical run. Computation is spread out over 170 sites and 350K cores.
The first problem in any software, and especially in CMSSW, is memory handling, in particular memory leaks. Valgrind wasn’t finished in 2003, so they wrote MemProfLib, which inserts a malloc hook to generate XML output. It turned up some ignominious bugs, hence IgProf.
- Performance & memory profiling
- No kernel support or root privileges
- Low overhead
IgHook rewrites the object code and inserts hooks into mallocs, or uses SIGPROF/SIGALRM for performance profiling. Instrumentation is done by replacing the initial instructions of a function with a jump to a trampoline (that copies the original code and jumps to the instrumentation function). This is not simple, due to things like relative jumps and the complexity of x86 assembly. Fortunately, only specific functions need to be instrumented, and it is easier to make it work for just these.
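To make the patching step concrete, here is a minimal sketch that only computes the bytes involved; it does not patch a live process. The layout is an assumption, not IgHook's actual one: the entry jumps straight to the hook, and the trampoline holds the displaced prologue plus a jump back into the rest of the function. All addresses and the names `jmp_rel32` and `build_patch` are hypothetical. The caller must supply a `prologue_len` that ends on an instruction boundary, and the sketch does no fix-up of relative operands inside the displaced instructions, which is exactly the hard part mentioned above.

```python
import struct

JMP_LEN = 5  # x86 near jump: opcode E9 followed by a signed 32-bit displacement


def jmp_rel32(src, dst):
    """Encode 'jmp dst' placed at address src (displacement is relative to src + 5)."""
    return b"\xE9" + struct.pack("<i", dst - (src + JMP_LEN))


def build_patch(func_addr, func_bytes, prologue_len, hook_addr, tramp_addr):
    """Return (patched function bytes, trampoline bytes).

    Assumes prologue_len >= 5 and that it covers whole instructions
    with no relative operands (no fix-up is attempted).
    """
    if prologue_len < JMP_LEN:
        raise ValueError("need at least 5 bytes to hold the jump")
    prologue = func_bytes[:prologue_len]
    # Trampoline: displaced prologue, then jump back to the rest of the function.
    trampoline = prologue + jmp_rel32(tramp_addr + prologue_len,
                                      func_addr + prologue_len)
    # Patched entry: jump to the hook, pad the leftover bytes with NOPs.
    patched = (jmp_rel32(func_addr, hook_addr)
               + b"\x90" * (prologue_len - JMP_LEN)
               + func_bytes[prologue_len:])
    return patched, trampoline


# push rbp; mov rbp,rsp; sub rsp,0x10; ret  (prologue is 1 + 3 + 4 = 8 bytes)
code = bytes.fromhex("554889e54883ec10c3")
patched, trampoline = build_patch(0x1000, code, 8, hook_addr=0x2000, tramp_addr=0x3000)
```

In this layout the hook would call the trampoline whenever it wants to run the original function.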
Memory profiling is similar to Massif. To find leaks, explicit dumps are inserted at points where (almost) all memory is supposed to be freed, e.g. when processing of one event is done. It can also generate a graphical heap dump to detect fragmentation.
The same concept can also be applied to other resources: read/write, exceptions. There is also a tool that poisons memory to detect memory that is allocated but not actually needed (e.g. an IO buffer that is too large).
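The checkpoint idea can be sketched with a toy model: a tracker that stands in for the malloc/free hooks, plus a dump taken after each event. This is not IgProf's implementation; `AllocTracker`, `growth`, `process_event` and the call-site names are hypothetical. A call site whose live bytes keep growing from one end-of-event dump to the next is a leak candidate.

```python
from collections import Counter


class AllocTracker:
    """Toy stand-in for malloc/free hooks: tracks live bytes per call site."""

    def __init__(self):
        self._live = {}      # fake pointer -> (call site, size)
        self._next_ptr = 1

    def malloc(self, site, size):
        ptr = self._next_ptr
        self._next_ptr += 1
        self._live[ptr] = (site, size)
        return ptr

    def free(self, ptr):
        self._live.pop(ptr, None)

    def dump(self):
        """Snapshot of live bytes per call site."""
        snapshot = Counter()
        for site, size in self._live.values():
            snapshot[site] += size
        return snapshot


def growth(before, after):
    """Call sites whose live bytes grew between two dumps."""
    return {site: after[site] - before[site]
            for site in after if after[site] > before[site]}


def process_event(tracker):
    """Hypothetical event loop body: one temporary buffer, one leaked entry."""
    tmp = tracker.malloc("Reco::fit", 4096)
    tracker.malloc("Cache::insert", 128)   # never freed
    tracker.free(tmp)


tracker = AllocTracker()
process_event(tracker)
dump1 = tracker.dump()       # checkpoint after event 1
process_event(tracker)
dump2 = tracker.dump()       # checkpoint after event 2
leaks = growth(dump1, dump2)
```

Here `leaks` only contains the site that kept its allocation across the event boundary; the temporary buffer never shows up in a dump.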
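The poisoning trick can be illustrated in a few lines, assuming a fill byte of `0xA5` (an arbitrary choice, as are the names `poisoned_buffer` and `unused_tail`). The buffer is filled with the pattern at allocation time; whatever still holds the pattern at the end was probably never written. This is a heuristic: real data that happens to equal the pattern is miscounted as unused.

```python
POISON = 0xA5  # arbitrary fill byte, assumed unlikely to appear in real data


def poisoned_buffer(size):
    """Allocate a buffer pre-filled with the poison pattern."""
    return bytearray([POISON]) * size


def unused_tail(buf):
    """Number of trailing bytes that still hold the poison pattern."""
    count = 0
    for byte in reversed(buf):
        if byte != POISON:
            break
        count += 1
    return count


io_buffer = poisoned_buffer(1024)
io_buffer[:100] = b"x" * 100        # the program only ever writes 100 bytes
wasted = unused_tail(io_buffer)
```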
The performance profiler just raises SIGALRM every 10 ms and records the backtrace. It turns out that most of the time is spent (de)allocating memory. But otherwise there is no real hotspot: most of the time is spent in functions that individually take less than 0.5% of the time.
Backtraces are generated with libunwind (backtrace() was not reliable enough). But calling a full unwind was too expensive, so they implemented and upstreamed a fast path with a simple stack walk.
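The same sampling scheme can be sketched in Python as an analogy; IgProf itself does this in native code. The sketch arms an interval timer that fires every 10 ms of CPU time and records the current call stack in the signal handler. It uses `ITIMER_PROF`/`SIGPROF` rather than `SIGALRM`, works only on Unix and in the main thread, and the names `profile` and `busy_work` are made up for the example.

```python
import collections
import signal
import time

samples = collections.Counter()   # call stack (outermost first) -> hit count


def _on_tick(signum, frame):
    """Signal handler: record the call stack that was interrupted."""
    stack = []
    while frame is not None:
        stack.append(frame.f_code.co_name)
        frame = frame.f_back
    samples[tuple(reversed(stack))] += 1


def profile(fn, interval=0.01):
    """Run fn() while sampling its stack every `interval` seconds of CPU time."""
    samples.clear()
    previous = signal.signal(signal.SIGPROF, _on_tick)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        return fn()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)   # disarm the timer
        signal.signal(signal.SIGPROF, previous)


def busy_work(cpu_seconds=0.3):
    """Burn CPU so that the profiling timer has something to sample."""
    end = time.process_time() + cpu_seconds
    n = 0
    while time.process_time() < end:
        n += 1
    return n


profile(busy_work)
hottest = samples.most_common(3)
```

Each entry of `samples` is a full call path, so a flat per-function view, like the "no real hotspot" observation above, comes from summing hits per function name.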
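As an illustration of what a "simple stack walk" can look like, here is a toy frame-pointer walk over simulated memory. It assumes the x86-64 convention with frame pointers kept: the saved caller frame pointer is at `[fp]` and the return address at `[fp + 8]`. Whether the upstreamed libunwind fast path works this way is not stated in the talk notes; `walk_frames` and all addresses are hypothetical.

```python
WORD = 8  # bytes per stack slot on x86-64


def walk_frames(memory, fp, pc, max_depth=64):
    """Follow the chain of saved frame pointers and collect return addresses.

    memory maps addresses to 64-bit values; fp and pc are the current
    frame pointer and program counter. A saved frame pointer of 0 ends the walk.
    """
    trace = [pc]
    while fp != 0 and len(trace) < max_depth:
        return_address = memory[fp + WORD]
        fp = memory[fp]                  # move up to the caller's frame
        trace.append(return_address)
    return trace


# Two frames: the innermost at 0x7000, its caller at 0x7040 (outermost).
stack_memory = {
    0x7000: 0x7040, 0x7008: 0x401234,
    0x7040: 0x0,    0x7048: 0x400ABC,
}
trace = walk_frames(stack_memory, fp=0x7000, pc=0x402000)
```

The walk touches two words per frame and needs no unwind tables, which is why it is cheap compared with a full unwind.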
PAPI (Performance API) allows you to read hardware counters. IgProf can (optionally) use these to augment the SIGPROF-based performance measuring.
Output is generated by a postprocessing tool as either text or a web interface. In heavily templated C++ code, it is necessary to do symbol renaming to merge similar call paths. Similarly, you can merge by library, which is really useful when there are 600 of them. They grouped some of the heavy libraries together and applied LTO to them, which gave them a good boost.
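The merging step can be sketched as follows. This is not the actual postprocessing tool: `strip_template_args`, `merge_counts`, the symbols and the library mapping are all made up for illustration. The renaming collapses every template argument list to `<>` so that different instantiations land in one bucket. It is deliberately naive and would mangle symbols such as `operator<`.

```python
from collections import Counter


def strip_template_args(symbol):
    """Collapse every (possibly nested) template argument list to '<>'."""
    out = []
    depth = 0
    for ch in symbol:
        if ch == "<":
            if depth == 0:
                out.append("<>")
            depth += 1
        elif ch == ">":
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def merge_counts(counts, rename):
    """Sum sample counts of all symbols that map to the same new name."""
    merged = Counter()
    for symbol, n in counts.items():
        merged[rename(symbol)] += n
    return merged


counts = {"Foo<int>::run()": 3, "Foo<double>::run()": 4, "bar()": 1}
by_symbol = merge_counts(counts, strip_template_args)

# Merging by library works the same way with a symbol -> library mapping.
library_of = {"Foo<int>::run()": "libFoo.so",
              "Foo<double>::run()": "libFoo.so",
              "bar()": "libBar.so"}
by_library = merge_counts(counts, library_of.get)
```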