IRQs: the Hard, the Soft, the Threaded and the Preemptible – Alison Chaiken

This presentation is based on Alison's personal experience with the problems caused by interrupts in a real-time (RT) kernel. Many of her conclusions are backed by actual testing and tracing.

The hard IRQ runs right away when the interrupt arrives. The soft IRQ / bottom half runs some time later.

The first place to look for information about (hard) IRQs is /proc/interrupts, which lists per-CPU counts for each interrupt.
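As a rough illustration of what can be pulled out of /proc/interrupts, here is a minimal sketch of a parser. The function name and the SAMPLE text (with made-up counts) are my own, not something from the talk, and the exact columns vary between architectures.

```python
# Made-up sample in the usual /proc/interrupts layout (counts are invented).
SAMPLE = """\
           CPU0       CPU1
 17:       1200         34   IO-APIC  17-fasteoi   eth0
LOC:      90000      85000   Local timer interrupts
ERR:          0
"""


def parse_interrupts(text):
    """Return {irq_label: (per_cpu_counts, description)} from /proc/interrupts text."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields are the per-CPU counters; some rows
        # (such as ERR) have fewer counters than there are CPUs.
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return table
```

On a live system this would be called as parse_interrupts(open("/proc/interrupts").read()); taking two snapshots a second apart and comparing the counts shows which interrupts are actually firing.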

IPIs (inter-processor interrupts) are used mainly to coordinate work between CPUs. They are always software-generated.

ARM has a separate FIQ interrupt, which is not used by the kernel at the moment except in Android's fiq_debugger. In a secure ARM system, FIQ can only be handled on the secure side (especially relevant for ARMv8, where Linux is not designed to run on the secure side).

In RT kernels, interrupts are threaded (the same happens on a non-RT kernel when you put threadirqs on the kernel command line, but then no CPU affinity is possible). The resulting threads are called irq/NN-name. Once interrupts run in threads they become manageable: they have priorities, CPU affinity, etc. The interrupt handlers also become preemptible as a result. Interrupt threads normally get SCHED_FIFO priority 50, but some secondary handlers (see below) get priority 49.
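Since the IRQ threads are ordinary tasks, they can be inspected from userspace. The following is a minimal sketch, assuming a Linux /proc and the irq/NN-name naming described above; the helper names are hypothetical, and note that the kernel truncates task names to 15 characters, so long device names may appear cut off.

```python
import os
import re

# Matches the thread naming scheme "irq/NN-name".
IRQ_THREAD = re.compile(r"^irq/(\d+)-(.*)$")


def parse_irq_thread_name(comm):
    """Split 'irq/NN-name' into (irq_number, name); return None if it does not match."""
    m = IRQ_THREAD.match(comm)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)


def list_irq_threads(proc="/proc"):
    """Yield (pid, comm, rt_priority, cpus) for every IRQ thread found under `proc`."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "comm")) as f:
                comm = f.read().strip()
            if parse_irq_thread_name(comm) is None:
                continue
            pid = int(entry)
            prio = os.sched_getparam(pid).sched_priority
            cpus = sorted(os.sched_getaffinity(pid))
        except OSError:
            # The task exited while we were looking at it, or access was denied.
            continue
        yield pid, comm, prio, cpus
```

Iterating over list_irq_threads() on an RT kernel should show most threads at priority 50, which makes any thread that deviates from the default easy to spot.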

When a driver requests an IRQ, it can explicitly ask for a threaded handler (request_threaded_irq()). Threaded handlers are useful when the handling takes a long time, and they are allowed to sleep. The request still takes a hardirq handler argument, which lets a driver on a shared interrupt line detect immediately whether the interrupt is for it. On RT, both handlers become threads; the thread for the threaded handler is the secondary one and gets priority 49.

Some IRQs never become threads: timers, perf, … They are requested with IRQF_NO_THREAD.

Soft IRQs are a hodgepodge of more or less hardcoded things that are deferred: tasklets, kernel housekeeping (timer, sched, rcu), specific devices (net_tx, net_rx, block, block_iopoll). Softirq processing has a limited budget so that there is still CPU time left for real threads. Soft IRQs are polled after a hard IRQ, and at regular intervals by the system management thread (ksoftirqd, which runs at a low priority). The soft IRQ class (tasklet, net_tx, timer, …) is first flagged, and is actually run by local_bh_enable(). So current will always be either a hardirq (or a hardirq thread in RT) or ksoftirqd.
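The per-class softirq counters are exposed in /proc/softirqs, so a quick way to see which class is busy is to diff two snapshots. A minimal sketch, with hypothetical function names and made-up sample numbers:

```python
# Made-up snapshots in the /proc/softirqs layout (counts are invented).
BEFORE = """\
                    CPU0       CPU1
          HI:          1          0
       TIMER:       1000       2000
      NET_RX:        500         10
"""

AFTER = """\
                    CPU0       CPU1
          HI:          1          0
       TIMER:       1010       2020
      NET_RX:        900         10
"""


def parse_softirqs(text):
    """Return {softirq_class: [count_cpu0, count_cpu1, ...]}."""
    table = {}
    for line in text.strip("\n").splitlines()[1:]:   # skip the CPU header row
        name, _, rest = line.partition(":")
        table[name.strip()] = [int(x) for x in rest.split()]
    return table


def softirq_delta(before, after):
    """Total increase per softirq class, summed over all CPUs."""
    return {name: sum(after[name]) - sum(before.get(name, []))
            for name in after}
```

On a real system the two snapshots would come from reading /proc/softirqs twice with a sleep in between.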

In RT, the only softirqs that are run by local_bh_enable() are the ones corresponding to the current hardirq thread. On a non-RT kernel, any local_bh_enable() will run all pending softirqs (until the budget runs out). In RT, softirqs can be nested.

Tasklets are the generic softirq mechanism for devices. They come in a high-priority and a normal-priority variant.

The message “sched: RT throttling activated” means that the ksoftirqd thread didn’t run.

Problem with ksoftirqd in RT: the ksoftirqd thread can get starved by interrupt threads, so that timer events destined for userspace are missed. Solution: split off the timer softirq into its own thread, ktimersoftd. This thread can be given a higher FIFO priority.

To investigate a function, you can use existing trace points, but if they don’t exist you can easily create a new kprobe based on the example code in samples/kprobes.

ftrace is the first tool to use for analysis, but it generates a lot of output (which interferes with RT behaviour). eBPF combines the kprobes and uprobes features with the possibility of filtering. eBPF filters can be compiled with BCC, which uses the clang rewriter to generate code; unfortunately, that doesn’t have arm32 support yet. BCC has lots of examples that you can start from.
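To give an idea of what a BCC-based probe looks like, here is a minimal sketch that counts calls to a kernel function per calling thread. It assumes the bcc Python package is installed and that the script runs as root; the names BPF_PROGRAM, count_call and count_calls are my own, and the default symbol is just an example that must exist on the running kernel.

```python
import time

# eBPF program (restricted C) that BCC compiles at load time.
BPF_PROGRAM = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(calls, u32, u64);   // key: thread id of the current task, value: hit count

int count_call(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();   // lower 32 bits hold the thread id
    calls.increment(tid);
    return 0;
}
"""


def count_calls(symbol="__napi_schedule", seconds=5):
    """Attach a kprobe to `symbol`, wait `seconds`, return {thread_id: hits}."""
    from bcc import BPF   # imported lazily: needs the bcc package and root

    b = BPF(text=BPF_PROGRAM)
    b.attach_kprobe(event=symbol, fn_name="count_call")
    time.sleep(seconds)
    return {k.value: v.value for k, v in b["calls"].items()}
```

Because the counting happens inside the kernel and only the summary is read back, this produces far less output than a raw function trace. The thread ids can be mapped back to names through /proc/<tid>/comm.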

Example of tracing: NAPI polling. NAPI was introduced to deal with high-performance network interfaces, which can potentially generate thousands of interrupts per second and thereby essentially bring down the system. NAPI temporarily disables the interrupt and switches to polling instead. The kernel is not going to look at those packets immediately anyway (because other things have higher priority at that time), so it makes no sense to handle the interrupts. Packets may even be dropped completely if no time is left for the polling.

A NAPI interrupt handler disables the network interrupt and calls __napi_schedule(). The NAPI _rx function checks whether the number of packets processed was smaller than the budget, and if so it enables interrupts again. Question: when does NAPI switch from interrupts to polling? To test that with BCC, use the existing stackcount example on the e100_receive_skbs function. This shows which callers call the function and how often. However, on non-RT kernels, it is possible that the softirq was raised from the hardirq but only runs from ksoftirqd context. So instead, trace __raise_softirq_irqoff_ksoft, which is the real switch from interrupt to polling behaviour.
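The interplay between the interrupt handler and the budget check can be made concrete with a toy simulation. This is not kernel code: the class and method names are hypothetical, and the model only mirrors the two steps described above (mask the interrupt and schedule polling; re-enable the interrupt when a poll handles fewer packets than the budget).

```python
from collections import deque


class ToyNapiDevice:
    """Toy model of a NAPI network device; a sketch, not kernel code."""

    def __init__(self, budget=64):
        self.budget = budget          # max packets handled per poll() call
        self.rx_queue = deque()       # packets waiting in the simulated ring
        self.irq_enabled = True
        self.poll_scheduled = False
        self.irq_count = 0

    def packet_arrives(self, pkt):
        self.rx_queue.append(pkt)
        if self.irq_enabled:
            self.hard_irq()

    def hard_irq(self):
        # Interrupt handler: mask the device interrupt and schedule polling
        # (stands in for the call to __napi_schedule()).
        self.irq_count += 1
        self.irq_enabled = False
        self.poll_scheduled = True

    def poll(self):
        """Process up to `budget` packets and return how many were handled."""
        done = 0
        while self.rx_queue and done < self.budget:
            self.rx_queue.popleft()
            done += 1
        if done < self.budget:
            # Ring drained before the budget ran out: back to interrupt mode.
            self.poll_scheduled = False
            self.irq_enabled = True
        return done
```

Feeding ten packets into ToyNapiDevice(budget=4) raises a single interrupt; the first two polls each use up the full budget and stay in polling mode, and the third poll handles the remaining two packets and re-enables the interrupt.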

The main work of an RT engineer is to play with the priorities and CPU affinities of userspace threads and IRQ threads. This is mostly an artisanal task with a lot of testing. Generally, IRQ threads should have higher priority than the userspace threads, because they provide the actual RT input and output of the system. CPU affinity helps a lot to make sure that the random stuff stays away from the RT control loop. Ideally this should be automatable, by running tests with various priorities and affinities.
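As a starting point for automating this, here is a minimal sketch using the scheduling calls in Python's os module. The function names are hypothetical; apply_config() needs root (or CAP_SYS_NICE) to switch a task to SCHED_FIFO, and the generator only encodes the rule of thumb above that the IRQ thread should outrank the userspace thread.

```python
import itertools
import os


def apply_config(pid, rt_priority, cpus):
    """Pin task `pid` to the CPU set `cpus` and give it SCHED_FIFO at `rt_priority`."""
    os.sched_setaffinity(pid, cpus)
    os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(rt_priority))


def candidate_configs(irq_prios, user_prios, cpu_sets):
    """Yield (irq_prio, user_prio, cpus) combinations worth testing.

    Only combinations where the IRQ thread has a higher priority than
    the userspace thread are kept.
    """
    for irq_prio, user_prio, cpus in itertools.product(
            irq_prios, user_prios, cpu_sets):
        if irq_prio > user_prio:
            yield irq_prio, user_prio, cpus
```

A test harness would loop over candidate_configs(), call apply_config() for the IRQ thread and the userspace thread, run the latency test, and record the result for each combination.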

