Different types of power management.
Simple on/off, but when off you can’t do anything. Suspend and runtime idle are in-between states that still don’t allow you to do any work but that allow you to resume more quickly. Runtime idle is opportunistic, i.e. automatic, both for d-states (devices) and c-states (CPU). Suspend is initiated by userspace: user tasks are frozen and devices are forced into idle. This is an S-state (system state). It could be just a CPU idle state as well, but since tasks are frozen, you get fewer wakeups.
Active power management will reduce power consumption without turning off, i.e. you continue working. = DVFS = P-states, governed by cpufreq. Also device active power management, e.g. PCIe ASPM. GPU is completely divorced from the rest of the system.
P-states != frequency (also voltage); P-states != power (also temp governs power, so we don’t really control power).
How to choose the right p-state: the governor. Ultimately it is a user decision. The intel_pstate driver also acts as its own governor, with just powersave and performance; the acpi_cpufreq driver also works on x86 and uses separate (discrete) governors, so there are more governor options.
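To see which governor is active, one option is to read the cpufreq files in sysfs. The following is a minimal sketch, assuming the usual Linux layout under /sys/devices/system/cpu; the function name read_governors and its sysfs_root parameter are made up for this illustration, and scaling_available_governors may be missing on some setups, so the code tolerates that.

```python
from pathlib import Path


def read_governors(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return (current_governor, available_governors) for one CPU.

    Returns None when the CPU has no cpufreq directory (for example in a VM).
    The function name and the sysfs_root parameter are illustrative only.
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    current = base / "scaling_governor"
    available = base / "scaling_available_governors"
    if not current.is_file():
        return None
    # The list of available governors is optional, so fall back to empty.
    choices = available.read_text().split() if available.is_file() else []
    return current.read_text().strip(), choices
```

Calling read_governors() with no arguments inspects cpu0 on the running machine. Changing the governor means writing one of the listed names to scaling_governor, which normally requires root.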
Race to idle: just stay in the performance state, do things as quickly as possible, and rely on runtime idle to save power. This only works if runtime idle indeed consumes significantly less power.
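The condition on idle power can be checked with simple arithmetic. The sketch below compares the energy used over a fixed time window when the same work is done fast and then idled, versus done slowly. All wattages and frequencies are invented numbers for illustration, not measurements, and the model assumes the work takes a fixed number of cycles.

```python
def energy_joules(work_cycles, freq_hz, active_w, idle_w, window_s):
    """Energy over a window: busy at active_w, then idle at idle_w."""
    busy_s = work_cycles / freq_hz
    if busy_s > window_s:
        raise ValueError("work does not fit in the window at this frequency")
    return active_w * busy_s + idle_w * (window_s - busy_s)


# Hypothetical operating points: 1e9 cycles of work in a 1 s window.
WORK = 1e9
FAST = dict(freq_hz=2e9, active_w=6.0)  # finishes in 0.5 s, then idles
SLOW = dict(freq_hz=1e9, active_w=4.0)  # busy for the whole window

# Deep idle state (0.5 W): racing to idle wins.
fast_deep = energy_joules(WORK, idle_w=0.5, window_s=1.0, **FAST)
slow_deep = energy_joules(WORK, idle_w=0.5, window_s=1.0, **SLOW)

# Shallow idle state (3 W): racing to idle loses.
fast_shallow = energy_joules(WORK, idle_w=3.0, window_s=1.0, **FAST)
slow_shallow = energy_joules(WORK, idle_w=3.0, window_s=1.0, **SLOW)
```

With these made-up numbers the fast point uses 3.25 J against 4.0 J when idle is cheap, but 4.5 J against 4.0 J when idle is expensive.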
Powersave attempts to balance performance with energy saving. It looks at CPU utilization (= load at this specific p-state, will probably go down when going to a higher p-state) and capacity (= maximum performance in highest p-state) to determine whether to increase or decrease p-state.
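The utilization-versus-capacity decision can be sketched as a toy governor. This is not the actual powersave algorithm of intel_pstate; it is a simplified illustration that assumes work scales linearly with frequency, and the function name pick_frequency and the 0.8 target are arbitrary choices for the example.

```python
def pick_frequency(busy_fraction, cur_freq, freqs, target=0.8):
    """Pick the lowest frequency that keeps the CPU at most `target` busy.

    busy_fraction is the load measured at cur_freq (0.0 to 1.0).
    freqs is the list of available frequencies, in any consistent unit.
    """
    max_freq = max(freqs)
    # Convert the load at the current p-state into a share of full capacity.
    demand = busy_fraction * cur_freq / max_freq
    for f in sorted(freqs):
        # At frequency f the CPU offers f / max_freq of full capacity.
        if demand <= target * f / max_freq:
            return f
    return max_freq
```

For example, with frequencies of 800, 1600 and 2400 MHz, a CPU that is 90% busy at 800 MHz is moved up to 1600 MHz, while one that is 20% busy at 2400 MHz is moved down to 800 MHz. A CPU that is fully busy at a low frequency may need more than this estimate suggests, since the measured load is capped at 100%.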
Intel p-states: depending on the number of cores that are active, the power budget is redistributed over the active cores so they can use it to increase the maximum allowed frequency (= “Turbo mode”). Under the lowest frequency, there is still a “Thermal zone” where your CPU will be pushed to exceptionally low frequencies when it’s overheating. Recent (Haswell) processors have an integrated voltage regulator so each core can get a separate voltage (otherwise the cores share one voltage, which limits what you gain from running different cores at different frequencies).
Hardware coordination of p-states: the OS asks for a frequency, but the hardware itself decides, across all the cores, which frequency is actually selected. The OS should therefore check afterwards with the hardware what the real frequency is. intel_pstate does this, but acpi_cpufreq just reports the (wrong) requested frequency.
Looking at capacity and utilization isn’t always the right metric, e.g. when waiting for something a higher p-state will just wait faster. Also the sampling rate is a limiting factor: you may miss interesting events. It’s also not always clear if scaling is worth it. There’s a big risk of jitter when the CPU is interacting with another element (e.g. the GPU) and intermittently waiting: when it goes back to having work to do, it will take a long time to go back to a high p-state. Hardware support can solve these issues, because it can sample at a high rate and it can look at GPU and PCI transactions. = Intel Speed Shift Technology = HWP. You get the highest and lowest frequency, but also a guaranteed frequency and a most efficient frequency. The latter is calculated based on temperature, race-to-idle information and HW counters that evaluate the benefit; this gives Pe = most efficient frequency. The OS provides a Pa value = how aggressive it should be. The algorithm operates between Pe and Pa. It’s also possible to actually set the frequency between Pe and Pa from software, but then of course the benefit is largely gone. You can control the minimum and maximum p-state from userspace.
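With the intel_pstate driver, the userspace limits are exposed as percentages in sysfs. The sketch below only reads them; it assumes the min_perf_pct and max_perf_pct files under /sys/devices/system/cpu/intel_pstate, and the function name read_pstate_limits and its sysfs_dir parameter are made up for this illustration.

```python
from pathlib import Path


def read_pstate_limits(sysfs_dir="/sys/devices/system/cpu/intel_pstate"):
    """Return the min/max performance limits in percent, or None if absent.

    The function name and the sysfs_dir parameter are illustrative only.
    """
    base = Path(sysfs_dir)
    limits = {}
    for name in ("min_perf_pct", "max_perf_pct"):
        f = base / name
        if not f.is_file():
            # intel_pstate is not the active driver on this machine.
            return None
        limits[name] = int(f.read_text().strip())
    return limits
```

Writing a new percentage to either file changes the limit and normally requires root.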
Side track: idle injection. To control thermal issues, it’s not enough to go to the lowest frequency, since that may actually be less energy-efficient. Instead, you should inject idle time so race-to-idle still works.
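How much idle time to inject can be estimated from the power numbers. The sketch below is a simple linear model, assuming average power is a weighted mix of one active power and one idle power; the function names and the example wattages are hypothetical.

```python
def idle_fraction_for_budget(active_w, idle_w, budget_w):
    """Fraction of time to force idle so average power meets budget_w.

    Solves budget = active * (1 - x) + idle * x for x, clamped to [0, 1].
    """
    if not idle_w < active_w:
        raise ValueError("idle power must be below active power")
    if budget_w >= active_w:
        return 0.0  # already within budget, no injection needed
    if budget_w <= idle_w:
        return 1.0  # budget cannot be met even when fully idle
    return (active_w - budget_w) / (active_w - idle_w)


def split_period(period_s, idle_fraction):
    """Split one injection period into (run_seconds, idle_seconds)."""
    idle_s = period_s * idle_fraction
    return period_s - idle_s, idle_s
```

For instance, with 10 W active, 1 W idle and a 5.5 W budget, the CPU has to be forced idle half of the time, so a 10 ms period becomes 5 ms of running and 5 ms of idle.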