Using SCHED_DEADLINE – Steven Rostedt, Red Hat

Different scheduling classes:

  • SCHED_OTHER: fair scheduling, uses nice value
  • SCHED_FIFO: the usual RT scheduler. Fixed priorities, tasks with the same priority get first come, first serve.
  • SCHED_RR: just there for POSIX reasons, it’s useless, it’s not really deterministic. E.g. if 3 tasks with the same RR priority are spread out over 2 CPUs, then one task will get 100% of 1 CPU and the other two will get 50% of the other CPU.

Example of SCHED_DEADLINE: imagine a controller for a nuclear power plant that needs .5s every 1s, a washing machine that requires 50ms every 200ms. With SCHED_FIFO, you should give the washing machine higher priority. If you give the power plant higher priority, the washing machine will certainly miss its deadline during the .5s that the power plant is busy. OTOH if the washing machine has highest priority, both processes will get their deadlines. This is Rate Monotonic Scheduling. However, with RMS, the maximum allowed utilisation is way less than 100%, to allow all possible orderings for when the tasks start. For infinite number of tasks, the maximum utilisation converges to 70%. See the slides or video for a nice graphical example.

Earliest Deadline First (EDF) scheduling on the other hand is guaranteed to meet deadlines as long as utilisation is < 100%.

For implementing SCHED_DEADLINE, new syscalls had to be added: sched_getattr and sched_setattr. The syscalls have a size and a flags parameter for future extensions. The sched_attr struct has the usual sched fields like sched_policy, and also sched_runtime, sched_deadline and sched_period. These fields are in nanoseconds. However, if better resolution than 1ms is needed, you need to turn on SCHED_HRTIMERS in /proc/sys. sched_setattr will fail if the deadlines of everything together can’t be met.

sched_yield(): most use cases are buggy. In SCHED_OTHER it just gives up the current CPU time slice. In SCHED_OTHER/SCHED_RR, it allows a task of the same priority to run. E.g.  it doesn’t work to use sched_yield() in a pthread_mutex_trylock() loop to break deadlocks when taking two locks. In SCHED_DEADLINE, however, sched_yield() has a use: it tells the kernel that the current period is done so the next deadline process can get the CPU. The sched_yield in this case will sleep until the period set in sched_attr expires.

If the deadline is not equal to the period (which is often the case), then the maximum CPU utilisation should consider only the deadlines, because multiple tasks may trigger at the same time and need to satisfy the same deadline.

In multiprocessors, EDF can only guarantee deadlines if the total utilisation is smaller than 1, i.e. it doesn’t really use the multiplce CPUs. That’s because all the earliest deadline tasks may occupy all the CPUs for a relatively short time, but which leaves insufficient time for a longer-running task. It’s similar to the different periods.

One solution is to partition tasks over the CPUs and lock them on a particular (set of) CPUs. However, optimal partitioning is an NP-complete problem so not usable for a scheduler.

However, special cases can be guaranteed with global EDF, by looking at the maximum per-task utilisation. That’s why sched_runtime is there, to verify the global utilisation.

Limits of SCHED_DEADLINE: You can’t specify partial affinity otherwise the global scheduling no longer works. You should account for migration overheads in the runtime. You can’t fork because the schedule of the tasks has been fixed. You have to specify the absolute worst case execution time for the runtime, which can be very much worse than the typical execution time.

To use affinities with SCHED_DEADLINE, you have to create two mutually exclusive cpusets which internally have load balancing and then turn off global load balancing. Steven has some C code that does this, but it’s not nice.

On the way to mainline: Greedy Reclaim of Unused Bandwidth: if there is still unused CPU time, SCHED_DEADLINE tasks can use it even if they are already above their runtime limit. This allows for WCETs that have occasional spikes above sched_runtime.

SCHED_DEADLINE is also a way to control the CPU bandwidth of a process: when the budget runs out, the process gets throttled until its period comes up again – and this is guaranteed up-front when the task is started.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s