The Technology Network

Tangled In The Threads
By Moshe Bar,
Dec 28, 1999 (7:26 AM)

Last month, we started a new series on Linux kernel internals. In that first part, we looked at how Linux manages processes and why, in many ways, Linux is better at creating and maintaining processes than many commercial Unixes. This series on Linux internals is, by the way, the fruit of a tight collaboration with some of the most experienced kernel hackers in the Linux project. Without the contribution of people like Andrea Arcangeli in Italy (VM contributor and SuSE employee), Ingo Molnar (scheduler contributor) and many others, this series wouldn't be possible. Many thanks to all of them, but especially to Andrea Arcangeli, who has shown a lot of patience in answering many of my questions.

This time, we dwell a bit on the subject of scheduling. Much to my surprise, here again, Linux goes the unorthodox way and disregards conventional wisdom in kernel theory. The results are excellent. Let's see how.

Scheduling Classes

What kind of scheduling classes are there in Linux?

In Linux 2.2.x there are three classes of processes, as can be seen from the data definition for the scheduler:

------- cut and paste from linux/include/linux/sched.h ---------
/*
 * Scheduling policies
 */
#define SCHED_OTHER		0
#define SCHED_FIFO		1
#define SCHED_RR		2
------- cut and paste from linux/include/linux/sched.h ---------

SCHED_OTHER tasks are the normal user tasks (default).

Tasks running in SCHED_FIFO will _never_ be pre-empted. They will leave the CPU _only_ to wait for synchronous kernel events, or if an explicit sleep or reschedule has been requested from userspace.

Tasks running in SCHED_RR are real-time too, but they will leave the CPU if there is another real-time task in the runqueue. So the CPU power will be distributed between all SCHED_RR tasks.

If at least one real-time task is running, no SCHED_OTHER task will be able to run on any CPU.

Each RT task has an rt_priority, so the SCHED_RR class is allowed to distribute the CPU power between all the SCHED_RR tasks at will. The rt_priority of the SCHED_RR class works exactly like the normal priority field of the SCHED_OTHER (default) class.

Only the root user can change the class of the current task via the sched_setscheduler syscall.

One of the tasks of a kernel is to make sure the system remains firmly under its control, even in the face of misbehaving programs. One such misbehaving program might, for instance, fork() (that is, create) too many processes too quickly, so that the kernel becomes so busy with itself that it cannot cater to its other responsibilities. I found out that Linux has no limit on how fast user-land programs can spawn children. HP-UX, Solaris and AIX all have a limit of one fork() per processor tick (called a jiffie under Linux). The following patch will allow a maximum of one fork() per jiffie (one jiffie is usually 1/100 sec, except on the Alpha architecture where it is 1/1024):

--- 2.3.26/kernel/fork.c	Thu Oct 28 22:30:51 1999
+++ /tmp/fork.c	Tue Nov  9 01:34:36 1999
@@ -591,6 +591,14 @@
	int retval = -ENOMEM;
	struct task_struct *p;
+	static long last_fork;
+	while (time_after(last_fork+1, jiffies))
+	{
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(1);
+	}
+	last_fork = jiffies;

 	if (clone_flags & CLONE_PID) {
 		/* This is only allowed from the boot up thread */

Threads, as we saw in the first part of this series, are necessary to give your process the possibility to make use of multiple processors. Linux doesn't really make any distinction between a process and a thread from a memory management and scheduling point of view. Some operating systems, like Solaris, manage threads within the user process by means of a thread scheduling library. The kernel then only sees the process and doesn't know which thread -- if any -- is really executing inside the user process. This saves the kernel from managing lists with thousands of entries for each thread of each process. Obviously, the threads emulated on top of one single user process won't be allowed to run concurrently on SMP, so the userspace approach won't scale very well on an SMP machine. Threading is strictly necessary only when all the threads are CPU bound and not mainly I/O oriented. And if all the threads are CPU bound, then you definitely want to scale SMP-wise.

Using threads only to wait for events is overkill (having threads sleeping is a waste of resources and of performance). Almost all kernel subsystems in Linux (such as TCP/IP) offer async event registration. Using async events via the SIGIO signal is a kind of irq-driven handling, compared to the polled handling done using threads.

With the userspace approach you'll at least still avoid the TLB (translation look-aside buffer) flushing as all the threads will share the same memory view.

The advantage of having threads managed in userspace (through the threads library) is that the scheduling CPU cost is paid in userspace. It is true that in userspace you may choose to implement a very fast and unfair round-robin scheduler that may cut down the scheduling cost, compared to the clever (but more expensive) kernel scheduler. Still, it is probably not worth using the userspace approach, as the SMP-scaling issue is a big showstopper.

Speaking of SMP: as of Linux 2.4, I found there is no way to declare the processor affinity of any given userspace process. This is clearly a deficiency that needs to be addressed ASAP, although it is not clear to me right now where this could best be done. The scheduler could keep track of the CPU affinity declaration of a process, or it could just determine a preferred CPU for a process by itself. Whatever the solution might be in the end, CPU affinity is certainly an urgent requirement, and I forwarded it to Linus Torvalds recently.

The 2.2.x SMP kernel scheduler has some bugs that make it sometimes less effective than the UP (i.e. uniprocessor) scheduler. Nevertheless, Andrea Arcangeli fixed all such bugs and rewrote the heuristics from scratch, and the SMP scheduler now gives an impressive SMP improvement under load (the SMP changes are just in the 2.3.x kernels, and the plan is to integrate them into 2.2.x too).

You can get the patch here which can be used against 2.2.11 and 2.2.12 and speeds up both kernels on SMP systems.

The patch was merged into 2.3.15, but not yet into 2.2.x because it's a performance issue only; hopefully it will be soon. If you have an SMP machine, go get the patch, enjoy the performance gain, and tell the world about it! Thank you, Andrea, for this nice hack.

I tested the patch on my quad Xeon VA Linux Systems machine in my labs and it was rock stable.

The SMP scheduler heuristic works as a function of (in no particular order):

- the idle CPUs
- the last CPU the woken-up task was running on
- the MM of the task (for optimizing kernel-thread reschedules)
- the goodness of the tasks running on the busy CPUs
- the time necessary to invalidate the L2 cache on the running CPU (cacheflush_time)
- the average timeslice (avg_slice) of the woken-up task (that is, how long the task usually runs before going back to sleep)

The algorithm collects the above data and chooses the best CPU on which to reschedule the woken-up task. There are two paths involved in the Linux scheduler behaviour:

1. schedule() itself: the running/current task is a SCHED_OTHER task that expired its timeslice (so the kernel runs a schedule() while returning from the timer irq, to switch to the next running task).

2. reschedule_idle(): a task got a wakeup (usually from an irq), and so we try to reschedule the woken-up task to the best CPU by invoking a schedule() on it (it's a kind of controlled schedule()).

Both 1. and 2. share the 'goodness()' function. The goodness() function can be considered the core of the SMP scheduler.

It calculates the goodness of a task as a function of:

- the task currently running
- the task that wants to run
- the current CPU

This is the source of the goodness function:

------- cut and paste from linux/kernel/sched.c ---------
/*
 * This is the function that decides how desirable a process is..
 * You can weigh different processes against each other depending
 * on what CPU they've run on lately etc to try to handle cache
 * and TLB miss penalties.
 *
 * Return values:
 *	 -1000: never select this
 *	     0: out of time, recalculate counters (but it might still be
 *		selected)
 *	   +ve: "goodness" value (the larger, the better)
 *	 +1000: realtime process, select this.
 */
static inline int goodness(struct task_struct * p, int this_cpu,
			   struct mm_struct *this_mm)
{
	int weight;

	/*
	 * Realtime process, select the first one on the
	 * runqueue (taking priorities within processes
	 * into account).
	 */
	if (p->policy != SCHED_OTHER) {
		weight = 1000 + p->rt_priority;
		goto out;
	}

	/*
	 * Give the process a first-approximation goodness value
	 * according to the number of clock-ticks it has left.
	 *
	 * Don't do any other calculations if the time slice is
	 * over..
	 */
	weight = p->counter;
	if (!weight)
		goto out;

#ifdef __SMP__
	/* Give a largish advantage to the same processor... */
	/* (this is equivalent to penalizing other processors) */
	if (p->processor == this_cpu)
		weight += PROC_CHANGE_PENALTY;
#endif

	/* .. and a slight advantage to the current MM */
	if (p->mm == this_mm)
		weight += 1;
	weight += p->priority;

out:
	return weight;
}
------- cut and paste from linux/kernel/sched.c ---------

A plain schedule() works based only on 'goodness()'. As you can see, a plain schedule() is SMP-aware only through goodness(). Notice that the goodness of the potential next task increases if its last CPU is the current CPU.

Nevertheless, reschedule_idle() is far more critical for CPU affinity and scheduler latencies (for example, if you comment out reschedule_idle(), the scheduler latency will become infinite). Also, reschedule_idle() takes care of the cacheflush time and of the task's average timeslice, and this is where the really interesting part of the SMP scheduler lies. In UP, reschedule_idle() is not that interesting compared to the SMP version.

This is the reschedule_idle() implementation, taken from 2.3.26 (soon 2.4):

------- cut and paste from linux/kernel/sched.c ---------
static void reschedule_idle(struct task_struct * p)
{
#ifdef __SMP__
	int this_cpu = smp_processor_id(), target_cpu;
	struct task_struct *tsk, *target_tsk;
	int cpu, best_cpu, i;
	unsigned long flags;

	spin_lock_irqsave(&runqueue_lock, flags);

	/*
	 * shortcut if the woken up task's last CPU is
	 * idle now.
	 */
	best_cpu = p->processor;
	target_tsk = idle_task(best_cpu);
	if (cpu_curr(best_cpu) == target_tsk)
		goto send_now;

	target_tsk = NULL;
	for (i = 0; i < smp_num_cpus; i++) {
		cpu = cpu_logical_map(i);
		tsk = cpu_curr(cpu);
		if (tsk == idle_task(cpu))
			target_tsk = tsk;
	}
	if (target_tsk && p->avg_slice > cacheflush_time)
		goto send_now;

	tsk = cpu_curr(best_cpu);
	if (preemption_goodness(tsk, p, best_cpu) > 0)
		target_tsk = tsk;

	/*
	 * found any suitable CPU?
	 */
	if (!target_tsk)
		goto out_no_target;

send_now:
	target_cpu = target_tsk->processor;
	target_tsk->need_resched = 1;
	spin_unlock_irqrestore(&runqueue_lock, flags);
	/*
	 * the APIC stuff can go outside of the lock because
	 * it uses no task information, only CPU#.
	 */
	if (target_cpu != this_cpu)
		smp_send_reschedule(target_cpu);
	return;

out_no_target:
	spin_unlock_irqrestore(&runqueue_lock, flags);
	return;
#else /* UP */
	int this_cpu = smp_processor_id();
	struct task_struct *tsk;

	tsk = cpu_curr(this_cpu);
	if (preemption_goodness(tsk, p, this_cpu) > 0)
		tsk->need_resched = 1;
#endif
}
------- cut and paste from linux/kernel/sched.c ---------

The only final objective of reschedule_idle() is to call a schedule() on a CPU in order to reschedule the woken-up task onto it. We use goodness in reschedule_idle() because we want to predict the effect of the future schedule() that we'll send to that CPU. By predicting the effect of that future schedule(), we can choose the best CPU to reschedule to at wakeup time. This of course saves us the trouble of executing on a CPU with cold cache and TLB state.

If the CPU to reschedule is not the current one, we send a reschedule event via inter-CPU message passing (an SMP-IPI interrupt on i386).

To make it very clear: goodness() is the core of the Linux scheduler and it is SMP-aware, while reschedule_idle() is the real core of the clever SMP heuristics. Linux can only do user pre-emption. Linus Torvalds, it seems, doesn't believe in kernel pre-emption. That's not as bad as it may look; all is fine for semaphores. Critical sections protected by semaphores can be pre-empted at any time, as every contention will end in a schedule(), and there can't be any deadlock. But critical sections protected by fast spinlocks or by-hand locks cannot be pre-empted unless we block the timer irq. So all spinlocks should really be irq-safe.

Also, by avoiding kernel pre-emption, the kernel becomes more robust and simpler, since a lot of complicated code can be saved this way.

By the way, there are tools to monitor the scheduler latency to let the interested hacker catch potential code sections that need conditional schedules.


Implications For Linux

Linux, due to its GPL nature, lets us do things faster than others. For example: in some popular proprietary operating systems, such as Solaris 7, a lot of code sections have to be packaged within spin_lock() and spin_unlock() calls to make the same code work well on both UP and SMP systems. While these commercial OS vendors tout this as a clear advantage for heavy SMP systems, these locks really bog down UP systems and simple SMP machines. This is because the same binary-only driver must work on both SMP and UP kernels. A spin_unlock is one locked asm instruction:

	#define spin_unlock_string \
	"lock ; btrl $0,%0"

Calling such a clear-bit instruction through a function pointer is complete overkill.

Linux, on the other hand, has inline code sections for UP and SMP systems, adapted at compile time to the machine the kernel is built for. Therefore, if a uniprocessor system is hosting the kernel, no time is lost locking a code section that doesn't need to be locked at all.

Last, but not least, the goodness() function makes the SMP scheduler in Linux very clever. Having a clever SMP scheduler is critical for performance. If the scheduler is not SMP-aware, OS theory teaches, then the performance on an SMP machine can be even worse than on a UP machine (that's what was happening in 2.2.x without the new heuristics, but this has now been fixed).

That's it for this month. I hope you gained some deeper knowledge of how Linux schedules tasks on UP and SMP systems. There are obviously still tons of things to know, but for a general overview this should be enough. For a really thorough understanding of the workings of the Linux kernel, wait for my book Linux Kernel Internals from McGraw-Hill, to be published in early 2000.

Moshe Bar is an Israeli system administrator and OS researcher, who started learning Unix on a PDP-11 with AT&T Unix Release 6 back in 1981. He holds an M.Sc in computer science. Visit Moshe's website at



Copyright 1998 CMP Media Inc.
