User-Mode Thread Implementation
Introduction
For a long time, processes were the only unit of parallel computation.
There are two problems with this:
- processes are very expensive to create and dispatch, because
each has its own virtual address space
and owns numerous system resources.
- each process operates in its own address space, and cannot
share in-memory resources with parallel processes
(this was before it was possible to map shared segments
into a process' address space).
The answer to these problems was to create threads. A thread ...
- is an independently schedulable unit of execution.
- runs within the address space of a process.
- has access to all of the system resources owned by that process.
- has its own general registers.
- has its own stack (within the owning process' address space).
Threads were added to Unix/Linux as an afterthought. In Unix, a process
is itself scheduled for execution, and can additionally create threads for
more parallelism.
Windows NT is a newer operating system, and it was designed
with threads from the start ... and so the abstractions are cleaner:
- a process is a container for an address space and resources.
- a thread is the unit of scheduled execution.
A Simple Threads Library
When threads were first added to Linux, they were implemented entirely
in a user-mode library, with no assistance from the operating system.
This is not merely historical trivia, but interesting as an examination
of what kinds of problems can and cannot be solved without operating
system assistance.
The basic model is:
- Each time a new thread is created:
  - we allocate memory for a (fixed-size) thread-private stack from the heap.
  - we create a new thread descriptor that contains identification information,
    scheduling information, and a pointer to the stack.
  - we add the new thread to a ready queue.
- When a thread calls yield() or sleep(), we save its
general registers (on its own stack), and then select the next thread
on the ready queue.
- To dispatch a new thread, we simply restore its saved registers
(including the stack pointer), and return from the call that
caused it to yield.
- If a thread calls sleep(), we remove it from the
ready queue. When it is re-awakened, we put it back onto
the ready queue.
- When a thread exits, we free its stack and thread descriptor.
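The model above can be sketched with the ucontext(3) routines that glibc provides. This is only an illustrative sketch, not the historical library: the names uthread_create, uthread_yield, uthread_run, worker, and trace are hypothetical, a fixed table scanned round-robin stands in for the ready queue, sleep() is omitted, and swapcontext() saves the registers in the thread descriptor rather than on the thread's own stack.

```c
#define _XOPEN_SOURCE 700   /* feature-test macro; must precede the includes */
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

#define MAX_THREADS 8
#define STACK_SIZE  (64 * 1024)

/* Thread descriptor: identification, scheduling state, stack, saved context. */
typedef struct {
    int        id;
    int        done;     /* set once the thread's function has returned */
    void     (*fn)(int);
    char      *stack;    /* fixed-size private stack allocated from the heap */
    ucontext_t ctx;      /* saved registers, including the stack pointer */
} uthread;

static uthread    threads[MAX_THREADS];
static int        nthreads;
static int        current = -1;   /* index of the running thread, -1 if none */
static ucontext_t scheduler_ctx;

/* First frame on every thread stack: run the body, then mark the thread done. */
static void trampoline(void) {
    uthread *t = &threads[current];
    t->fn(t->id);
    t->done = 1;         /* returning from here resumes uc_link (the scheduler) */
}

/* Create a thread; returns its id, or -1 on failure. */
int uthread_create(void (*fn)(int)) {
    if (nthreads == MAX_THREADS) return -1;
    uthread *t = &threads[nthreads];
    t->stack = malloc(STACK_SIZE);
    if (t->stack == NULL) return -1;
    if (getcontext(&t->ctx) != 0) { free(t->stack); return -1; }
    t->ctx.uc_stack.ss_sp   = t->stack;
    t->ctx.uc_stack.ss_size = STACK_SIZE;
    t->ctx.uc_link          = &scheduler_ctx;
    t->id = nthreads; t->done = 0; t->fn = fn;
    makecontext(&t->ctx, trampoline, 0);
    return nthreads++;
}

/* Save the caller's registers and switch back to the scheduler. */
void uthread_yield(void) {
    assert(current >= 0);
    swapcontext(&threads[current].ctx, &scheduler_ctx);
}

/* Dispatch runnable threads round-robin until all of them have exited. */
void uthread_run(void) {
    int live = nthreads;
    while (live > 0) {
        for (int i = 0; i < nthreads; i++) {
            if (threads[i].done) continue;
            current = i;
            swapcontext(&scheduler_ctx, &threads[i].ctx);
            if (threads[i].done) {       /* thread exited: free its stack */
                free(threads[i].stack);
                threads[i].stack = NULL;
                live--;
            }
        }
    }
    current = -1;
    nthreads = 0;
}

/* Hypothetical demo body: record which thread ran, then yield, twice. */
int trace[16];
int trace_len;

void worker(int id) {
    for (int i = 0; i < 2; i++) {
        trace[trace_len++] = id;
        uthread_yield();
    }
}
```

Creating two worker threads and calling uthread_run() interleaves them, because each one gives up the processor voluntarily after every step. Nothing here requires a system call beyond the initial heap allocation.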
Eventually people wanted preemptive scheduling to ensure good interactive
response and prevent buggy threads from tying up the application:
- Linux processes can schedule the delivery of (SIGALRM) timer
signals and register a handler for them.
- Before dispatching a thread, we can schedule a SIGALRM
that will interrupt the thread if it runs too long.
- If a thread runs too long, the SIGALRM handler can
yield on behalf of that thread, saving
its state and moving on to the next thread in the ready queue.
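As a rough sketch of the timer mechanism, the code below arms a one-shot interval timer with setitimer(2) and registers a SIGALRM handler with sigaction(2). The names start_time_slice, spin_until_preempted, and ticks are hypothetical, and the handler only records that the time slice expired; a real thread library would save the interrupted thread's state and dispatch the next thread at that point.

```c
#define _XOPEN_SOURCE 700   /* feature-test macro; must precede the includes */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

/* Number of time-slice expirations seen so far. */
volatile sig_atomic_t ticks;

/* SIGALRM handler: in this sketch it only notes that the slice ran out. */
static void on_alarm(int signo) {
    (void)signo;
    ticks++;
}

/* Register the handler and arm a one-shot timer for usec microseconds.
 * usec must be greater than zero (zero would disarm the timer).
 * Returns 0 on success, -1 on failure. */
int start_time_slice(long usec) {
    struct sigaction sa;
    struct itimerval tv;

    assert(usec > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0) return -1;

    memset(&tv, 0, sizeof tv);          /* it_interval = 0: fire only once */
    tv.it_value.tv_sec  = usec / 1000000;
    tv.it_value.tv_usec = usec % 1000000;
    return setitimer(ITIMER_REAL, &tv, NULL);
}

/* Stand-in for a compute-bound thread that never yields on its own:
 * it loops until the timer signal interrupts it. */
long spin_until_preempted(void) {
    long iterations = 0;
    while (ticks == 0) iterations++;
    return iterations;
}
```

Calling start_time_slice(10000) and then spin_until_preempted() returns after roughly ten milliseconds, even though the loop itself never yields.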
But the addition of preemptive scheduling created new problems for
critical sections that required before-or-after, all-or-none serialization:
a thread could now be interrupted in the middle of updating shared data.
Fortunately, Linux processes can temporarily block signals (much as it is
possible to temporarily disable an interrupt) via the sigprocmask(2)
system call.
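A minimal sketch of protecting a critical section this way follows. The helper names with_alarm_blocked, alarm_is_blocked, and bump are hypothetical; only sigprocmask(2) and the sigset routines are real interfaces.

```c
#define _XOPEN_SOURCE 700   /* feature-test macro; must precede the includes */
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Run critical(arg) with SIGALRM blocked, then restore the previous mask.
 * Returns 0 on success, -1 if the signal mask could not be changed. */
int with_alarm_blocked(void (*critical)(void *), void *arg) {
    sigset_t block, saved;

    assert(critical != NULL);
    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0) return -1;

    critical(arg);          /* no preemption signal can be delivered here */

    /* Restoring the saved mask (rather than unblocking) keeps nesting safe. */
    return sigprocmask(SIG_SETMASK, &saved, NULL);
}

/* Report whether SIGALRM is currently blocked (a NULL set means query only). */
int alarm_is_blocked(void) {
    sigset_t cur;
    sigprocmask(SIG_SETMASK, NULL, &cur);
    return sigismember(&cur, SIGALRM);
}

/* Hypothetical critical section: update shared state, and note the mask. */
int counter;
int was_blocked_inside;

void bump(void *arg) {
    (void)arg;
    was_blocked_inside = alarm_is_blocked();
    counter++;
}
```

A timer signal that arrives while the section runs is held pending and delivered once the old mask is restored, so the preemption is delayed rather than lost. Note that every entry and exit costs a system call.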
Kernel-Implemented Threads
There are two fundamental problems with implementing threads in a user-mode
library:
- what happens when a system call blocks
  If a user-mode thread issues a system call that blocks (e.g. open or read),
  the process is blocked until that operation completes. This means that
  when a thread blocks, all threads (within that process) stop executing.
  Since the threads are implemented in user mode, the operating system has
  no knowledge of them, and cannot know that other threads (in that process)
  might still be runnable.
- exploiting multi-processors
  If the CPU has multiple execution cores, the operating system
  can schedule processes on each to run in parallel. But if the
  operating system is not aware that a process consists of
  multiple threads, those threads cannot execute in parallel
  on the available cores.
Both of these problems are solved if threads are implemented by
the operating system rather than by a user-mode thread library.
Performance Implications
If non-preemptive scheduling can be used, user-mode threads operating
with a sleep/yield model are much more efficient than
doing context switches through the operating system.
Today there are lightweight thread implementations that
reap these benefits.
If preemptive scheduling is to be used, the costs of setting
alarms and servicing the signals may well be greater than the
cost of simply allowing the operating system to do the scheduling.
If the threads can run in parallel on a multi-processor, the added
throughput resulting from true parallel execution may be far greater
than the efficiency losses associated with more expensive context
switches through the operating system. Also, the operating system knows
which threads are part of the same process, and may be able to schedule them
to maximize cache-line sharing.
As with preemptive scheduling, disabling and re-enabling signals
for a user-mode mutex or condition variable implementation may
be more expensive than simply using the kernel-mode implementations.
But it may be possible to get the best of both worlds with a user-mode
implementation that uses an atomic instruction to attempt to get a
lock, but calls the operating system if that attempt fails
(somewhat like the futex(7) approach).
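The hybrid idea can be sketched with C11 atomics. The names hybrid_lock, hybrid_trylock, hybrid_lock_acquire, hybrid_unlock, and demo_lock are hypothetical, and sched_yield() is used here only as a portable stand-in for the operating system call; a futex-based lock would instead sleep in the kernel until the holder wakes it.

```c
#define _POSIX_C_SOURCE 200809L   /* feature-test macro; must precede includes */
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} hybrid_lock;

/* Fast path: one atomic compare-and-swap, entirely in user mode.
 * Returns 1 if the lock was acquired, 0 if it was already held. */
int hybrid_trylock(hybrid_lock *l) {
    int expected = 0;
    assert(l != NULL);
    return atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

/* Slow path: only when the fast path fails do we enter the kernel.
 * ASSUMPTION: sched_yield() stands in for a real blocking call. */
void hybrid_lock_acquire(hybrid_lock *l) {
    while (!hybrid_trylock(l)) {
        sched_yield();
    }
}

/* Release the lock so that a later compare-and-swap can succeed. */
void hybrid_unlock(hybrid_lock *l) {
    assert(l != NULL);
    atomic_store(&l->locked, 0);
}

/* Zero-initialized, so it starts out free. */
hybrid_lock demo_lock;
```

In the uncontended case, acquiring and releasing the lock costs two atomic operations and no system calls, which is where the efficiency gain comes from.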