What Is Multi-Threading?

by Martin McCarthy

Perhaps one of the reasons you use Linux is because it is a multi-tasking operating system. In which case you are probably well aware of the utility of being able to do more than one thing at a time—perhaps compiling the utility which will bring you fame and fortune whilst editing a letter of resignation to your boss. So you might recognise the utility of an individual process that can do more than one thing at once.

When might this be a good idea? One application for multi-threading is a program which relies on a large number of very similar and independent mathematical operations—the oft-quoted example of this is matrix multiplication. We look at a simple example of a multi-threaded matrix multiplier later in this article.

Another type of application which can benefit from multi-threading is a server application. Whenever a client attempts to connect to the server, a new thread can be created to look after that client whilst the “watcher” thread continues to wait for more clients to connect.

“But wait!” I hear you cry. “Lots of server applications already work like that, but they simply fork another process rather than starting another thread.”

“You're right...” I reply.

“Don't interrupt,” you continue. “This sounds like another way of doing the same thing, but for no good reason other than to say it's a `multi-threaded application' and so you can bump up the price to those who like to be blinded by science.”

At this point I decide to ignore any more of your interjections and just explain.

Yes, creating a new thread is very similar to forking a new process, but there are differences. When a new process is forked, it shares relatively little data with the parent process which created it; when a new thread is created, it shares much more information (such as all the global variables and static local variables, the open files, and the process ID). The overhead of creating separate copies of everything makes the creation of a new process slower than the creation of a new thread. And the time it takes to pass control from one process to another (a context switch) is longer than the time required to pass control from one thread to another. Also, since the threads share their data space, passing information from one thread to another is much more straightforward than passing information from one process to another—while this does have its own problems, it need not be difficult if a little care is taken. A simple example of a multi-threaded server is given below.

A third type of application which could benefit from multi-threading is a Doom-like game where each of the computer-controlled baddies has its own “intelligence” and can act independently. The main game-play part of the program could be in its own thread (or multiple threads), there could be another thread handling each of the characters in the game who are controlled by real people, and yet more threads for each of the characters controlled by the computer.

POSIX Threads

There is a POSIX standard for multi-threading: 1003.1c. If you want to write portable programs (which doesn't seem like a bad idea to me), it would be wise to stick to this standard rather than use non-POSIX conforming libraries or using the raw system calls which Linus so kindly provided in the Linux kernel (about which I say just a little more later).

The examples in this article use POSIX multi-threading functions called from C. However, the concepts in some non-POSIX libraries and systems are often very similar. Solaris threads are easily understood by someone familiar with POSIX threads, and while Java threads and the multi-threading in the Win32 and OS/2 APIs are a little different, there should be little in these which would leave you quaking in your boots.

Linux Threading: Linus Gave Us clone()

Multi-threading capability is included in the version 2.0 Linux kernel (and many version 1.3 kernels). The clone() system call creates a new context of execution, or “COE” (to use Linus' term), which can be a new process, a new thread, or one of a range of possibilities which doesn't fit into either of these categories. The fork() system call is actually a call to clone() with particular values as parameters, and the pthread\_create() function call could be a call to clone() with a different set of values as parameters. clone() is used by some implementations of multi-threading libraries to provide kernel threads, but further discussion of clone() is beyond the scope of this article.

User-Level Threads Versus Kernel-Level Threads

Threading libraries can be implemented using kernel features for creation and scheduling (such as clone()), or entirely in user space where code in the library handles the creation of threads and the scheduling between threads within a process. In general, user-level thread libraries will be faster than kernel-thread libraries. On the other hand, kernel-thread libraries are likely to make better use of kernel facilities—if the kernel makes better use of multiple processors in the next release, so might all your multi-threaded programs. However, the programmer shouldn't need to worry about whether the library is implemented as a user-level library, a kernel-level library, or a combination of the two; the operation of the program should be essentially the same.

Available Libraries

There are a many POSIX threads libraries available for Linux, and some make no attempt to be POSIX-compliant. Also, some of the POSIX threads libraries are compliant with an early draft of the standard or just part of the standard. Anything which isn't compliant with the final POSIX standard may show different behaviour in some circumstances or have slightly different function calls. That is not to say these aren't good multi-threading libraries, but you should be aware of what you're using.

All examples in this article were tested with Xavier Leroy's release 0.3 (ALPHA) of LinuxThreads, covered by the GNU Library General Public License. This library is still being developed and aims to be fully POSIX-compliant sometime in the future. It is clone()-based, and so it has the advantages and disadvantages of a kernel-level implementation.

Don't Use Global Data

As previously noted, multi-threaded programming isn't really very difficult. However, there are ways to make life difficult for yourself.

Guides to good programming often say to avoid using global data. Maybe you've never seen the point of this—particularly if you're a careful programmer and you know to look after your global data. With multi-threading, there is another reason to avoid using global data. Consider the trivial code fragments in Listing 2.

When this function is called you would expect to see the numbers “0 1 2 3” printed. And indeed, this is what would happen. If you were to call this function from two different threads you might expect to see two sets of the numbers printed, one from each thread. If the two threads were to call this function at about the same moment you might expect to see the two sets of numbers interlaced, perhaps something like:

0 1 0 1 2 2 3
3

Here, thread A calls fn(), which starts to print the numbers. When thread A is only as far as “1”, thread B starts up and calls fn(), which manages to get as far as “2” before thread A gets control again. The call to fn() completes for thread A and thread B gets control again and finishes off.

However, this is not what would happen. Let me stress this is NOT what would happen. Instead, the output might be:

0 1 0 1 2 3
4 5 6 7 8 9 10 11 12 13 14...

What is happening is thread A calls fn() which prints the numbers up to “1”. Then thread B starts up and calls fn(). As the for loop is entered, the value of i is set to 0. Remember i is global, so it is shared by both threads—when thread B changes the value of i, thread A will see the same change. The counter reaches the value “2”, at which point thread A is given control again. When it attempts to increment i, it now reaches “3” and then “4”, at which point thread A does not print the value of i and is ready to exit the function. When thread B gets control again, it prints the current value of i (which is “4”, courtesy of thread A), increments it to “5”, then does the comparison i!=4. The comparison doesn't fail (and will not fail for a long time—not until i has reached the maximum int value and looped around again), so the loop continues printing numbers.

This sounds horrible. It is horrible! And it can be avoided by not using global data. If i was declared local to fn(), each thread would have its own copy and you would get the expected output.

How To Use Global Data

Sometimes you may find you need to use global data, either because you are adding multi-threading features to an existing program which cannot be rewritten, or because you believe global data is the correct thing to use. Notice, in this context, I'm saying “global data” when what I really mean is shared data--the data may not be global to the program, but it is global to (or shared between) one or more threads. In a conventional, single-threaded program static local variables can be a useful convenience; however, these are effectively global variables and if you don't take care with them, they will cause you problems with bugs that can be very hard to track down. So what do you do with global data? All you have to do is make sure no other thread can change the value of your global data between the time you set a value to that data and the time you use the value. POSIX threads provides a number of ways of performing this synchronisation between threads. The simplest are mutual exclusion locks, or mutexes. A thread locks a mutex at the start of a section of code, and unlocks it at the end of that section of code. A thread which holds a lock on a mutex is said to own that mutex. While one thread owns that mutex, no other thread will be able to execute that same section of code.

Consider our previous example with the runaway loop. We can rewrite this so that it still uses global data, but with mutexes to prevent two threads modifying the same loop counter at the same time.

In Listing 3, loopLock is a mutex variable. Mutex variables should always be initialised before use, either by a call to the initialisation function pthread\_mutex\_init(), which can be used to set values for attributes which are particular to that mutex variable, or by using the standard macro PTHREAD\_MUTEX\_INITIALIZER which uses default values. These attributes can affect the priority and scheduling of the thread which owns the mutex, or dictate which threads can operate on the mutex (either those within the same process or any threads which can access the memory where the mutex resides). We make use of only default mutex attributes here.

When a thread makes a call to pthread\_mutex\_lock(), the mutex variable it tries to own will be either free or owned by another thread. (If a thread makes an attempt to lock a mutex which it already owns, the result is undefined. Don't do it!) If the mutex is free, the thread will take ownership of that mutex and pthread\_mutex\_lock() will return. If another thread owns that mutex, the call will wait until the mutex comes free and can claim ownership for its own calling thread. The thread then owns that mutex until it makes a call to pthread\_mutex\_unlock(). (Again, if a thread makes an attempt to unlock a mutex which it doesn't already own, the result is undefined.)

As a result, if more than one thread tries to execute the code in the function at the same time, then only the first will own the lock and enter the loop. Any other threads will sit at the pthread\_mutex\_lock() call until the first thread has exited the loop and unlocked the mutex. Then ownership will be given to just one of the waiting threads which will be able to enter the loop safely.

So if there are two threads which call this function at the same time, the output will be:

0 1 2 3
0 1 2 3

As expected, each pass through the loop completes before the next starts and the output is not interlaced.

Remember... Don't Use Global Data

Of course, this problem has a much more elegant solution: avoid using the global data in the first place. In Listing 4, the loop counter is local to the function and there is no conflict; each call to the function has its own version of i, and so each thread which calls that function will have its own version of i.

However, two (or more) different threads can be inside the loop at the same time and so the output can be interlaced, as we originally expected. For example:

0 1 0 1 2 2 3
3

If it is not appropriate for your application to have more than one thread within a certain piece of code at the same time (this piece of code is commonly referred to as a critical section), you may still wish to use mutexes, even though you do not have shared data.

Remember, though, critical sections can negate some of the advantages of multi-threading by forcing threads to wait for others. It is a good idea to keep any necessary critical sections as short as possible.

Thread-Safe libc

What the programmer may need to be aware of is whether the other libraries being used are thread-safe or not. Otherwise, problems such as those described above may (or, more likely, will) occur. A good example is with every C and C++ programmer's friend, errno. When a system call fails for some reason, it is common for the function called to return a generic failure value (such as NULL or -1) and an indication of why it failed in errno--which is, of course, a global variable. Just think of the confusion if one thread calls putenv(), which fails and sets errno to ENOMEM. Then there is a context switch and another thread calls open() and fails, setting errno to ELOOP. Then there is another context switch; the return value from putenv() is checked, the failure is spotted, and confusion breaks out because it looks like putenv() failed with ELOOP, which is not possible.

Or consider the standard function strtok(). This breaks a string into tokens and works in two ways. On the first call, you pass in a string which is to be tokenised. It stores this string in a static area. On subsequent calls you pass in a NULL string and strtok() works its way through the already stored string. If two threads call strtok() concurrently, the first one will have its stored string overwritten by the string that the second tells strtok() to store.

Fortunately, thread-safe versions of errno and thread-safe versions of standard functions within libraries are now available. Take a look at /usr/include/errno.h, for example. If you have a recent set of include files you may find the following bit of code:

#if defined(_POSIX_THREAD_SAFE_FUNCTIONS) || \
defined(_REENTRANT)
extern int*     __errno_location  __P((void));
#define errno   (*__errno_location ())
#else
extern int errno;
#endif

errno is redefined to be an int returned by a function, rather than the usual global variable, when either \_POSIX\_THREAD\_SAFE\_FUNCTIONS or \_REENTRANT has been defined.

When compiling code for multi-threaded programs, always use:

#define _REENTRANT

in your code (or use the compiler option -D\_REENTRANT to make use of this thread-safe macro for errno).

Similarly, recent versions of libc contain thread-safe versions of standard unsafe functions. So there is now the usual unsafe strtok(), but also a thread-safe function strtok\_r()--see your man pages for the details.

Creating a Thread

All this, and we still haven't created a thread! Okay, let's put that right. Listing 5 is a very trivial, and complete, example of a multi-threaded program.

When I compile (with cc helloworld.c -o helloworld -lpthread) and run this program, I get the output: Hello world world Hello Hello world world Hello Hello world world Hello Hello world world Hello Hello world world Hello

Let's take a look at this program. There is a mutex variable printfLock; this may not be strictly necessary, but it is included just in case we are not using a thread-safe version of libc—just to be safe, mutex locking is put around the call to printf().

There are two new function calls in main(): pthread\_create() and pthread\_join(). The first of these, not surprisingly, creates a new thread. The first parameter points to a variable of type pthread\_t; this can be used to identify the newly created thread, not unlike the file-handle returned from the fopen() call. The second parameter allows you to set various attributes for the new thread. If the value is NULL, default attributes are used, and this is fine for now. The third parameter is the function which the thread is to execute; this is always a function which takes a void* as a parameter and has a void* return value. The fourth parameter is the argument which is to be passed as a parameter to the thread function defined in parameter 3.

As soon as pthread\_create() has successfully created a new thread, the function which is passed as the third parameter will be running.

In this example, the thread function takes a string as an argument and prints that string ten times before exiting. As you can see from the output, both threads execute at the same time and their output is interleaved.

The pthread\_join() function waits for the specified thread (identified using the pthread\_t variable returned in parameter one of the thread creation) to exit. A thread exits when the thread function returns, or when the thread makes an explicit call to the function pthread\_exit().

Matrix Multiplication

Listing 6 shows a simple program which will perform matrix multiplication on a pair of fixed-size square matrices, with a separate thread performing the calculations for each column of the resulting matrix.

As a quick refresher, if it has been a while since you did matrix mathematics, the result of multiplying two matrices is given by the program in Listing 1. Running the program produces the output seen in Figure 1.

Figure 1. Matrix Manipulation Results
|  9  8  7  6  |   |  1   2   3   4  |   |  150  180  210  240 |
|  6  6  4  3  |   |  4   5   6   7  |   |  84  102  120  138  |
|  3  2  1  0  | x |  7   8   9  10  | = |  18   24   30   36  |
|  0 -1 -2 -3  |   | 10  11  12  13  |   |  -48 -54  -60  -66  |

There are a number of things worth noting about this program.

First of all, you will probably get a speed increase using a multi-threaded program like this only if you have a system with more than one processor. Why? Because with a single-processor system the one processor must perform all the calculations and spend extra time scheduling between the different threads. With a multi-processor system, the scheduler may be able to run the same process over multiple CPUs by simultaneously running a different thread on each.

Secondly, it uses global data. And after I went to all that trouble to say it was a dangerous thing to do. And I don't even use mutexes. You might well be offended! However, I wanted to illustrate a case when it is safe to use global data. (Whether it is a good idea to use global data is another issue, and that line of philosophical debate is not the purpose of this article—I am simply showing that global data can be used.) Why is it safe here? Well, the two matrices mat1 and mat2 are read-only; the result matrix is written to, but each of the threads only writes to a specific part of the array which is used to store the resulting column which that thread is working on. There is never any conflict.

A Simple Server

Listing 7 shows one way in which a simple multi-threaded server could be constructed.

Please note, as in all the examples, error checking has been kept to a minimum in order to keep the code simple and readable. “Real” code needs to be much more robust than this. Always check the man pages for possible return-codes and error-values.

The server works as follows: it listens on port 12345 to service requests from clients, and it also performs an unrelated task without regard to client activity (that is, it prints “server is running” onto standard output every second).

In thread terms, it works as follows: the main thread creates a thread to listen for clients connecting to the server port, then it goes into an infinite loop, printing the “running” message, then sleeping for one second. The “client listening” thread listens to a socket on port 12345. Whenever a client connects to that socket (e.g., someone types “telnet hostname 12345”) the connection is accepted and another thread is started to handle just that one client. The “client listening” thread then continues to listen for more clients connecting. The “client handling” thread prints a useful message to the client (“type `X' to quit”), then waits for the client to send a message back. If the message begins with “X”, the socket is closed and the thread exits; otherwise, the message is printed again and the thread waits once more for data from the client.

In a server written in this way, there is a thread running for each client that is simultaneously connected. This gives the advantages of a server which forks a new process for each client that connects, but without the extra overheads involved in forking a new process or in switching between processes, and with the ability to communicate easily within threads.

Summary

Writing multi-threaded programs need not be difficult (certainly not as difficult as some people would have you believe), although it is certainly possible to make life difficult for yourself. It is more often useful on multi-processor systems than uni-processor systems. But there are many cases where performance can be improved on uni-processor systems by multi-threading, such as when handling requests from multiple simultaneous connections in a client/server application, when overlapping multiple I/O requests, or when writing programs which make use of graphical user interfaces.

Most of the topics discussed here should be applicable to non-Linux systems and (somewhat more loosely) to non-POSIX systems.

Martin has been a software-engineer for 10 years and has been using Linux since version 0.12. But he has more fun playing in a couple of bands and decorating his flat with Celtic knot-work. He can be reached at marty@ehabitat.demon.co.uk.

Load Disqus comments