System Calls
Code in the Linux kernel can be executed in two basic ways. One is to be called by an interrupt, and the other is to be called from a user program (that's my required “white lie” for this column). User programs call code in the kernel through a system call, which is essentially an unusual type of function call.
Of course, when user code calls privileged kernel code, the kernel has to very carefully check the validity of its arguments in order to avoid accidentally doing harm of any sort. If the code is not safe for anyone but the superuser to execute, there are routines for checking that, too.
Creating a system call is more difficult than creating a normal C language function, but not too difficult. There is certainly more to it than declaring a function in a header file—and for system calls, the only change that is needed to a header file is not a function declaration.
The first thing that you need to do is either modify an existing file in the kernel, or create a new file to be compiled. If you create a new file, we will assume that you are able to add it to the appropriate Makefile and use the proper #include statements for the code you are writing. You will want to make sure that <linux/errno.h> is included, because system calls need to be able to return error codes, and those error codes are all defined in errno.h.
You will need to create a function called sys_name, where name is the name of the system call you are creating. The function must have the return specification asmlinkage int, and it may have any number of arguments between 0 and 5, inclusive. The arguments must all be the same size as a long; they may not be structures. (Or, at least, not structures larger than a long. It would not be wise to make structures the same size as a long because integer arithmetic is done on them. What is a “signed” structure? If you don't want to think about that question, do not use small structures. In truth, don't use them at all.)
The function will return errors as -ENAME. Negative numbers are treated as error values on return (we will see how later), and zero and positive numbers are considered normal return values. This means that on systems with 32-bit long values, only 31 bits are available for passing back return values. On 64-bit systems like Linux/Alpha, only 63 bits are available. This makes it difficult to pass addresses in the high half of the range back to user programs.
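To make the high-half problem concrete, here is a small user-space sketch (it is not kernel code). It models a 32-bit return value with uint32_t, and the helper names raw_looks_like_error and raw_from_error are made up for this illustration.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Model of a 32-bit system call return value. The caller reads it
 * as a signed long, so any value with the top bit set is taken to
 * be an error code. (Hypothetical helper, for illustration only.)
 */
static int raw_looks_like_error(uint32_t raw)
{
    return (raw & 0x80000000u) != 0;
}

/* The bit pattern that a negative error code such as -EFAULT produces. */
static uint32_t raw_from_error(int err)
{
    return (uint32_t)(-(int32_t)err);
}
```

An address such as 0xC0000000 has its top bit set, so a caller cannot tell it apart from a negative error code, while a low address such as 0x08048000 passes through untouched.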
There are two ways around this. One is to make one of the function's arguments be the address of a user-space variable in which to place the return value. The other is to find some other way of returning an error and making a special way of handling the return value. The first way is, to the best of my knowledge, always preferable, so I will not explain the second way.
Before reading or writing any area in a user program from the kernel, you must call the verify_area() function. On the 386 this is absolutely essential to system stability, because the 386 does not honor memory protection when it is in “supervisor” mode, which is the mode the kernel runs in. This means, for instance, that the CPU will happily write to read-only user-space memory from the kernel. On a 486 or Pentium the check is less important for kernel stability, although it still helps detect errors much more cleanly and avoids having processes die in kernel mode.
The verify_area() function takes three arguments. The first is one of VERIFY_READ or VERIFY_WRITE. The second is the address in the current user program that is to be verified. The third is the length of the memory area you wish to read or write. It returns 0 if the memory area is valid, and -EFAULT if it is not. A common idiom is something like this:
int error;

error = verify_area(VERIFY_WRITE, buf, len);
if (error)
    return error;
...
Please note that verify_area only verifies addresses in user memory space, not kernel memory space. Memory in kernel space is never swapped out, and is always readable and writable. On the i86 family, the fs segment register is used in the kernel to select the user-space memory of the current process. Other architectures do this differently. This functionality is abstracted out into a few useful functions, explained below.
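The kernel's verify_area() cannot be run outside the kernel, so here is a user-space sketch of the same kind of range check. Everything in it is an assumption made for illustration: the demo_ names, the argument types, and the address ranges are invented, and the real function consults the process's memory map instead of two fixed ranges.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_VERIFY_READ  0
#define DEMO_VERIFY_WRITE 1

/* Hypothetical user address space for this sketch. */
#define DEMO_USER_START 0x1000u   /* first readable and writable byte */
#define DEMO_RW_END     0x2000u   /* end of the writable part (exclusive) */
#define DEMO_USER_END   0x3000u   /* end of the readable part (exclusive) */

/* True if [addr, addr + len) lies inside [start, end), without overflow. */
static int demo_in_range(uintptr_t addr, size_t len,
                         uintptr_t start, uintptr_t end)
{
    return addr >= start && addr <= end && len <= end - addr;
}

/* Returns 0 if the area may be accessed as requested, -EFAULT if not. */
static int demo_verify_area(int type, uintptr_t addr, size_t len)
{
    uintptr_t end = (type == DEMO_VERIFY_WRITE) ? DEMO_RW_END
                                                : DEMO_USER_END;

    return demo_in_range(addr, len, DEMO_USER_START, end) ? 0 : -EFAULT;
}
```

Note that the length matters as much as the address: a 16-byte read that starts 8 bytes before the end of the readable range fails, even though its first byte is valid.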
Your work when writing your system call will be much easier if you do as much testing as possible before committing any resources to the task at hand. As a general rule, tests are done in this order:
1. Run all necessary verify_area tests.
2. Do (almost) all other tests in an appropriate order, including normal permission testing.
3. Do suser() or fsuser() tests if appropriate. These should only be called after other tests have succeeded, because BSD-style root-privilege accounting may be added to the kernel at some point. See the comments in include/linux/kernel.h.
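The ordering above can be sketched in user space as follows. This is only an illustration: demo_sys_setlimit is a hypothetical system call, and demo_verify_area and demo_suser are simplified stand-ins for the kernel routines (the stand-in for verify_area only checks for a null pointer). The call counter plays the part of the privilege accounting mentioned above.

```c
#include <errno.h>
#include <stddef.h>

static int demo_suser_calls;   /* how often the privilege test ran */
static int demo_is_root;       /* set by whoever drives the demo */

/* Stand-in for suser(): records that privilege was asked for. */
static int demo_suser(void)
{
    demo_suser_calls++;
    return demo_is_root;
}

/* Stand-in for verify_area(): only rejects a null pointer. */
static int demo_verify_area(const void *addr, size_t len)
{
    (void)len;
    return addr ? 0 : -EFAULT;
}

/* Hypothetical system call that stores a privileged setting. */
static int demo_sys_setlimit(long *user_out, long value)
{
    int error;

    /* 1. Verify the user memory first. */
    error = demo_verify_area(user_out, sizeof(*user_out));
    if (error)
        return error;

    /* 2. Then all other argument tests. */
    if (value < 0 || value > 100)
        return -EINVAL;

    /* 3. Ask for superuser status last of all. */
    if (!demo_suser())
        return -EPERM;

    *user_out = value;   /* real kernel code would use put_user() here */
    return 0;
}
```

With this ordering, a call that fails on a bad pointer or a bad value never reaches demo_suser(), so it is never counted as a use of root privilege.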
The suser() function is used to determine if the process has root permissions for most activities. However, the fsuser() function must be used for all filesystem-related permissions. This difference allows servers to assume the file permissions of a user without “becoming” the user, even briefly. This is important because if the server exchanges uids such that it “becomes” the user for even a moment, the user can disturb the process in various ways (by sending it signals, for example), potentially breaching security. By changing only its fsuid and fsgid instead, with the setfsuid() and setfsgid() system calls, the server avoids this security nightmare. For this to work, all kernel filesystem permissions testing must use the fsuser() function to test for superuser status, and must look at current->fsuid and current->fsgid for normal permissions on filesystem objects. (For more details on the current pointer, see the definition of task_struct in include/linux/sched.h.)
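Here is a simplified user-space sketch of a filesystem permission check that looks only at fsuid and fsgid. The structures and demo_ names are hypothetical and much smaller than the real task_struct and inode, and supplementary groups are ignored; the point is only that the process's uid never enters the decision.

```c
#include <errno.h>

/* Hypothetical, cut-down versions of the kernel structures. */
struct demo_task  { int uid, euid, fsuid, fsgid; };
struct demo_inode { int uid, gid; unsigned mode; };

#define DEMO_MAY_WRITE 2   /* the 'w' bit of an rwx triplet */

/* Stand-in for fsuser(): superuser as far as the filesystem goes. */
static int demo_fsuser(const struct demo_task *t)
{
    return t->fsuid == 0;
}

/* Returns 0 if access is allowed, -EACCES if it is not. */
static int demo_permission(const struct demo_task *t,
                           const struct demo_inode *i, unsigned mask)
{
    unsigned mode = i->mode;

    if (t->fsuid == i->uid)
        mode >>= 6;            /* use the owner bits */
    else if (t->fsgid == i->gid)
        mode >>= 3;            /* use the group bits */

    if ((mode & mask & 7) == mask)
        return 0;
    if (demo_fsuser(t))        /* superuser test comes last */
        return 0;
    return -EACCES;
}
```

A server running with uid 0 but fsuid 1000 can write files owned by user 1000, and is refused access to another user's private file, just as user 1000 would be.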
A good example of a program that needs this ability is the nfs server. Early versions of the nfs server were not able to use this functionality (because it didn't yet exist), and there were several security holes. The most common nuisance was users noticing that they could kill the server.
After you check permissions and any other possible error conditions, you probably want to actually get something done. Unless you simply want to return a value that can fit in a 31-bit (or 63-bit for Linux/Alpha) return value, you will need to write to the user memory that you checked with the verify_area function at the beginning of the function. You can't just use the pointer to user-space memory as a normal pointer. Instead, you have to use a set of special functions to access it. And if you want to read any user-space memory in order to do your system call, you will need to use a similar set of functions to do so.
In older versions of Linux (through 1.2.x), you had to specify what kind of memory access you were making. There were 6 functions for single memory access: get_fs_byte, get_fs_word, get_fs_long, put_fs_byte, put_fs_word, and put_fs_long. These names (and names with the fs replaced with user) are still supported in newer kernels, but starting with Linux 1.3, they are deprecated. The get_user and put_user functions are to be used instead. They are easier to read and for the most part easier to use, but because they depend on the type of the pointer being passed to them, they are not tolerant of sloppy pointer use. (This is probably a good thing, since Linux now runs both on little- and big-endian computers, and big-endian computers are not tolerant of sloppy pointer use either.)
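To show what “depends on the type of the pointer” means in practice, here is a user-space illustration of a macro that picks the store size from sizeof(*(ptr)). It is not the kernel's definition of put_user; the names demo_put_user and demo_put_sized are invented for this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store the low 'size' bytes' worth of 'value' at 'ptr'. */
static void demo_put_sized(unsigned long value, void *ptr, size_t size)
{
    switch (size) {
    case 1: { uint8_t  v = (uint8_t)value;  memcpy(ptr, &v, sizeof(v)); break; }
    case 2: { uint16_t v = (uint16_t)value; memcpy(ptr, &v, sizeof(v)); break; }
    case 4: { uint32_t v = (uint32_t)value; memcpy(ptr, &v, sizeof(v)); break; }
    case 8: { uint64_t v = (uint64_t)value; memcpy(ptr, &v, sizeof(v)); break; }
    default: break;   /* unsupported size: store nothing */
    }
}

/* The pointer's type, not the value's, decides how much is written. */
#define demo_put_user(x, ptr) \
    demo_put_sized((unsigned long)(x), (ptr), sizeof(*(ptr)))
```

Passing a pointer of the wrong type silently changes how many bytes are stored. For example, storing 0x12345678 through a 16-bit pointer keeps only 0x5678, which is exactly the kind of sloppiness these functions will not forgive.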
The memory block access routines have stayed the same since the earliest versions, even though their names still contain the letters “fs”; memcpy_tofs is used to copy a block of memory to user space, and memcpy_fromfs is used to copy a block of user memory to memory in kernel space.
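The usual pattern for handing a block of data back to the caller can be sketched like this. It is a user-space stand-in, not kernel code: demo_sys_getinfo and struct demo_info are hypothetical, demo_verify_write only checks for a null pointer, and demo_memcpy_tofs is plain memcpy() standing in for memcpy_tofs().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result structure returned to the user program. */
struct demo_info {
    long uptime;
    long load;
};

/* Stand-in for verify_area(VERIFY_WRITE, ...). */
static int demo_verify_write(void *addr, size_t len)
{
    (void)len;
    return addr ? 0 : -EFAULT;
}

/* Stand-in for memcpy_tofs(to, from, n). */
static void demo_memcpy_tofs(void *to, const void *from, size_t n)
{
    memcpy(to, from, n);
}

/* Fill in a kernel-side copy, then copy the whole block out. */
static int demo_sys_getinfo(struct demo_info *user_buf)
{
    struct demo_info kinfo;
    int error;

    error = demo_verify_write(user_buf, sizeof(*user_buf));
    if (error)
        return error;

    kinfo.uptime = 12345;   /* made-up values for the demo */
    kinfo.load = 7;

    demo_memcpy_tofs(user_buf, &kinfo, sizeof(kinfo));
    return 0;
}
```

Building the result in a local structure and copying it out in one step means the user buffer is written only after every check has passed.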
All of the memory access routines are defined in include/asm/segment.h, even on architectures without segmentation. On all of the non-Intel architectures, the segment-handling parts of these functions are essentially null operations, since those architectures do not implement segmentation.
Up to this point, you have simply implemented a new function in the kernel. Simply prepending the name with sys_ will not make it possible to call the function from user code.
You need to make two additions within the kernel. The first is in include/linux/unistd.h, right near the end. You need to look for the last line that starts with #define __NR and add your own:
#define __NR_name ###
where ### is the number one greater than the previous last system call number. In version 1.2.9, that would be 141.
The second change will have to be made in multiple files, one for each architecture that Linux runs on. Each file arch/*/kernel/entry.S will need an additional entry in its system call table. The system call table is kept at the end of the file, and you will simply need to add an entry at the end of the table before the .space line and change the .space formula at the very end to reflect the new number of system calls.
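The real system call table lives in assembly, but its job can be sketched in C as an array of function pointers indexed by system call number. Everything here is hypothetical (the demo_ names, the table size, and the two sample calls); in this sketch, empty slots and out-of-range numbers return -ENOSYS.

```c
#include <errno.h>

#define DEMO_NR_SYSCALLS 4   /* hypothetical table size */
#define DEMO_NR_getseven 0   /* numbers play the role of __NR_name */
#define DEMO_NR_double   1

typedef long (*demo_syscall_fn)(long arg);

static long demo_sys_getseven(long arg)
{
    (void)arg;
    return 7;
}

static long demo_sys_double(long arg)
{
    if (arg < 0)
        return -EINVAL;
    return arg * 2;
}

/* Slots 2 and 3 are left empty, like the padding after the last entry. */
static const demo_syscall_fn demo_sys_call_table[DEMO_NR_SYSCALLS] = {
    demo_sys_getseven,   /* 0 */
    demo_sys_double,     /* 1 */
};

/* Look the number up in the table and call the matching function. */
static long demo_dispatch(long nr, long arg)
{
    if (nr < 0 || nr >= DEMO_NR_SYSCALLS || !demo_sys_call_table[nr])
        return -ENOSYS;
    return demo_sys_call_table[nr](arg);
}
```

This is why the number in the header file and the position in the table must agree: the number is nothing more than an index.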
Now you can call your new function from user code, but how? You can't simply declare extern int sys_name(int arg); and link. Instead, you have to #include <unistd.h> and use the appropriate _syscallX() macro, where X is the number of arguments the system call takes. The _syscallX() macros are actually defined in include/asm/unistd.h, which gets included by <unistd.h> automatically.
If your system call is declared as
asmlinkage int sys_name(void);
the _syscall0() invocation is quite easy:
_syscall0(int, name)
(notice the leading underscore). This gets converted by the C preprocessor into
int name(void)
{
    long __res;
    __asm__ volatile ("int $0x80"
        : "=a" (__res)
        : "0" (__NR_name));
    if (__res >= 0)
        return (int) __res;
    errno = -__res;
    return -1;
}
on Linux/i86. Because it uses assembly, it will be different on other architectures. Fortunately, it doesn't really matter. The important point is that it creates a function called name which generates an interrupt (remember the “white lie” about interrupts? System calls are interrupts, too) which calls the system call. It then returns the result if the answer is zero or positive; if the answer is negative (has the high-order bit set), it returns -1 and sets errno to the positive error number.
If your function has two arguments:
asmlinkage int sys_name(int num, struct foo *bar);
you would instead use this:
_syscall2(int, name, int, num, struct foo *, bar)
which would expand to:
int name(int num, struct foo * bar)
{
    long __res;
    __asm__ volatile ("int $0x80"
        : "=a" (__res)
        : "0" (__NR_name),
          "b" ((long)(num)), "c" ((long)(bar)));
    if (__res >= 0)
        return (int) __res;
    errno = -__res;
    return -1;
}
Notice the unusual way of specifying the arguments to the macro, where the return type and the name of the function are followed by separate arguments for the type and name of each of the system call's arguments. Figuring out how to specify system calls with 1, 3, 4, or 5 arguments is left as an exercise for the reader.
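As a hint for the one-argument case, here is a portable sketch of how such a macro can generate a wrapper function. It is not the kernel's _syscall1(): the trap instruction is replaced by an ordinary function, demo_trap, and all of the DEMO_ and demo_ names are invented for this illustration.

```c
#include <errno.h>

#define DEMO_NR_half 0   /* plays the role of __NR_name */

/* Stand-in for the "int $0x80" trap: picks a call by its number. */
static long demo_trap(long nr, long arg1)
{
    switch (nr) {
    case DEMO_NR_half:
        return (arg1 < 0) ? -EINVAL : arg1 / 2;
    default:
        return -ENOSYS;
    }
}

/* Return type and name first, then the type and name of the argument. */
#define DEMO_SYSCALL1(type, name, atype, a)                \
type name(atype a)                                         \
{                                                          \
    long res = demo_trap(DEMO_NR_##name, (long)(a));       \
    if (res >= 0)                                          \
        return (type) res;                                 \
    errno = (int) -res;                                    \
    return -1;                                             \
}

/* Expands to: int half(int n) { ... } */
DEMO_SYSCALL1(int, half, int, n)
```

After the expansion, half(10) returns 5, while half(-3) returns -1 and leaves EINVAL in errno, which is the same convention the real wrappers follow.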
For the curious: there is one other way that system calls may be called on Linux/i86. iBCS2-based programs call system calls with an lcall 7,0 instruction instead of an int $0x80 instruction. The lcall instruction takes slightly longer than the int instruction, which is why int $0x80 is the default system call mechanism on Linux, but both are supported. The lcall instruction isn't exactly an interrupt, although it acts much like one; technically it is a “call gate”. So my “white lie” isn't really a lie after all.
Michael K. Johnson is the Editor of Linux Journal, and pretends to be a Linux guru in his spare time. He can be reached via e-mail as johnsonm@ssc.com.