Playing with Binary Formats
One of the roles that kernel modules can accomplish is adding new binary formats to a running system. A “binary format” is basically a data structure responsible for executing program files—the ones marked with execute permission. The code I'm going to introduce runs with version 2.0 of the kernel.
Kernel modules are meant to add new capabilities to a Linux system, device drivers being the best known such “capabilities”. As a matter of fact, the highly modular design of the Linux kernel allows run-time insertion of many features other than device drivers—we saw a few months ago how /proc files and sysctl entry points can be created by modularized code.
One other such loadable feature is the ability to execute a binary format; this includes both executable files and shared libraries. While the mechanism of loading compiled program files and shared libraries is quite elaborate, the average Linux user can easily add loaders that invoke an interpreter for new binary formats. Thus, the user is able to call data files by name and have them “executed”, after invoking chmod +x on the file.
Let's start this discussion by looking at how the exec system call is implemented in Linux. This is an interesting part of the kernel, as the ability to execute programs is at the basis of system operations.
The entry point of exec lives in the architecture-dependent tree of the source files, but all the interesting code is part of fs/exec.c (all pathnames here refer to /usr/src/linux/ or the location of your sources). To check architecture-specific details, locate the function by searching (with grep, for instance) the files matching this pattern:
arch/*/kernel/*.c
Within fs/exec.c the top-level function, do_execve(), is less than fifty lines of code in length. Its role is checking for errors, filling the “binary parameter” structure (struct linux_binprm) and looking for a binary handler. The last step is performed by search_binary_handler(), another function in the same file. The magic of do_execve() is contained in this last function, which is very short. Its job consists of scanning a list of registered binary formats, passing the binprm structure to each of them in turn until one succeeds. If no handler is able to deal with the executable file, the kernel tries to load a new handler via kerneld and scans the list once again. If still no binary format is able to run the executable file, the system call returns the ENOEXEC error code (“Exec format error”).
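The scan just described can be pictured with a small user-space model. This is only a sketch under stated assumptions: struct demo_binprm, struct demo_binfmt, demo_search_handler() and the two toy loaders are hypothetical stand-ins of mine, not the kernel's code, and the kerneld retry is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, stripped-down stand-in for struct linux_binprm. */
struct demo_binprm {
    char buf[128];                  /* first bytes of the file */
};

/* Hypothetical stand-in for struct linux_binfmt: one loader per format. */
struct demo_binfmt {
    struct demo_binfmt *next;
    /* Returns 0 on success, -ENOEXEC when the file is not of this format. */
    int (*load_binary)(struct demo_binprm *bprm);
};

/* Offer the file to each registered format until one accepts it. */
int demo_search_handler(struct demo_binfmt *formats, struct demo_binprm *bprm)
{
    struct demo_binfmt *fmt;

    assert(bprm != NULL);
    for (fmt = formats; fmt != NULL; fmt = fmt->next) {
        int retval;

        if (fmt->load_binary == NULL)
            continue;               /* a format without a loader is skipped */
        retval = fmt->load_binary(bprm);
        if (retval != -ENOEXEC)
            return retval;          /* success, or a real error */
    }
    return -ENOEXEC;                /* "Exec format error" */
}

/* Toy loader: accepts files that begin with "#!". */
int demo_load_script(struct demo_binprm *bprm)
{
    return (bprm->buf[0] == '#' && bprm->buf[1] == '!') ? 0 : -ENOEXEC;
}

/* Toy loader: accepts files that begin with "GIF8". */
int demo_load_gif(struct demo_binprm *bprm)
{
    return strncmp(bprm->buf, "GIF8", 4) == 0 ? 0 : -ENOEXEC;
}
```

Chaining the two toy formats and passing a buffer that starts with neither signature makes demo_search_handler() return -ENOEXEC, which is the case where the real kernel would give kerneld a chance.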
The main problem with this kind of implementation is in keeping Linux compatible with the standard Unix behaviour. That is, any executable text file that begins with #! must be executed by the interpreter it asks for, and any other executable text is run by /bin/sh. The former issue is easily dealt with by a binary format specialized in running interpreter files (fs/binfmt_script.c), and the interpreter itself is run by calling search_binary_handler() once again. This function is designed to be reentrant, and binfmt_script checks against double invocation. The latter issue is mainly a historical relic and is simply ignored by the kernel. The program trying to execute the file takes care of it. Such a program is usually your shell or make. It's interesting to note that while recent versions of gmake execute properly when a script has no leading #! line, previous versions didn't call a shell, resulting in a “cannot execute binary file” message when running unadorned scripts from within a Makefile.
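What “the program takes care of it” amounts to can be sketched in user space. The code below is a minimal illustration, not what any particular shell or make actually does: the function names sketch_build_sh_argv() and sketch_exec_or_sh() are mine, and the only system facts relied upon are that execve() fails with ENOEXEC on an unrecognized file and that /bin/sh exists.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

extern char **environ;

/*
 * Build the vector { "sh", path, argv[1], ..., NULL } used to hand a
 * script to the shell. The caller frees the result; NULL means no memory.
 */
char **sketch_build_sh_argv(const char *path, char *const argv[])
{
    size_t argc = 0, i, n = 0;
    char **sh_argv;

    assert(path != NULL && argv != NULL);
    while (argv[argc] != NULL)
        argc++;

    sh_argv = malloc((argc + 3) * sizeof(*sh_argv));
    if (sh_argv == NULL)
        return NULL;

    sh_argv[n++] = "sh";
    sh_argv[n++] = (char *)path;        /* the script becomes sh's argument */
    for (i = 1; i < argc; i++)
        sh_argv[n++] = argv[i];         /* original arguments follow */
    sh_argv[n] = NULL;
    return sh_argv;
}

/*
 * Try to execute path directly; if the kernel answers ENOEXEC, run the
 * file through /bin/sh instead. Returns only on failure, with -1.
 */
int sketch_exec_or_sh(const char *path, char *const argv[])
{
    char **sh_argv;

    execve(path, argv, environ);
    if (errno != ENOEXEC)
        return -1;                      /* missing file, no permission, ... */

    sh_argv = sketch_build_sh_argv(path, argv);
    if (sh_argv == NULL)
        return -1;
    execve("/bin/sh", sh_argv, environ);
    free(sh_argv);
    return -1;
}
```

A caller would normally fork() first and invoke sketch_exec_or_sh() in the child, since a successful call never returns.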
All the magic handling of data structures needed to replace the old executable image with the new one is performed by the specific binary loader, based on utility functions exported by the kernel. If you would like to take a look at such code, the function load_out_binary() in fs/binfmt_aout.c is a good place to start—easier than the ELF loader.
The implementation of exec is interesting code, but Linux has more to offer: registration of new binary formats at run time. The implementation is quite straightforward, although it involves mucking with rather elaborate data structures. Either the code or the data structures must accommodate the underlying complexities, and elaborate data structures offer more flexibility than elaborate code.
The core of a binary format is represented in the kernel by a structure called struct linux_binfmt, which is declared in the linux/binfmts.h file as follows:
struct linux_binfmt {
    struct linux_binfmt *next;
    long *use_count;
    int (*load_binary)(struct linux_binprm *, struct pt_regs *);
    int (*load_shlib)(int fd);
    int (*core_dump)(long signr, struct pt_regs *);
};
The three functions, or “methods”, declared by the binary format are used to execute a program file, to load a shared library and to create a core file. The next pointer is used by search_binary_handler(), while the use_count pointer keeps track of the usage count of modules. Whenever a process p is executing in the realm of a modularized binary format, the kernel keeps track of *(p->binfmt->use_count) to prevent unexpected removal of the module.
A module, then, uses the following functions to load and unload itself:
extern int register_binfmt(struct linux_binfmt *);
extern int unregister_binfmt(struct linux_binfmt *);
The functions receive a single argument instead of the usual (pointer, name) pair because no file in the /proc directory lists the available binary formats. The typical code for loading and unloading a binary format, therefore, is as simple as the following:
int init_module(void)
{
    return register_binfmt(&bluff_format);
}

void cleanup_module(void)
{
    unregister_binfmt(&bluff_format);
}

The previous lines belong to the bluff module (Binary Loader for an Ultimately Fallacious Format), whose source is available for public download from ftp://ftp.linuxjournal.com/pub/lj/listings/issue45/2568.tgz.
The structure representing the binary format can declare as NULL any of the functions it offers; NULL functions are simply ignored by the kernel. The easiest binary format, therefore, looks like the following, which is the one used by the bluff module:
struct linux_binfmt bluff_format = {
    NULL, &mod_use_count_,  /* next, count */
    NULL, NULL, NULL        /* bin, lib, core */
};
Yes, bluff is a bluff; you can load and unload it at will, but it does absolutely nothing.
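To see what registering and unregistering amount to, here is a user-space model of a format list kept through the next pointer. It is a sketch only: struct toy_binfmt, the toy_ functions and the error codes chosen are assumptions of mine, not a copy of the code in fs/exec.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct linux_binfmt, reduced to the link. */
struct toy_binfmt {
    struct toy_binfmt *next;
    const char *name;               /* for illustration only */
};

static struct toy_binfmt *toy_formats;  /* head of the list */

/* Append fmt to the list; refuse to link the same structure twice. */
int toy_register_binfmt(struct toy_binfmt *fmt)
{
    struct toy_binfmt **tmp = &toy_formats;

    if (fmt == NULL)
        return -EINVAL;
    while (*tmp != NULL) {
        if (*tmp == fmt)
            return -EBUSY;          /* already registered */
        tmp = &(*tmp)->next;
    }
    fmt->next = NULL;
    *tmp = fmt;
    return 0;
}

/* Unlink fmt from the list; -EINVAL if it was never registered. */
int toy_unregister_binfmt(struct toy_binfmt *fmt)
{
    struct toy_binfmt **tmp = &toy_formats;

    while (*tmp != NULL) {
        if (*tmp == fmt) {
            *tmp = fmt->next;
            fmt->next = NULL;
            return 0;
        }
        tmp = &(*tmp)->next;
    }
    return -EINVAL;
}

/* Count the registered formats by walking the list. */
int toy_count_formats(void)
{
    int n = 0;
    struct toy_binfmt *fmt;

    for (fmt = toy_formats; fmt != NULL; fmt = fmt->next)
        n++;
    return n;
}
```

The pointer-to-pointer walk avoids a special case for the head of the list, which is why both functions look almost identical.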
In order to implement a binary format that is of some use, the programmer must have some background information about the arguments that are passed to the loading function, i.e., format->load_binary. The first such argument contains a description of the binary file and the parameters, and the second is a pointer to the processor registers.
The second argument is only needed by real binary loaders, like the a.out and ELF formats that are part of the Linux kernel sources. When the kernel replaces an executable file with a new one, it must initialize the registers associated with the current process to a sane state. In particular, the instruction pointer must be set to the address where execution of the new program must begin. The function start_thread is exported by the kernel to ease setting up the instruction pointer. In this article I won't go so deep as to describe real loaders but will limit the discussion to “wrapper” binary formats, similar to binfmt_script and binfmt_java.
The linux_binprm structure, on the other hand, must be used even by simple loaders, so it is worth describing here. The structure contains the following fields:
char buf[128]: This buffer holds the first bytes of the executable image. It is usually looked up by each binary format in order to detect the file type. If you are curious about the known magic numbers used to detect the different file formats, you can look in the text file /usr/lib/magic (sometimes called /etc/magic).
unsigned long page[MAX_ARG_PAGES]: This array holds the addresses of data pages used to carry around the environment and the argument list for the new program. The pages are only allocated when they are used; no memory is wasted when the environment and argument lists are small. The macro MAX_ARG_PAGES is declared in the binfmts.h header and is currently set to 32 (128KB, 256KB on the Alpha). If you get the message “Arg list too long” when trying to run a massive grep, then you need to enlarge MAX_ARG_PAGES.
unsigned long p: This is a “pointer” to data kept in the pages just described. Data is pushed to the pages from high addresses to low ones, and p always points to the beginning of such data. Binary formats can use the pointer to play with the initial arguments that are passed to the program being executed, and I'll show such use in the next section. It's interesting to note that p is a pointer to user-space addresses, and it is expressed as unsigned long to avoid an undesired de-reference of its value. When an address represents generic data (or an offset in the memory “array”) the kernel often considers it a long integer.
struct inode *inode: This inode represents the file being executed.
int e_uid, e_gid: These fields are the effective user and group ID of the process executing the program. If the program is set-uid, these fields represent the new values.
int argc, envc: These values represent the number of arguments passed to the new program and the number of environment variables.
char *filename: This is the full pathname of the program being executed. This string lives in kernel space and is the first argument received by the execve system call. Although the user program won't know its full pathname, the information is available to the binary formats, so they can play games with the argument list.
int dont_iput: This flag can be set by the binary format to tell the upper layer that the inode has already been released by the loader.
The structure also contains other fields that are not relevant to the implementation of simple binary formats. What is relevant, on the other hand, are a pair of functions exported by exec.c. The functions are meant to help the job of simple binary loaders such as the ones I'll introduce in this article.
unsigned long copy_strings(int argc, char **argv, unsigned long *page,
                           unsigned long p, int from_kmem);
void remove_arg_zero(struct linux_binprm *bprm);
The first function is in charge of copying argc strings from the array argv to the area that ends at p (a user-space pointer, usually bprm->p), and it returns the new value of the pointer. The strings are copied just before the address pointed to by p (argument strings grow downwards). The original strings, the ones in argv, can either reside in user space or in kernel space, and the array can be in kernel space even if the strings are stored in user space. The from_kmem argument is used to specify whether the original strings and array are both in user space (0), both in kernel space (2) or the strings are in user space and the array in kernel space (1). remove_arg_zero removes the first argument from bprm by incrementing bprm->p.
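The downward-growing layout is easier to follow with a flat buffer in place of the argument pages. The model below is a user-space sketch, not the kernel functions: AREA_SIZE, model_copy_strings(), model_remove_arg_zero() and model_wrap() are hypothetical names, p is an offset into the buffer rather than a user-space address, and 0 is used as the failure value by my own convention.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AREA_SIZE 256   /* hypothetical; stands in for the argument pages */

/*
 * Copy argc strings just below offset p, last string first, so that
 * argv[0] ends up at the lowest address. Returns the new offset, or 0
 * when the area is too small.
 */
unsigned long model_copy_strings(int argc, const char *const *argv,
                                 char *area, unsigned long p)
{
    assert(area != NULL && p <= AREA_SIZE);
    while (argc-- > 0) {
        size_t len = strlen(argv[argc]) + 1;    /* include the NUL */

        if (len >= p)
            return 0;
        p -= len;
        memcpy(area + p, argv[argc], len);
    }
    return p;
}

/* Drop the first string by moving p past it and its NUL.
 * Assumes at least one string is stored at p. */
unsigned long model_remove_arg_zero(const char *area, unsigned long p)
{
    return p + strlen(area + p) + 1;
}

/*
 * What a wrapper format does with the argument list: replace argv[0]
 * with the pair "viewer", "filename", leaving other arguments in place.
 */
unsigned long model_wrap(char *area, unsigned long p,
                         const char *viewer, const char *filename)
{
    p = model_remove_arg_zero(area, p);
    p = model_copy_strings(1, &filename, area, p);
    if (p == 0)
        return 0;
    return model_copy_strings(1, &viewer, area, p);
}
```

Starting from the arguments "./snowy.tif" and "-x", a call to model_wrap() with a viewer and the full pathname leaves three strings at p: the viewer, the pathname and "-x". A real loader must also update the argument count, which this model does not track.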
To turn the theory into sound practice, let's try to expand our bluff into bloom (Binary Loader for Outrageously Ostentatious Modules). The complete source of the new module is distributed together with bluff.
The role of bloom is to display executable images. Give execution permission to your GIF images and load the module, then call your image like it was a command, and xv will display it.
This code is neither particularly original (most of it comes from binfmt_script.c) nor particularly smart (text-only people like me would rather use an ASCII viewer, for instance, and other people prefer a different viewer). I feel this kind of example is quite didactic anyway, and it can be easily run by anyone who can run an X server and has root access to the computer in order to load modules.
The source file is made up of little more than 50 lines and is able to execute GIF, TIFF and the various PBM formats; needless to say, you must give your images execute permissions (chmod +x) in advance. The viewer is configurable at load time and defaults to /usr/X11R6/bin/xv. Here is a sample session copied from my text console:
# insmod bloom.o
# ./snowy.tif
xv: Can't open display
# rmmod bloom
# insmod bloom.o viewer="/bin/cat"
# ./snowy.tif | wc -c
1067564
If you use the default viewer and work within a graphic session, your image file will bloom on the display.
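Recognizing an image boils down to comparing the first bytes of the file, the ones found in bprm->buf, against well-known signatures. The function below is a user-space sketch of such a check. The signatures are the standard ones for GIF, TIFF and the PBM family, but enum img_type and img_detect() are hypothetical names, and whether bloom performs exactly these comparisons is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum img_type { IMG_NONE, IMG_GIF, IMG_TIFF, IMG_PBM };

/*
 * Classify a file from its first bytes.
 * buf must hold at least four bytes, as bprm->buf always does.
 */
enum img_type img_detect(const char *buf)
{
    assert(buf != NULL);

    /* GIF87a and GIF89a share the "GIF8" prefix. */
    if (memcmp(buf, "GIF8", 4) == 0)
        return IMG_GIF;

    /* TIFF: "MM" (big-endian) or "II" (little-endian), then the number 42. */
    if (memcmp(buf, "MM\0*", 4) == 0 || memcmp(buf, "II*\0", 4) == 0)
        return IMG_TIFF;

    /* PBM, PGM and PPM, plain or raw: "P1" through "P6". */
    if (buf[0] == 'P' && buf[1] >= '1' && buf[1] <= '6')
        return IMG_PBM;

    return IMG_NONE;
}
```

A loader built around this check would return -ENOEXEC for IMG_NONE, so that the kernel can go on and try the next registered format.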
If you can't wait to download the source file, you can see the interesting part of bloom in Listing 1. Note that bloom.c falls under the GPL, because most of its code is copied from binfmt_script.c.
The next question that I hear you ask is “How can I set up things so that kerneld can automatically load my module?”
Well, actually it isn't always possible. The code in fs/exec.c only tries to use kerneld when at least one of the first four bytes is not printable. This behaviour is meant to avoid losing too much time with kerneld when the file being executed is a text file without the #! line. While real binary formats have one non-printable byte in the first four, this isn't always true for generic data types.
The net result of this behaviour is that you can't automatically load the bloom viewer when invoking a GIF file or when calling a PBM file by name. Both formats begin with a text string and will therefore be ignored by the auto-loader.
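The test on the first four bytes can be reproduced in user space to predict whether a given file type has any chance of reaching kerneld. This is a sketch: the function names are mine, and the definition of “printable” used here (tab, newline and the ASCII range 0x20 to 0x7e) is an assumption about what the kernel checks.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: printable means tab, newline, or ASCII 0x20..0x7e. */
static int kd_printable(unsigned char c)
{
    return c == '\t' || c == '\n' || (c >= 0x20 && c <= 0x7e);
}

/*
 * Non-zero when at least one of the first four bytes is not printable,
 * i.e. when the file does not look like text.
 * buf must hold at least four bytes.
 */
int kd_would_ask_kerneld(const char *buf)
{
    int i;

    assert(buf != NULL);
    for (i = 0; i < 4; i++)
        if (!kd_printable((unsigned char)buf[i]))
            return 1;
    return 0;
}
```

Under this definition both TIFF signatures qualify because of their zero byte, while “GIF8” and a PBM header such as “P4” followed by a newline are all text.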
When, on the other hand, the file has a non-printing character within the first four, the kernel issues a kerneld request for binfmt-number, where the exact string is generated by this statement:
sprintf(modname, "binfmt-%hd", *(short*)(&bprm->buf));
The ID of the binary format generated by the above statement represents the first two bytes of the disk file, read as a signed short integer. If you try to execute TIFF files, kerneld looks for binfmt-19789 or binfmt-18761. A gzipped file calls for binfmt--29921 (negative). GIF files, on the other hand, are passed to the /bin/sh shell due to their leading text string. If you want to know the number associated with each binary format, look in the /usr/lib/magic file and convert the values to decimal. Alternatively, you can pass the debug argument to kerneld and look at its messages when you execute your data files and it tries to load the corresponding binary format.
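The numbers quoted above can be checked with a few lines of user-space C. The sketch below assumes a little-endian machine such as the x86 and assembles the two bytes by hand, so that it gives the same answer on any host. The function name v20_modname() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build the 2.0-style module name from the first two bytes of the file,
 * read as a signed 16-bit little-endian value (x86 assumption).
 */
void v20_modname(const unsigned char *buf, char *modname, size_t len)
{
    unsigned int raw;
    int id;

    assert(buf != NULL && modname != NULL);
    raw = buf[0] | ((unsigned int)buf[1] << 8);
    id = raw >= 0x8000 ? (int)raw - 0x10000 : (int)raw;    /* sign-extend */
    snprintf(modname, len, "binfmt-%d", id);
}
```

Feeding it “MM” gives binfmt-19789 (0x4d4d), “II” gives binfmt-18761 (0x4949), and the gzip signature 0x1f 0x8b gives binfmt--29921, because 0x8b1f has its top bit set.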
It's interesting to note that kernel versions 2.1.23 and newer switched to an easier and more significant ID by using the following line:
sprintf(modname, "binfmt-%04x", *(unsigned short *)(&bprm->buf[2]));
This new ID string represents the third and fourth byte of the binary file and is hexadecimal instead of decimal, thus leading to strings with a better format and no ugly “minus-minus” appearing now and then.
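The newer naming scheme can be tried out in user space as well. As before, this is a sketch that assumes a little-endian machine and builds the value byte by byte. The name v21_modname() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build the newer-style module name from the third and fourth byte of
 * the file, read as an unsigned 16-bit little-endian value.
 * buf must hold at least four bytes.
 */
void v21_modname(const unsigned char *buf, char *modname, size_t len)
{
    unsigned int id;

    assert(buf != NULL && modname != NULL);
    id = buf[2] | ((unsigned int)buf[3] << 8);
    snprintf(modname, len, "binfmt-%04x", id);
}
```

With this scheme the two TIFF variants map to binfmt-2a00 (“MM” files) and binfmt-002a (“II” files), and a gzipped file using the usual deflate method byte 0x08 followed by a zero flag byte maps to binfmt-0008.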
While calling images by name can be fun, it has no real role in a computer system. I personally prefer calling my viewer by name, and I do not believe in the object-orientedness of the approach. This kind of feature, in my opinion, is best suited to the file manager, where it can be tailored by appropriate configuration files without introducing kernel bloat that lies in the way of every computational path.
What is really interesting about binary formats is the ability to run program files that don't fall in the handy #! notation. This includes executable files belonging to other operating systems or platforms, as well as interpreted languages that have not been designed for the Unix operating system—all those languages that complain about a #! in the first line.
If you want to play one such game, you can try the fail module. This “Format for Automatically Interpreting Lisp” is a wrapper to invoke Emacs any time a byte-compiled e-lisp program is invoked by name. Such practice is definitely failure-prone, as it makes little sense to invoke several megabytes of program code to run a few lines of lisp. Moreover, Emacs-lisp is not suited to command-line handling. Together with fail you'll also find a pair of sample lisp executables for your tests.
A real-world Linux system is full of interesting examples of interpreted binary formats such as the Java binary format. Other examples are the binary format that allows the Alpha platform to run Linux-x86 binaries and the one included in recent DOSEMU distributions that is able to run old DOS programs transparently (although the program must be specifically tailored in advance).
Version 2.1.43 of the kernel and newer ones include generic support for interpreted binary formats. binfmt_misc is somewhat like bloom but much more powerful. You can add new interpreted binary formats to the module by writing the relevant information to the file /proc/sys/fs/binfmt_misc.
Listing 1 and all other programs referred to in this article are available by anonymous download in the file ftp.linuxjournal.com/pub/lj/listings/issue45/2568.tgz.
Alessandro Rubini used to read e-mail at his university account, but then abandoned academia because he was forced to write articles. He now reads e-mail as rubini@linux.it and still writes articles.