Booting the Kernel
A computer system is a complex machine, and the operating system is an elaborate tool that orchestrates hardware complexities to present a simple and standardized environment to the end user. When the power is turned on, however, the system software must boot the kernel while working in a limited operating environment. I describe here the booting process of three platforms: the old-fashioned PC and the more fully featured Alpha and SPARC platforms. The PC is covered in more detail, since it is still in more widespread use than the other platforms, and also because it is the trickiest platform to bring up. No assembly code will be shown, as assembly language is unintelligible to most readers, and each platform has its own.
In order to be able to use the computer when the power is turned on, the processor begins execution from the system's firmware. The firmware is “unmovable software” found in ROM; some manufacturers call it the Basic Input-Output System (BIOS) to underline its software role, some call it PROM or “flash” to stress its hardware implementation, while others call it “console” to focus on user interaction.
The firmware usually checks the hardware's functionality, retrieves part (or all) of the kernel from a storage medium and executes it. This first part of the kernel must load the rest of itself and initialize the whole system. I don't deal with firmware issues here, only with the kernel code, which is distributed with Linux.
When the x86 processor is turned on, it is a 16-bit processor that sees only 1MB of RAM. This environment is known as “real mode” and is dictated by compatibility with older processors of the same family. Everything that makes up a complete system must live within the available megabyte of address space, i.e., the firmware, video buffers, space for expansion boards and a little RAM (the infamous 640KB) must all be there.
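The 1MB limit follows from the way real mode builds addresses: a 16-bit segment is shifted left by four bits and added to a 16-bit offset, giving a 20-bit physical address. The following is a minimal sketch of that arithmetic in Python, not kernel code; the function and constant names are mine and exist only for illustration.

```python
ADDRESS_SPACE = 1 << 20          # the single megabyte visible in real mode
CONVENTIONAL_RAM = 640 * 1024    # the "infamous 640KB" of ordinary RAM


def real_mode_address(segment: int, offset: int) -> int:
    """Return the 20-bit physical address of a real-mode segment:offset pair."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must both fit in 16 bits")
    # Segment * 16 + offset; the mask models the 20 address lines of the
    # original 8086, so results wrap around at 1MB.
    return ((segment << 4) + offset) & (ADDRESS_SPACE - 1)


if __name__ == "__main__":
    print(hex(real_mode_address(0x07C0, 0x0000)))  # boot sector load address
    print(hex(real_mode_address(0x9000, 0x0200)))  # start of the setup code
    print(hex(CONVENTIONAL_RAM))                   # end of conventional RAM
```

Everything above the 640KB mark and below 1MB is taken up by video buffers, expansion ROMs and the firmware itself, which is why so little RAM is left for the boot code.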
To make things difficult, the PC firmware loads only half a kilobyte of code and establishes its own memory layout before loading this first sector. Whatever the boot medium, the first sector of the boot partition is loaded into memory at address 0x7c00, where execution begins. What happens at 0x7c00 depends on the boot loader being used; we examine three situations here: no boot loader, LILO and Loadlin.
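To make the firmware's job concrete, here is a small Python simulation of loading a boot sector into a one-megabyte memory array. It is a sketch under stated assumptions: the 0x55 0xAA signature check in the last two bytes is the usual PC BIOS convention and is not discussed in this article, and the function names are hypothetical.

```python
SECTOR_SIZE = 512        # "half a kilobyte" of code
LOAD_ADDRESS = 0x7C00    # where the firmware places the first sector


def looks_bootable(sector: bytes) -> bool:
    """Check size and the conventional 0x55 0xAA signature (an assumption)."""
    return len(sector) == SECTOR_SIZE and sector[510:512] == b"\x55\xaa"


def load_boot_sector(memory: bytearray, sector: bytes) -> int:
    """Copy the sector to LOAD_ADDRESS and return the address to jump to."""
    if not looks_bootable(sector):
        raise ValueError("not a bootable sector")
    memory[LOAD_ADDRESS:LOAD_ADDRESS + SECTOR_SIZE] = sector
    return LOAD_ADDRESS


if __name__ == "__main__":
    ram = bytearray(1 << 20)                    # 1MB of simulated memory
    fake_sector = bytes(510) + b"\x55\xaa"      # empty code plus signature
    print(hex(load_boot_sector(ram, fake_sector)))
```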
Even though it's rare to boot the system without a boot loader, it is still possible to do so by copying the raw kernel to a floppy disk. The command cat zImage > /dev/fd0 works perfectly on Linux, although some other Unix systems can do the task reliably only by using the dd command. Without going into detail, the raw floppy image created from zImage can then be configured using the rdev program.
The file called zImage is the compressed kernel image that resides in arch/i386/boot after either make zImage or make boot is executed—the latter invocation is the one I prefer, as it works unchanged on other platforms. If you built a “big zImage” instead, the kernel file created is called bzImage and resides in the same directory.
Booting an x86 kernel is a tricky task because of the limited amount of available memory. The Linux kernel tries to maximize usage of the low 640 kilobytes by moving itself around several times. Let's look at the steps performed by a zImage kernel in detail; all of the following path names are relative to the arch/i386/boot directory.
The first sector (executing at 0x7c00) moves itself to 0x90000 and loads subsequent sectors after itself, getting them from the boot device using the firmware's functions to access the disk. The rest of the kernel is then loaded to address 0x10000, allowing for a maximum size of half a megabyte of data—remember, this is the compressed image. The boot sector code lives in bootsect.S, a real-mode assembly file.
Then code at 0x90200 (defined in setup.S) takes care of some hardware initialization and allows the default text mode (video.S) to be changed. Text mode selection is a compile-time option from 2.1.9 onwards.
Later, the whole kernel is moved from 0x10000 (64K) to 0x1000 (4K). This move overwrites BIOS data stored in RAM, so BIOS calls can no longer be performed. The first physical page is not touched because it is the so-called “zero-page”, used in handling virtual memory.
At this point, setup.S enters protected mode and jumps to 0x1000, where the kernel lives. All the available memory can be accessed now, and the system can begin to run.
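The addresses used in the steps above can be collected in a small table to check the arithmetic. This Python sketch only restates the numbers given in the text; the constant and function names are my own.

```python
BOOTSECT_LOAD = 0x7C00     # first sector, as loaded by the firmware
BOOTSECT_MOVED = 0x90000   # where the boot sector relocates itself
SETUP_START = 0x90200      # setup code, right after the 512-byte boot sector
KERNEL_LOAD = 0x10000      # compressed kernel as read from the boot device
KERNEL_FINAL = 0x1000      # destination of the move before protected mode


def max_compressed_size() -> int:
    """Room between the kernel load address and the relocated boot sector."""
    return BOOTSECT_MOVED - KERNEL_LOAD


def fits_as_zimage(size: int) -> bool:
    """True if a compressed image of `size` bytes fits in that window."""
    return 0 < size <= max_compressed_size()


if __name__ == "__main__":
    print(max_compressed_size() // 1024, "KB available")
    print(fits_as_zimage(400 * 1024), fits_as_zimage(600 * 1024))
```

The window between 0x10000 and 0x90000 is exactly half a megabyte, which is the size limit for a zImage.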
The steps just described were once the whole story of booting, when the kernel was small enough to fit in half a megabyte of memory (the address range between 0x10000 and 0x90000). As features were added to the system, the kernel became larger than half a megabyte and could no longer be moved to 0x1000. Thus, the code at 0x1000 is no longer the Linux kernel; instead, the “gunzip” part of the gzip program resides at that address. The following additional steps are now needed to uncompress the kernel and execute it:
head.S in the compressed directory is at 0x1000 and is in charge of “gunzipping” the kernel; it calls the function decompress_kernel, defined in compressed/misc.c, which in turn calls inflate, which writes its output starting at address 0x100000 (1MB). High memory can now be accessed, because the processor has left its limited boot environment, the “real” mode.
After decompression, head.S jumps to the actual beginning of the kernel. The relevant code is in ../kernel/head.S, outside of the boot directory.
The boot process is now over, and head.S (i.e., the code found at 0x100000 that used to be at 0x1000 before introducing compressed boots) can complete processor initialization and call start_kernel(). Code for all functions after this step is written in C.
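The decompression step can be imitated in a few lines of Python. This is only a simulation under simplifying assumptions: memory is a plain byte array, the compressed data is assumed to start right at 0x1000 (in the real kernel the decompressor code sits there, with the data after it), and Python's gzip module stands in for the kernel's inflate routine. The function name is hypothetical.

```python
import gzip

COMPRESSED_AT = 0x1000    # simplified: compressed data starts here
OUTPUT_AT = 0x100000      # decompressed kernel is written from 1MB upwards


def inflate_to_high_memory(memory: bytearray, compressed_len: int) -> int:
    """Decompress the image found in low memory and store it at 1MB.

    Returns the size of the uncompressed image in bytes.
    """
    compressed = bytes(memory[COMPRESSED_AT:COMPRESSED_AT + compressed_len])
    image = gzip.decompress(compressed)
    memory[OUTPUT_AT:OUTPUT_AT + len(image)] = image
    return len(image)


if __name__ == "__main__":
    ram = bytearray(4 << 20)                     # 4MB of simulated memory
    kernel = b"linux" * 100_000                  # stand-in for a kernel image
    packed = gzip.compress(kernel)
    ram[COMPRESSED_AT:COMPRESSED_AT + len(packed)] = packed
    size = inflate_to_high_memory(ram, len(packed))
    print(size, bytes(ram[OUTPUT_AT:OUTPUT_AT + size]) == kernel)
```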
The various data movements performed at system boot are depicted in Figure 1.
Figure 1. System Boot Data Map
The boot steps shown above rely on the assumption that the compressed kernel can fit in half a megabyte of space. While this is true most of the time, a system stuffed with device drivers might not fit into this space. For example, kernels used in installation disks can easily outgrow the available space. Some new method is needed to solve the problem—this new method is called bzImage and was introduced in kernel version 1.3.73.
A bzImage is generated by issuing make bzImage from the top level Linux source directory. This kind of kernel image boots similarly to zImage, with a few changes:
When the system is loaded to address 0x10000, a little helper routine is called after loading each 64K data block. The helper routine moves the data block to high memory by using a special BIOS call. Only the newer BIOS versions implement this functionality, and so, make boot still builds the conventional zImage, though this may change in the near future.
setup.S doesn't move the system back to 0x1000 (4K) but, after entering protected mode, jumps instead directly to address 0x100000 (1MB) where data has been moved by the BIOS in the previous step.
The decompressor found at 1MB writes the uncompressed kernel image into low memory until it is exhausted, and then into high memory after the compressed image. The two pieces are then reassembled at address 0x100000 (1MB). Several memory moves are needed to perform the task correctly.
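The first of these changes, loading in 64K blocks and moving each block to high memory, can be sketched as follows. This is a Python simulation only: an ordinary slice copy stands in for the special BIOS call, and all names are mine.

```python
CHUNK = 64 * 1024        # size of each block read from the boot device
LOW_BUFFER = 0x10000     # where each block is first loaded
HIGH_BASE = 0x100000     # start of high memory (1MB)


def load_big_image(memory: bytearray, image: bytes) -> int:
    """Load `image` block by block and return the number of blocks moved."""
    blocks = 0
    for start in range(0, len(image), CHUNK):
        block = image[start:start + CHUNK]
        # Step 1: the block is read into the low-memory buffer.
        memory[LOW_BUFFER:LOW_BUFFER + len(block)] = block
        # Step 2: the helper routine moves it up, past the 1MB mark.
        dest = HIGH_BASE + start
        memory[dest:dest + len(block)] = memory[LOW_BUFFER:LOW_BUFFER + len(block)]
        blocks += 1
    return blocks


if __name__ == "__main__":
    ram = bytearray(4 << 20)
    big = bytes(range(256)) * 3000               # 768000 bytes, over 512KB
    print(load_big_image(ram, big), "blocks moved")
```

Because the low buffer is reused for every block, the size of the image is no longer bounded by the half megabyte between 0x10000 and 0x90000.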
The rule for building the big compressed image can be read from Makefile; it affects several files in arch/i386/boot. One good point of bzImage is that when kernel/head.S is called, it doesn't notice the extra work, and everything goes forward as usual.
Most Linux-x86 users don't boot the raw kernel image from a floppy; instead they boot LILO from the hard disk. LILO replaces part of the process outlined above so that it can load a Linux kernel that is scattered throughout a disk. This capability allows the user to boot a kernel file from a file system partition without using the floppy.
In practice, LILO uses the BIOS services to load single sectors from the disk, and then it jumps to setup.S. In other words, it arranges the memory layout in the same way as bootsect.S; thus, the usual booting mechanism can complete painlessly. LILO is also able to handle a kernel command line, and this is a good reason by itself to avoid booting the raw kernel image.
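The idea of loading a file that is scattered across the disk can be shown with a short Python sketch. It illustrates the principle only and is not LILO's actual map format: the "map" here is simply a list of sector numbers, the disk is a byte string, and every name is hypothetical.

```python
SECTOR_SIZE = 512


def read_sector(disk: bytes, number: int) -> bytes:
    """Stand-in for a BIOS disk read: return one 512-byte sector."""
    start = number * SECTOR_SIZE
    return disk[start:start + SECTOR_SIZE]


def load_from_map(disk: bytes, sector_map: list) -> bytes:
    """Reassemble a file by reading its sectors in the order the map gives."""
    return b"".join(read_sector(disk, number) for number in sector_map)


if __name__ == "__main__":
    # An 8-sector disk where sector n is filled with the byte value n.
    disk = b"".join(bytes([n]) * SECTOR_SIZE for n in range(8))
    kernel = load_from_map(disk, [5, 2, 7])      # sectors are not contiguous
    print(len(kernel), kernel[0], kernel[512], kernel[1024])
```

Since the loader follows a precomputed list of sectors, it needs no knowledge of the file system; the price is that the list must be rebuilt whenever the kernel file moves on disk.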
If you want to boot a bzImage with LILO, you must use LILO version 18 or later. Earlier versions of LILO are not able to load segments into high memory, an ability that is needed when loading big images in order for setup.S to find the expected memory layout.
The main disadvantage of LILO is that it uses the BIOS to load the system. This forces the kernel and other relevant files to reside within the first 1024 cylinders of the disk, so that they are accessible to the BIOS. When using the PC firmware, you discover how old-fashioned the architecture actually is.
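To see where the 1024-cylinder limit puts the boundary on a given disk, one can multiply out the geometry. The Python sketch below assumes the traditional BIOS disk interface, with a 10-bit cylinder number, at most 256 heads and at most 63 sectors per track of 512 bytes each; these field widths are my assumption about the classic interface and are not stated in this article.

```python
MAX_CYLINDERS = 1 << 10   # 10-bit cylinder number: 0..1023
SECTOR_SIZE = 512


def bios_reachable_bytes(heads: int, sectors_per_track: int) -> int:
    """Bytes reachable through the BIOS for a disk with this geometry."""
    if not (1 <= heads <= 256 and 1 <= sectors_per_track <= 63):
        raise ValueError("geometry outside the assumed BIOS limits")
    return MAX_CYLINDERS * heads * sectors_per_track * SECTOR_SIZE


def is_reachable(cylinder: int) -> bool:
    """True if a block on this cylinder can be loaded by the BIOS."""
    return 0 <= cylinder < MAX_CYLINDERS


if __name__ == "__main__":
    print(bios_reachable_bytes(16, 63) // (1 << 20), "MB with 16 heads")
    print(bios_reachable_bytes(256, 63), "bytes at the largest geometry")
    print(is_reachable(1023), is_reachable(1024))
```

A common remedy is a small partition near the start of the disk that holds the kernel images, so that they stay below the limit regardless of how large the disk is.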
Even if you don't run LILO, you can enjoy the documentation files distributed with LILO's source code. They document the boot process on the PC and explain how to handle (almost) every conceivable situation.
If you want to boot your operating system from another operating system, Loadlin is the tool for you. This program is similar to LILO in that it loads the kernel from a disk partition and then jumps to setup.S. It is different from LILO in that it not only faces the BIOS restrictions, but also must dispose of an established memory layout without compromising the system's stability. On the other hand, Loadlin is not restricted to a half kilobyte length because it is a complete program file, not a boot sector. Version 1.6 and later of Loadlin are able to load big images.
Loadlin can pass a command line to the kernel and is, therefore, as flexible as LILO. Most of the time, you'll write a linux.bat file to pass a full-featured command line to Loadlin when calling the linux command.
Loadlin can be used to turn any networked PC into a Linux box. All that is needed is a kernel image equipped for mounting the root partition via NFS, Loadlin and a linux.bat containing the correct IP numbers. You need a properly configured NFS server as well, but any Linux machine can fill that job. For example, the following command line turns a PC (alfred.unipv.it) into a workstation:
loadlin c:\zimage rw nfsroot=/usr/root/alfred \
    nfsaddrs=193.204.35.117:193.204.35.110:193.204.35.254:255.255.255.0:alfred.unipv.it
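The nfsaddrs argument packs several values into one colon-separated string. The Python sketch below splits it and sanity-checks the result, assuming the field order is client address, server address, gateway, netmask and host name; that order is my reading of the example above and should be checked against the kernel's NFS-root documentation. The function names are hypothetical.

```python
import ipaddress

FIELDS = ("client", "server", "gateway", "netmask", "hostname")


def parse_nfsaddrs(value: str) -> dict:
    """Split an nfsaddrs string into named fields (assumed field order)."""
    parts = value.split(":")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d colon-separated fields" % len(FIELDS))
    result = dict(zip(FIELDS, parts))
    for key in FIELDS[:4]:
        ipaddress.IPv4Address(result[key])   # raises ValueError if malformed
    return result


def server_on_local_net(fields: dict) -> bool:
    """True if the NFS server is on the client's own subnet."""
    network = ipaddress.IPv4Network(
        fields["client"] + "/" + fields["netmask"], strict=False)
    return ipaddress.IPv4Address(fields["server"]) in network


if __name__ == "__main__":
    line = ("193.204.35.117:193.204.35.110:193.204.35.254:"
            "255.255.255.0:alfred.unipv.it")
    fields = parse_nfsaddrs(line)
    print(fields["hostname"], server_on_local_net(fields))
```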
The boot code is not as simple as I have described: it must deal with a lot of details, such as carrying the kernel's command line along, keeping track of the boot technique being used, and so on. The curious reader can look in the source files to learn more and to read the authors' comments. There's a lot of information in the comments, and they are often fun to read.
I personally feel most users will never need to touch the boot code, because things are much more interesting when the system is up and running. At those times you can exploit all the features of your processor and all the available RAM without going mad with processor-level issues.
The Alpha platform is much more mature than the PC, and its firmware reflects this maturity. My experience with Alpha is limited to the ARC firmware, which is the most widely used.
After performing the usual detection of devices, the firmware displays a boot menu that lets you choose which file to boot. The firmware can read a disk partition (though only a FAT partition), so you actually boot a “file” without the need to hack boot sectors and build maps of disk blocks.
The file booted is usually linload.exe, which in turn loads MILO (the “Mini Loader”). In order to boot Linux through the ARC firmware, you must have a small FAT partition on your hard drive to store linload.exe and milo files. The Linux kernel doesn't need to access the partition unless you upgrade MILO, so FAT support can be safely left out of your Alpha kernel.
Actually, the user can exploit different options. The ARC boot menu can be configured to boot Linux by default, and MILO can be burnt in flash memory in order to get rid of the FAT partition. However, whatever you do, you end up with MILO running.
The MILO program is a stripped-down version of the Linux kernel. It has all of the Linux device drivers and a file system decoder; unlike the kernel it doesn't have process control and does include Alpha initialization code. This tool can set up and enable virtual memory and can load a file from either an ext2 partition or an iso9660 device. The “file” in question is loaded to virtual address 0xfffffc0000300000 and then executed. This virtual address is also the one where the Linux kernel runs; however, it's unlikely you'll ever load anything but Linux. One exception is the fmu (“flash management utility”) program used to burn MILO in flash ROM—fmu is compiled to execute from the same virtual address whence the kernel runs, and it is distributed with MILO.
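The load address looks less arbitrary once it is split into a base and an offset. The Python sketch below assumes that Linux on the Alpha maps physical memory one-to-one starting at virtual address 0xfffffc0000000000; that base value is my assumption and is not stated in this article, and the names are hypothetical.

```python
KSEG_BASE = 0xFFFFFC0000000000      # assumed start of the direct-mapped region
LOAD_ADDRESS = 0xFFFFFC0000300000   # where MILO places the loaded file


def virt_to_phys(address: int) -> int:
    """Translate a direct-mapped kernel virtual address to a physical one."""
    if address < KSEG_BASE:
        raise ValueError("address is below the direct-mapped region")
    return address - KSEG_BASE


if __name__ == "__main__":
    phys = virt_to_phys(LOAD_ADDRESS)
    print(hex(phys), phys // (1 << 20), "MB")
```

Under that assumption the file simply lands at the 3MB mark of physical memory.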
It's interesting to note that MILO also includes a small 386 emulator and some of the PC BIOS functionality. This is needed in order to execute self-initialization code found on many ISA/PCI peripheral boards (PCI boards, though claiming to be processor-independent, use Intel machine code in their ROM images).
Since MILO does all of this, what is left to the Linux kernel? Very little, actually. The first kernel code to execute in Linux-Alpha is arch/alpha/kernel/head.S, and all it does is set up a few pointers and jump to start_kernel(). In fact, kernel/head.S for Alpha is much shorter than the equivalent x86 source file.
If, for some reason, you don't wish to run MILO there is an alternative, though not a practical one. In arch/alpha/boot you'll find the source for a “raw” loader that is compiled by issuing make rawboot from the top level Linux source directory. This utility can load a file from a sequential region of a device—either floppy or hard disk—using the firmware's “call backs”.
In practice, the raw loader accomplishes a task similar to the one bootsect.S performs for the PC platform—it forces a copy of the kernel to either a raw floppy or a raw hard disk partition. There's no real reason to use this technique—it is quite hairy and lacks the flexibility MILO offers. I personally don't know if it still works; the “PALcode” used by Linux is exported by MILO and is different from the one exported by the ARC firmware. The PALcode is a library of low-level functions used by Alpha processors to implement low-level hardware management like paging; if the current PALcode implements different operations than the software expects, the system won't work.
Bringing up a SPARC computer is similar to booting the Alpha on the user side, and similar to booting the PC on the software side.
The user sees that the firmware loads and executes a program, which in turn is able to retrieve and uncompress a file found on a disk partition. The “program” in question is called SILO, and it can read files from either an ext2 or a ufs partition. Unlike MILO, but like LILO, SILO is able to boot another operating system. There is no need for this ability on the Alpha, because the firmware can boot multiple systems; once you run MILO, you have already made your choice (the right choice: Linux).
When a SPARC computer boots, the firmware loads a boot sector after performing all the hardware checks and device initialization. It's interesting to note that Sbus devices are platform independent, and their initialization code is portable Forth code rather than machine language bound to a particular processor.
The boot sector loaded is what you find in /boot/first.b in your Linux-SPARC system and is a bare 512 bytes. It is loaded to address 0x4000, and its role is retrieving /boot/second.b from disk and writing it to address 0x280000 (2.5 MB); this address was chosen because the SPARC specifications state that at least 3MB of RAM must be mapped at boot time.
The second-stage boot loader then does everything else. It is linked with libext2.a to access system partitions and can thus load a kernel image from your Linux file system. It can also uncompress the image, since it includes the inflate.c routine from the gzip program.
The second-stage loader, second.b, reads a configuration file called /etc/silo.conf, similar in shape to lilo.conf. Since the file is read at boot time, there's no need to re-install the kernel maps when a new kernel is added to the boot choices. When SILO shows its prompt, you can choose any kernel image (or other operating system) specified in the silo.conf file, or you can specify a complete device/path name pair to load a different kernel image without editing the configuration file.
SILO loads the disk file to address 0x4000. This means the kernel must be smaller than 2.5MB; if it is larger, SILO will refuse to overwrite its own image. No conceivable Linux-SPARC kernel currently exceeds that size, unless it was compiled with -g to have debugging information available. In this case, the kernel image must be stripped before being handed to SILO.
Finally, SILO performs kernel decompression and/or remapping to place the image at virtual address 0xf0004000. The code that takes over after SILO is finished is arch/sparc/kernel/head.S. The source includes all the trap tables for the processor and the actual code to set the machine up and call start_kernel(). The SPARC version of head.S is quite big.
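The SPARC addresses quoted above fit together as shown in this short Python sketch. It only restates the article's numbers; the constant and function names are mine.

```python
FIRST_STAGE = 0x4000       # first.b, loaded by the firmware
SECOND_STAGE = 0x280000    # second.b, at the 2.5MB mark
KERNEL_LOAD = 0x4000       # where SILO places the kernel file
KERNEL_VIRTUAL = 0xF0004000


def max_kernel_size() -> int:
    """Space below SILO's own image that the kernel file may occupy."""
    return SECOND_STAGE - KERNEL_LOAD


def silo_can_load(size: int) -> bool:
    """True if a file of `size` bytes does not reach SILO's image."""
    return 0 < size <= max_kernel_size()


if __name__ == "__main__":
    print(max_kernel_size(), "bytes available")
    print(silo_can_load(2_000_000), silo_can_load(3_000_000))
    print(hex(KERNEL_VIRTUAL - KERNEL_LOAD))   # offset added by the remapping
```

Strictly speaking the limit is 16KB short of 2.5MB, because the kernel starts at 0x4000 rather than at address zero.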
After architecture-specific initialization is complete, the init/main.c program takes control of whichever processor you are using.
The start_kernel() function first calls setup_arch(), which is the last architecture-specific function. Unlike other code, however, setup_arch() can exploit all the processor's features and is a much easier source file to deal with than those described earlier. This function is defined in the kernel/setup.c code under each architecture source tree.
The start_kernel() function then initializes all the kernel's subsystems—IPC, networking, buffer cache and so on. After all initialization is done, these two lines complete the code:
kernel_thread(init, NULL, 0);
cpu_idle(NULL);
The init thread is process number 1: it mounts the root partition, executes /linuxrc if CONFIG_INITRD has been selected at compile time, and then executes the init program. If init can't be found, /etc/rc is executed. In general, using rc is discouraged, since init is much more flexible than a shell script in handling system configuration. As a matter of fact, version 2.1.21 of the kernel removed the /etc/rc option, making it obsolete. If neither init nor /etc/rc can be run, or if they exit, /bin/sh is executed repeatedly (but 2.1.21 and later kernels will execute it only once). This feature exists only as a safeguard in case the init file is removed or corrupted by mistake. If you remove a.out support from the kernel without recompiling your old init, you'll enjoy having at least a shell running after reboot.
The kernel has no more tasks to do after spawning process number 1, and all other functions are handled in user space by init, /etc/rc or /bin/sh.
And process 0? The so-called “idle” task executes cpu_idle(), a function that calls idle() in an endless loop. idle() in turn is an architecture-dependent function that is usually in charge of turning off the processor to save power and increase the processor's lifetime.
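The fallback order used by process number 1 amounts to "run the first program that exists". The Python sketch below models that choice for kernels after 2.1.21, where /etc/rc is no longer tried. The list of init locations is my assumption about where the kernel looks and is not taken from this article; the function name is hypothetical, and the predicate argument exists only so the logic can be tried without touching the real file system.

```python
import os

INIT_CANDIDATES = ("/sbin/init", "/etc/init", "/bin/init")   # assumed paths
FALLBACK_SHELL = "/bin/sh"


def pick_first_program(candidates=INIT_CANDIDATES,
                       fallback=FALLBACK_SHELL,
                       is_executable=None):
    """Return the first runnable program, or None if nothing can be run."""
    if is_executable is None:
        def is_executable(path):
            return os.access(path, os.X_OK)
    for path in (*candidates, fallback):
        if is_executable(path):
            return path
    return None


if __name__ == "__main__":
    # Pretend init has been removed by mistake and only a shell is left.
    print(pick_first_program(is_executable=lambda path: path == "/bin/sh"))
```

When the function returns None, there is nothing left to run, which corresponds to a system that cannot complete its boot.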
Alessandro is a Linux enthusiast who writes documentation because he's not smart enough to write software. His 486 is specialized in grepping through source code, and humbly leaves real jobs to the Alpha and the SPARC. He can be reached via e-mail at rubini@ipvvis.unipv.it.