From the Publisher
The end of March is quickly approaching as I write. Here at SSC, March is always an exciting time as it is the end of our accounting year. For the last two weeks Gena Shurtleff and I (with help from others) have been working on a budget for the next year of LJ--something that has made some of us very grouchy. [mainly, our publisher --Ed.]
To make this process even more fun, there have been a lot of computer-related problems—all related to our Linux systems. [Note: if you are humor-impaired you may want to skip a lot of this.] First, our main server failed. Linux apparently caused a pin to break off the cable to the external SCSI disk drive. Then, a week or so later, Linux broke a head on the disk drive in our firewall. Next, the fan in our editor's computer started growling at her. On top of all that, various systems in the office were mysteriously crashing or exhibiting very strange behavior in general. The end result was too many hours of downtime, lost e-mail and an unhappy working relationship with our computers.
We began to wonder if Linux was good enough for us, and if perhaps Windows NT might support “automatic pin re-soldering” and “disk-head replacement”.
Once things calmed down we took a closer look at the problems. By the way, we refers specifically to Jay Painter, our new systems administrator, Peter Struijk and me. First, we concluded that it probably wasn't the fault of Linux that hardware was breaking. As scary as it is that multitasking operating systems write to disks whenever they think it's a good time, this really isn't a reason for a cable to break.
To address the issue of being down for longer periods of time than we thought appropriate, let's look at some specific cases.
Note: at this time, our various systems ran a host of different software versions (kernels and C libraries), and we were in the midst of converting them to Debian in order to have consistency across all machines.
The failure of the cable caused the data on the server disks to be scrambled. Fortunately, we had back-up copies of the files, so it seemed like a good time to do an upgrade. We made the logical decision to reload a standard Slackware system (rather than try to change to Debian), restore the user files and be on our way. It turned out to be not so easy. The new load of Slackware had more differences from the old version than we expected. Libraries were different. NFS and NIS were different. Adobe fonts we use for doing reference cards had to be reloaded. Configuration files for groff had to be updated. A lot of work was done to get the new configuration talking to all the old systems and to get everything tuned.
Possibly inspired by the extra work the firewall was doing during the time the server was down, a head died on the disk in the firewall the next weekend resulting in the loss of a lot of mail. Why was it lost? Why wasn't it queued and then forwarded? Was this another Linux shortcoming?
On investigation we found that backup MX (mail exchanger) records were in place to take care of this very problem; however, they were pointing to the wrong machine. Again, the problem could not be pinned on Linux; it was an administrative error by a previous systems administrator. The mistake went undetected because this backup had never been exercised before, since Linux had been working flawlessly.
Let's move on to those strange software problems I mentioned. Surely we can find something to pin on Linux here.
One machine, used as our DNS name server, had been less than reliable. Two things in particular happened quite regularly. The first was that syslogd, the system log daemon, would hang in a loop eating up all available CPU time. While this problem appeared to be related to the location of the log file disappearing (caused by a reboot of the file server or a network problem), we haven't been able to fix it. However, it doesn't appear to happen in newer Linux releases (our problem machine is running 1.2.13) and, while it is irritating, it does not cause the machine to crash—just to run slower than normal.
The other problem on this same machine was stranger, although it turned out to be fixable. Multiple copies of crond (the cron daemon) kept appearing on the machine even though only one was initiated at boot time. One day, I found 13 crond jobs running, killed 12 and, a few hours later, found three still running.
At this point Jay jokingly said, “Maybe there is a cron job starting cron jobs.” Well, since there were processes being started by cron that didn't exist on the other machines, I started looking around for suspicious jobs. The first couple of extraneous jobs I found were benign, but then I found one that made both of us realize that Jay's attempt at a joke wasn't a joke at all. There was, in fact, a cron job that initiated another cron job. Or, more accurately, a cron job that grepped for everything but a cron job, attempted to kill all cron jobs and then started a new cron job. In other words, it looked like a partially written script to do “who knows what”--nothing that would actually work. It was signed and dated, so we could see both who wrote it and that the creation date was about the time when stability problems first appeared. Again, we had found “pilot error”, not a problem with Linux.
There are more of these stories than there is space to tell them. Basically what we found out was that even though various distributions may have some kinks in them like a wrong file permission at install time, they do all install. That's true for Caldera, Linux FT, Debian, Red Hat, Slackware, Yggdrasil and all the others. Software does not wear out. If the system is running it is not likely to stop aside from hardware errors. As a case in point, I still have a 0.99 kernel running on the main machine at fylz.com that was installed in August, 1993. It is an NFS server with three 38.8KB modems on it. The hardware is a 386DX40 with 8MB of RAM. Why haven't I upgraded it? It works, and it is extremely stable. The last reboot was in November 1996, when I turned off the machine to remove a zip drive from the SCSI bus.
We are now back up and running and continuing our conversion to Debian. We are fixing problems as we find them—not just hardware and software but also administrative. For example, if we had been able to find the cellular phone number of our ISP, we could have had him address some problems much more quickly than by communicating via e-mail from a system that was mostly broken.
Before I get into specifics, let me say, no matter what software you are running, it is hard to program around hardware malfunctions. Detecting and halting them is about the best you can do.
You can also do backups and document. At some point we had a configuration disk for our firewall; but when we needed to replace the hard disk, the configuration disk had vanished. This loss cost hours of work time and probably a day of uptime. Having a complete backup of everything, boot disks for all machines, spare cables and disk drives and other assorted parts can make a big difference in the elapsed time to deal with a problem.
Another essential element that we are missing is a monitoring system. If something breaks at 3AM on Sunday, it may not be noticed until 8AM Monday. Being able to detect a problem immediately is crucial to getting it fixed in a timely manner. Even if it is just one person's workstation, having it fixed or replaced before the user comes to work on Monday makes for much higher effective reliability for the network.
One other innovation that we are working on is a backup file server to perform daily backups. Scripts will copy everyone's home directories over to this machine and then its disks will be written to tape. Then, if the main server fails, we can just move this machine in to take over as main.
One other thing that still seems to be a problem is a reliable Auto Mount Daemon. While we have run AMD semi-successfully, it has proved to be somewhat unstable in our configuration. AMD depends on NIS, and there have been failures caused by the physical network or the yp server.
To address this problem, Jay had an idea. Note that it is still an idea not a design specification, but it is worth mentioning. Here it is in his words:
Anyway, my idea for a new VFS-based automount would be to have a daemon which mounted a special VFS file system similar to PROC and then, through a series of system calls, populated it according to its automount maps; NIS or file or otherwise. Whenever someone descended into a directory, the kernel would send a standard Unix signal to the daemon, which would execute a system call to figure out what it had to do. This would prevent the daemon from having to use shared memory or other system V IPC, and it would allow the daemon to be ported to Berkeley OSes.
For those not familiar with it, VFS is an interface layer between the user and the actual file system being supported. This made implementation of the DOS, System V and other file systems much easier than on systems without VFS. Implementing an auto-mount file system should be relatively easy using VFS, and it could offer an integrated, reliable solution to a problem that plagues many operating systems.
One other thing that would be nice is a journaling file system (JFS). Unix System V, Release 4 has one and it is probably the one place where SVR4 is superior to Linux. As this is an editorial rather than a technical article, let's just say that JFS maintains enough information to reconstruct the exact file system structure after a crash without having to revert to the seek out and destroy sort of approach that fsck uses. This means a lower probability of losing anything and much quicker reboot times after a crash.
That's it from this end. If there are people out there interested in working on either of these last two projects, drop me a line at phil@ssc.com. Maybe we can be better than everyone after all.