Recovering from a Hard Drive Failure
Have you ever woke up in the morning and said to yourself, “today is the day that I'm finally going to backup my workstation!” only to find out that you're a day late and about 320Gb short? Well, that's about what happened to me recently, but don't worry, the story has a happy ending. I'm getting ahead of myself though.
Most people's excuse for not performing routine maintenance or regular backups is that they just don't have time. So when I discovered that I had some down time, I decided to take to take care of a few issues on my workstation. I performed a system update. Since I leave my system on all the time, I decided to upgrade the kernel and try to get software suspend working so I could cut down on energy consumption and heat production in my office. Finally, I resolved to finish backing up my home directory.
The system update went without incident and the kernel compiled and installed without error. The next step was to reboot into the new kernel. When the kernel panic'ed, I figured that I had missed something in the kernel configuration, so I rebooted back to my older kernel, which also panic'ed. Since this system had been running not 15 minutes ago, I knew things were about to get ugly.
At this point, I remembered that I had been doing some testing with an Ubuntu live CD, so I booted the live CD. At least now, I could get some work done, even though my workstation was “toes up.” This would also give me a platform from which to work on my regular hard drive, or so I thought. When I attempted to mount /dev/sda3, I was told that it didn't exist. Fdisk told me that my partition table was mostly gone! All that was left was /dev/sda1, where I keep my kernel, and /dev/sda2, which is where I swap. I posted a message describing my situation to the Gentoo user's group and was told that I should look into a program called testdisk.
I figured that I should at least assess /dev/sda1, so I tried to mount it. No such luck. The filesystem wasn't recognized. A quick look at /proc/filesystems told me that Ubuntu hadn't loaded ext2 support into the kernel. Further investigation revealed that Ubuntu loaded all of it's drivers from an initial ram disk and they weren't immediately available in /lib/modules. I couldn't bring myself to dissect an initial ram disk image on a system that was RUNNING on a ram disk, so out came the Gentoo installation CD.
It was while watching the Gentoo CD boot, that I saw the IDE disk seek error messages for the first time. I don't reboot my system very often and the Ubuntu live CD hides those messages from you, so who knows how long I'd been working with a drive that needed to be replaced?
Once the Gentoo CD had booted, it was time to try to recover my system. I discovered that testdisk wasn't installed on the CD, so I had to wget and untar it first. Oddly enough, I had to run testdisk and reboot a couple times before I had a partition table that looked sane. When I tried to mount the filesystem, I was told that mount couldn't find a valid filesystem. As a list ditch effort, I decided to try to fsck the filesystem anyway. The fsck program reported that it couldn't find a superblock, but this was the first good news I had received so far; I knew I could use the -b parameter and ask fsck to use a backup superblock. At least fsck hadn't choked completely. So, I issued a command like fsck -y -t ext2 -b 8192 /dev/sda3 to see what would happen. When fsck started to spew error messages indicating fix-ups it was performing, I decided that the process would take a while and went to be for the night.
When I woke up, I found that fsck had finished so I mounted the resulting filesystem. I was really hoping to see all of my files intact, but no, all I saw was /lost+found. When I cd'ed into the lost+found directory, I got my first glimpse of just how bad things had been. The fsck program had done it's job and recovered my filesystem, but it was unable to recover any of the file names at the root of the partition, so it moved the files to the lost+found directory and renamed each file after it's I-node number. All I had was a list of files and directories with names resembling #19539303. And the directory list was several screens in length; I usually keep a pretty clean / directory, so obviously, fsck had encountered a lot of trouble.
One of these oddly-named directories was my /home directory. I made an educated guess as to which one that was and sure enough, I had user directories. (My /home directory was the one reported with the largest file size.) Deeper inspection revealed that most of my files seemed to be there, and they were properly named! I was in business!
When my new disk arrived, I installed it and started copying my old files onto the new drive. I was immediately struck by how slow this process was going. It was as if I were transferring the files over a dial-up modem! It didn't help that the IDE subsystem had reset a few times in the process. At this rate the new drive would be out of warranty by the time my file recovery was complete, so I had to do something. It turns out that I had accumulated a lot of files in my home directory that I really didn't need. I had downloaded games and other software and simply built them in my home directory rather than installed them on the system. After I had pruned out all of the files and directories that I didn't care about, I was able to recover the rest of my /home directory.
So there you have it. When I started, I had a dead machine, a failing hard drive, a corrupt partition table, and a corrupt filesystem. When I had finished, I had at least recovered the important files from the system and had been able to carry on my day-to-day work without too much interruption, thanks to the Live CD. But there are some lessons to be learned here, which is why I chose to write about my experience.
I should have backed up yesterday. But for the record, my business files were on my server and I have redundant, off-site backups of them. I was mostly interested in recovering my password wallet, a few pictures and videos that I'd saved, and a few miscellaneous documents. OK, lesson learned.
But there's more. I was grateful to be able to keep running using a Live CD. However, I'm a KDE user and the Ubuntu CD that I had was Gnome-based. I got my work done, but it would have been nice to be in an environment that I was accustomed to using. In the future, I'll be keeping a Knoppix or Kubuntu CD handy.
I also found that my Gentoo CD just wasn't up to the task of system recovery. I'll be burning a genuine recovery disk, as soon as I have a system on which to burn CD's.
I really needed to have a set of emergency CD's handy for this situation. I could see having a CD wallet that had a Live CD, a Recovery CD, and an Installation CD. Having these CD's handy would have saved me a lot of time.
That said, I have to say that I'm glad to have been able to recover my data and that I wasn't down too terribly long in the process. I also wanted to mention how helpful the Linux Community is in times like this. I'm a fairly experienced Linux user, but it was sure nice to be able to ask questions before I actually committed changes to disk. I hope my tale of woe serves as both warning and encouragement to you; stuff happens, and you can recover from it.