Chasing Linux Kernel Archives
Kernel development is truly impossible to keep track of. The main mailing list alone is vast beyond belief. Then there are all the side lists and IRC channels, not to mention all the corporate mailing lists dedicated to kernel development that never see the light of day. In some ways, kernel development has become fundamentally mysterious.
Once in a while, some lunatic decides to try to reach back into the past and study as much of the corpus of kernel discussion as he or she can find. One such person is Joey Pabalinas, who recently wanted to gather everything together in Maildir format, so he could do searches, calculate statistics, generate pseudo-hacker AI bots and whatnot.
He couldn't find any existing giant corpus, so he tried to create his own by piecing together mail archived on various sites. It turned out to be more than a million separate files, which was too much to host on either GitHub or GitLab. He asked the Linux kernel mailing list for suggestions on better hosting options, although he acknowledged, "It's possible I'm the only weirdo who finds this kind of thing useful, but I figured I should share it just in case I'm not."
Joe Perches suggested plumbing the archives at kernel.org/lore.html, which go back decades. But Joey said he'd tried that, and he found it all but impossible to convert those archives to the Maildir format he wanted. Instead, he'd spent the previous several weeks scraping the lkml.org archive and scripting his own conversion routines.
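For readers wondering what such a conversion involves in the simplest case, here is a minimal sketch using Python's standard mailbox module. It assumes the messages are already sitting in a single mbox file, which is an assumption for illustration only; the discussion doesn't say what intermediate format Joey had, and this is not his script. The function name mbox_to_maildir is likewise made up for this example.

```python
import mailbox


def mbox_to_maildir(mbox_path: str, maildir_path: str) -> int:
    """Copy every message from an mbox file into a Maildir; return the count.

    Hypothetical helper for illustration. It assumes the input is a
    well-formed mbox file and does no de-duplication.
    """
    # create=False: fail loudly if the source file is missing
    source = mailbox.mbox(mbox_path, create=False)
    # create=True: make the cur/, new/ and tmp/ subdirectories if needed
    dest = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for message in source:
            # add() writes the message as its own file inside the Maildir
            dest.add(message)
            copied += 1
    finally:
        source.close()
        dest.close()
    return copied
```

Each message becomes one file, so running something like this over the whole list history produces exactly the millions-of-files layout that comes up later in the thread.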
Konstantin Ryabitsev remarked:
The maildir format is kind of terrible for LKML, because having millions of messages in a single directory is very hard on the underlying FS. If you break it up into multiple folders, then it becomes difficult to search. This is the main reason why we have chosen to go with the public-inbox format, which solves both of these problems and allows for a very efficient archive updating and replication using git.
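To make the "break it up into multiple folders" option concrete, here is a rough sketch of one way to spread messages across subdirectories by hashing the Message-ID. It is not how public-inbox works, and nobody in the thread proposed this exact scheme; the shard_for name and the two-level, two-character layout are assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path


def shard_for(message_id: str, levels: int = 2, width: int = 2) -> Path:
    """Map a Message-ID to a nested subdirectory such as 'ab/cd'.

    Hypothetical layout: the first `levels * width` hex digits of the
    SHA-1 of the Message-ID pick the directory, so the same message
    always lands in the same place.
    """
    digest = hashlib.sha1(message_id.strip().encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return Path(*parts)


# Example: decide where a message would live under an archive root.
archive_root = Path("lkml-archive")  # placeholder directory name
target_dir = archive_root / shard_for("<example-message-id@example.com>")
```

With the defaults, there are 256 x 256 = 65,536 possible directories, so three million messages would average about 46 files per directory. That is easy on the filesystem, but it also shows the trade-off Konstantin describes: a search now has to walk tens of thousands of directories instead of one.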
Meanwhile, Jasper Spaans raised his eyebrows at Joey's statement that he'd gotten more than a million separate files by scraping lkml.org. Jasper said:
First of all, there are more than 3M messages stored in the lkml.org database, so I guess you've missed some messages or something is really broken. Besides, unless you figured out how to get to the raw data, you've just scraped a rendering which discards stuff like pgp signatures etc and has very incomplete headers. Unless you don't care for those of course.
Jasper added that he'd also been working on extracting Maildir-type data out of the lore website, and he sent Joey the code he'd been using to do that.
Eric Wong also sent Joey a script he'd been using to convert slrn threaded Usenet repositories to Maildir, although, like others, he recommended against putting millions (and millions) of files into a single directory.
The discussion wasn't headed anywhere; it was just various people sharing knowledge and making judgment calls.
Once upon a time, and a very long time ago it was, I wanted to get a hold of the earliest archives of Linux kernel development discussions. I asked everyone where I could find them, and one of the developers replied that he had a lot of that stuff mixed up in his mail archives, along with all manner of other email messages. I wrote back and eagerly told him I'd love to get my hands on it. He wrote back again, explaining that there was just no way he could take the time to extract the private stuff from the public stuff. And, that was the end of that. I've always wondered why he responded to my initial email in the first place, if he was just going to say no at the end. And, that's the tale of how I came this close to writing up summaries of the very earliest Linux developments.
Note: if you're mentioned above and want to post a response above the comment section, send a message with your response text to ljeditor@linuxjournal.com.