Internet Radio to Podcast with Shell Tools
It all started because I wanted to listen to “Hour of the Wolf” on WBAI radio—it's a cool science-fiction radio program hosted by Jim Freund that features readings, music, author interviews and good “I was there when...” kind of stories. Unfortunately for me, WBAI broadcasts from Long Island, New York, and is too far away from me to receive well. Plus, the show is on Saturday mornings from 5 to 7AM EST—not a really welcoming timeslot for us working folks.
Then, I discovered that WBAI has streaming MP3 audio on its Web site, which solved the reception problem. That left the Oh-Dark-Hundred problem—I'm normally settling into a deep sleep at that hour. And science-fiction buff or no, I'm not going to be catching Jim live any time soon.
What I needed was a VCR for Internet radio. Specifically, I wanted to capture the stream and save it to disk as an MP3 file, named with the show name and date. I would need to add the proper MP3 ID tags so I could load it into my Neuros audio player for convenient listening. It also would be awfully nice if I could let RSS-compatible software know that I've captured these files. That way, they would show up in a Firefox live bookmark or could be transferred to an iPod during charging. The ultimate effect would be to create an automatic podcast—a dynamically updated RSS feed with links to saved recordings—by snipping a single show out of an Internet media stream at regular intervals.
So, off I went to Google to search for “mp3 stream recording” and “tivo radio” and so on. I found many packages and Web sites, but nothing seemed quite right. Then, I heard a voice from my past—that of the great Master Foo in Eric S. Raymond's “The Rootless Root”, which said to me: “There is more UNIX-nature in one line of shell script than there is in ten thousand lines of C.” So, I wondered if I could accomplish the task using the tools already on the system, connected by a simple shell script.
You see, I already could play the stream by using the excellent MPlayer media player software. Due to patent problems, Fedora Core 3 doesn't ship with MP3 support, so I previously had downloaded and built MPlayer from source as part of the process of MP3-enabling my system. On a side note, MPlayer makes extensive use of the specific hardware features of each different CPU type, so it performs much better as a video player if it is built from source on the machine where you plan to use it. The command:
mplayer -cache 128 \ -playlist http://www.2600.com/wbai/wbai.m3u
served admirably to play the stream through my speakers. All that was left to do was convince MPlayer to save to disk instead. The MPlayer man page revealed -dumpaudio and -dumpfile <filename>, which work together to read the stream and silently save it out to disk, forever and ever. There's no time-out, so it captures until you kill the MPlayer process. Therefore, I wrote this script:
#!/bin/bash mplayer -cache 128 \ -playlist http://www.2600.com/wbai/wbai.m3u \ -dumpaudio -dumpfile test.mp3 & # the & sets the job running in the background sleep 30s kill $! # kill the most recently backgrounded job
which nicely captured a 30-or-so-second MP3 file to disk. The & character at the end of the mplayer command above is critical; it makes MPlayer run as a background task, so the shell script can continue past it to the next command, a timed sleep. Once the sleep is done, the script then kills the last backgrounded task, ending the recording. You may need to adjust the -cache value to suit your Internet connection or even substitute -nocache.
Now that part one was accomplished, I was on to part two—inserting the MP3 ID tags. Back on Google, I found id3v2, a handy little command-line program that adds tags to an MP3 file—and it's already in the Fedora Core distribution! It's amazing, the things that are lurking on your hard drive.
I now had the tools in place to capture and tag my favorite shows. With that in place, I was left with the task of coming up with some way to make a syndication feed from the stack of files. It turns out that RSS feeds are simple eXtensible Markup Language (XML) files that contain links to the actual data we want to feed, whether that be a Web page or, as in this case, an MP3 file.
Another quick look at Google brought me to the XML::RSS module for Perl. It's a complete set of tools that both can create new RSS files and add entries to existing ones. At this point, I thought I was almost done and put together a nice code example that almost worked. In true project timeline tradition, however, the last 5% of the project turned out to require 95% of the total time.
Once I had a script that did all I wanted it to do, I sent it in to LJ along with a first version of this article. LJ Editor in Chief Don Marti pointed out that I was missing one key component: my program was generating an RSS version 1.0 feed, but all the podcast-aware programs look for a version 2.0 feed—specifically for an XML tag named enclosure. Naturally, I assumed it would be a trivial change to my software, merely switching versions and adding the enclosure tag. I soon learned, however, that the XML::RSS Perl module can write RSS 2.0 but cannot read it. Several sleepless nights ensued, until I determined that Perl tools were available that could read RSS 2.0 but not write it. So, it was time to add some glue.
I started by adding two Perl modules to my system—you can install them (as root) with:
perl -e "install XML::RSS,XML::Simple" -MCPAN
You probably will be okay with answering any questions it asks with the default. If you haven't used the Comprehensive Perl Archive Network (CPAN) yet, it asks quite a few setup questions, such as choosing several mirror sites that are close to you. Otherwise, it simply asks about a dependency or two; say yes.
After the two modules and their required dependencies are installed, you need to create a new XML file with information about the show you want to capture. The great thing about XML is you can use any text editor to make a file that is readable by both humans and machines, making it easy to create, view, test and modify RSS feed files. Let's start with this skeleton, containing a basic title section:
<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0"> <channel> <title>Hour of the Wolf</title> <link>http://www.hourwolf.com</link> <description>Science Fiction Talk Radio with Jim Freund</description> <generator>WBAI Stream Capture using Linux shell tools</generator> </channel> </rss>
If you never have played with XML before, this is a good time to get your feet wet. A quick look at the file shows data items surrounded by HTML-like tags, where each <something> tag has a corresponding </something> to close the something section. This becomes more confusing later, though, when we add the alternate syntax, which looks like <tagname a=“A” b=“B” />.
Once I had gathered all the tools I needed, I added a few droplets of shell magic to arrive at this simple script:
#!/bin/bash # catchthewolf - capture "Hour of the Wolf" # For capturing the stream DATE=`date +%F` # Save the date as YYYY-MM-DD YEAR=`date +%Y` # Save just the year as YYYY FILE=/home/phil/wolf.$DATE.mp3 # Where to save it STREAM=http://www.2600.com/wbai/wbai.m3u DURATION=2.1h # enough to catch the show, plus a bit #DURATION=30s # a quick run, just for testing # For the RSS syndication XML="/home/phil/wolfrss.xml" # file for the RSS feed ITEMS=15 # Maximum items in RSS list XTITLE="Hour of the Wolf - $DATE Broadcast" XDATE=`date -R` # Date in RFC 822 format for RSS i=\$i;o=\$o;m=\$m # replace "$" in the perl script # For the id3v2 Tags AUTHOR="Jim Freund" ALBUM="WBAI Stream Rip" TITLE="Hour of the Wolf - $DATE" # Use mplayer to capture the stream # at $STREAM to the file $FILE /usr/local/bin/mplayer -really-quiet -cache 128 \ -dumpfile $FILE -dumpaudio -playlist $STREAM & # the & turns the capture into a background job sleep $DURATION # wait for the show to be over kill $! # end the stream capture # Tag the resulting captured .mp3 id3v2 -a "$AUTHOR" -A "$ALBUM" \ -t "$TITLE" -y $YEAR -T 1/1 -g 255 \ --TCON "Radio" $FILE # Add a new entry in the rss file, # keep the file to a max of $ITEMS entries, # and change the file's date to right now. /usr/bin/perl -e "use XML::RSS; use XML::Simple; \ $i=XMLin('$XML');$o=$i;bless $o,XML::RSS; \ $m=$i->{channel}{item};if((ref $m)ne ARRAY) \ {$o->add_item(%$m);} else \ {foreach $m (@{$m}) {$o->add_item(%$m);}} \ $o->channel(lastBuildDate=>'$XDATE', \ pubDate=>'$XDATE'); \ $o->add_item(title=>'$XTITLE', \ link=>$o->{'channel'}{'link'}, \ pubDate=>'$XDATE', \ enclosure=>{url=>'file://$FILE', \ length=>(stat('$FILE'))[7], \ type=>'audio/mpeg'}, mode=>'insert'); \ pop(@{$o->{'items'}}) \ while (@{$o->{'items'}}>$ITEMS); \ $o->{encoding}='UTF-8'; $o->save('$XML');" echo "Caught the wolf."
This doesn't look too simple, though. Let's dissect this script a bit to see how it all works. Notice the back-ticks (`) around the date commands. They take whatever is enclosed in the `` marks and run it as a command and then replace the entire `whatevercommand` with the output from that command. If I had needed the date only once, I could have written:
FILE=wolf.`/bin/date +%F`.mp3
or even:
/usr/local/bin/mplayer -dumpaudio \ -dumpfile "wolf.`/bin/date +%F`.mp3" \
But because I wanted the date for the filename, the tag and the RSS feed, I stored it in the $DATE shell variable. That makes it much easier to change the script around too. I now have several scripts that capture streams, and the only things that have to change are the variable assignments at the top.
Back-ticks are one of the shell's tools that allow us to merge simple commands into powerful assemblies. You can play with this more by using the echo command. Try, for example:
echo "wolf.`date +%F`.mp3"
to see what the filename would be in that last call to MPlayer.
We use the +%F formatting option to date, because the default date string is full of spaces. Also, my USA locale's date string has / characters in it—not the best thing to try to put inside a filename. Furthermore, the yyyy-mm-dd format means the files sort nicely by date when you list the directory. The RSS feed wants its date in RFC 822 format, so we wind up calling /bin/date three times in all.
Notice also that I'm giving the exact path to some of the executable commands. I do this so that when the script runs as a timed task, it won't have my personal shell's path settings. If you're unsure where a file lives, find it with which:
[phil@asylumhouse]$ which date /bin/date
You're safe to leave off /bin and /usr/bin, but any other path should be specified explicitly, as should paths to any executable that exists as different versions in multiple locations.
The call to id3v2 tags the file as track 1 of 1, with proper author, album, title and year entries. The predefined genre number of 255 means Other. The --TCON entry fills in Radio in place of one of the predefined genres on any software that understands version 2 MP3 tags.
Lastly, the one-line Perl script at the end is a compressed version of this:
#!/usr/bin/perl use XML::RSS; use XML::Simple; $in=XMLin('/home/phil/wolfrss.xml'); $out=$in; # copy the parsed RSS file's tree bless $out, XML::RSS; # make the copy an XML::RSS # blessing doesn't copy the items. Drat! $item = $in->{channel}{item}; if ((ref $item) ne ARRAY) { # only one item in feed $out->add_item(%$item); } else { # a list of items - foreach the list foreach $item (@{$item}) { $out->add_item(%$item); } } # Encoding doesn't transfer either. $out->{encoding}='UTF-8'; # Date the file so client software knows it changed $date = `date -R`; $out->channel( lastBuildDate=>'$date', pubDate=>'$date'); # Add our newest captured file $file = "/home/phil/wolfcaught.mp3"; $out->add_item( title => "Hour of the Wolf", link => $out->{'channel'}{'link'}, pubDate => '$date', enclosure => { url=>"file://$file", length => (stat($file))[7], type => 'audio/mpeg' }, mode => 'insert'); # Don't have more than 15 items in the podcast while (@{$out->{'items'}} > 15) { pop(@{$out->{'items'}}; } # Write out the finished file $out->save('/home/phil/wolfrss.xml');"
Here I use XML::Simple to read and parse the existing .RSS file and XML::RSS to add our new item and write the modified version. The bless function tells Perl that the XML::Simple object $out now should be treated as an XML::RSS object. The only reason this does anything useful is the two modules use nearly identical variable names internally, derived from the tag names of the incoming RSS file.
This bless function copies over almost anything in the RSS file's header, but it doesn't bring over item or encoding tags. So I then copied over each item in a foreach loop, added today's date as the build and publication date and added the just-captured file as a new item. This item has a Web page link that is copied from the header, today's date as publication date and the all-important enclosure tag. The enclosure has a URL, in this case a file:// reference, because we are doing everything on the local filesystem. It also has a file length and a MIME type, audio/mpeg.
Shell variables replace all the quoted strings, and the super-sneaky shell variables $i, $o and $m get replaced by \$i, \$o and \$m. In other words, everywhere you see $i in the Perl script, the Perl interpreter actually gets the Perl variable name $i. Without that bit of substitution, the shell would replace each $i with a null string or, worse yet, whatever the shell variable i happened to hold before the script was executed. The reference to the actual MP3 file is a URL, file:///home/phil/wolf.2005-03-19.mp3, not merely a filename. When we enter the RSS feed file into Firefox or a feed aggregator program, we refer to it using URL notation as well, file:///home/phil/wolfrss.xml.
It may seem strange that I'm calling a scripting language from another scripting language. The point is that I'm using each to do the things it's best at. Bash is designed to execute commands, and it's really easy to start a background process, find out its process ID and kill it again. On the other hand, trying to add an XML entry in Bash using the more basic string-handling tools, such as sed and grep, would have been, well, exactly the kind of thing that drove Larry Wall to write Perl in the first place.
Now that we have a script, we make the file executable and run it:
chmod +x catchthewolf ./catchthewolf
which results in a properly tagged MP3 file and a new entry in the wolfrss.xml RSS feed. When testing, you can uncomment the 30-second test line to make sure everything's working properly, but be sure to comment it back out before trying to catch a show.
Now all that's left is to get our computer to run this thing at 5AM on Saturday. That's done by using the system's cron utility—invoke crontab -e— and adding an entry like this:
MAILTO=phil # Testing: mail script output to me # Catch hour of the wolf 5AM Saturdays 59 4 * * sat /home/phil/catchthewolf
crontab's editor is most likely to be set to vi-style commands, so you have to use i to start typing and <Esc>:wq to save-and-exit. When you're done, you should see this message:
crontab: installing new crontab
which says you're all set. Check man 5 crontab for more information on how to make jobs repeat every day, once a month or whatever. You also want to make sure your user name is in the file /etc/cron.allow—the list of who can run jobs on the system's scheduler. If you're running on a remote system, verify with the administrators that you're allowed to run cron jobs.
To see the resulting podcast, point your RSS-aware software at the XML file the script creates. In Firefox, use Bookmarks→Manage Bookmarks→Add Live Bookmark, and remember to enter the URL starting with file:// and not the filename itself.
By taking two programs already on the hard drive, downloading two Perl modules and writing a few lines of shell script, we have assembled a homebrew Webcast recording system that saves our favorite programs for us to listen to whenever we choose. It also lets us know what it has done by popping up live bookmarks in Firefox and automatically transfers the recordings to our MP3 player. Some scripts for capturing other Internet radio shows will be available on the Linux Journal FTP site (see the on-line Resources). Now I just have to remember to delete the older files before my hard drive fills up with leftover Webcasts.
Thanks to Anne Troop, Jen Hamilton and Chris Riley for their many shell-scripting hints over the years; to Anne's friend Janeen Pisciotta for finding “Hour of the Wolf” for us in the first place; and to LJ Editor in Chief Don Marti for the cool podcast idea.
Streaming Formats
When streaming radio first came out, it often was transmitted in proprietary data formats, making it tough for Linux users to listen. Now most streams are MP3, but there still may be something in a different format that you want to capture, such as BBC Radio's RealPlayer streams—see the on-line Resources for a link. Assuming that it's something MPlayer can handle, we simply can rearrange our process a bit. Tell MPlayer to write audio data to the disk in the form of a WAV file and then encode it using lame for MP3 or oggenc for ogg files. Be aware, though, that lame is not included with Fedora, again due to patent issues.
The audio capture commands then would look like:
# Use mplayer to capture the stream # at $STREAM to the file $FILE /usr/local/bin/mplayer -really-quiet -cache 500 \ -ao pcm:file="$FILE.wav" -playlist $STREAM & # the & turns the capture into a background job sleep $DURATION # wait for the show to be over kill $! # kill the stream capture # Encode to .ogg, quality 2, and tag the file oggenc -q 2 -t $TITLE -a $AUTHOR -l $ALBUM \ -n "1/1" -G "Radio" -R 16000 -o $FILE $FILE.wav rm $FILE.wav # Remove the raw audio data file
followed by the original call to the Perl script. No need to use id3v2 here, as both the lame and oggenc encoders insert tags as part of the encoding process. We wind up with the same result as capturing an MP3 stream directly. But because of the intermediate WAV file's large size, we need much more disk space during the actual capture process. The optional -R 16000 specifies the sample rate of the captured WAV file—this is needed only if MPlayer does not correctly detect the speed of the incoming audio stream and your captured MP3 sounds like whale song or chipmunks. You probably want to comment out the rm command until you're sure the encoding is working the way you want it to and remove the WAV files manually until then.
What Is This Thing Called RSS?
RSS stands for Rich Site Summary.
RSS stands for RTF Site Summary.
RSS stands for Really Simple Syndication.
Everything else about RSS is as confused as its acronym. The idea started out as the ability to read headlines from Web sites without having to download the entire front page. RSS is implemented in eXtensible Markup Language (XML), which makes it easily read and written by both humans and computers. That means the format for the RSS file is standardized—unfortunately, the content is not. There are at least four versions of RSS floating around—0.9, 0.91, 1.0 and 2.0—that have similarities, differences and interoperability issues galore. The basic RSS file contains a title, a publication date and a group of items. Each item has its own title, date and link to the file containing the article content. The variations between versions mean that any software wanting to read or write these files has to be programmed specifically to understand each version—there is not enough backward compatibility to let things simply work.
Even the version numbering is odd—version 2.0 is descended from version 0.91, not version 1.0. Version 1.0 is the most feature-rich and extensible, supporting dynamic definitions of the tag names through links to special machine-readable Web pages. Version 2.0 extends the original concept to allow more complex summaries that include images and music rather than only lines of text; it does so through the use of the enclosure tag. Enclosures work like attachments to e-mail messages. When the RSS-aware program downloads the site summary, it notices the attachments and downloads them too. This extends the concept of a summary to being a list of contents, plus the contents itself—far from the original concept of RSS, but this is becoming its biggest use today.
Resources for this article: /article/8402.
Phil Salkie is an industrial controls guru who has liked science fiction and radio drama since childhood. He has been a Linux fanatic since 2.0.12 or so and has the most wonderful, tolerant family—e-mail him at phil@asylumhouse.org.