At the Forge - Bloglines Web Services, Continued
I am writing this column a few days after the November 2, 2004, elections in the United States. As an admitted political junkie, I enjoy the modern era of computerized, always-on punditry. No longer must I switch TV stations or read several newspapers at the local library; now, I can follow the sound bites as they pass from the candidates to the press to the various partisan sites.
Keeping up with many different news and opinion sites can consume quite a bit of time. As we have seen over the last few months, everyone has benefited from the creation of news aggregators—programs that read the RSS and Atom syndication feeds produced by Weblogs, newspapers and other frequently updated sites. An aggregator, as its name suggests, takes these feeds and puts them into a single, easily accessible listing.
Bloglines.com is an Internet startup that provides a Web-based news aggregator. In and of itself, this should not surprise anyone; the combination of syndication, aggregation and the Web made this a natural idea. And, Bloglines isn't unique; there are other, perhaps lesser-known, Web-based news aggregators.
One unique service that Bloglines offers its subscribers, however, is the ability to use Bloglines' internal database to create their own news aggregators or their own applications built from the data Bloglines has collected. This information is available without charge, under a fairly unrestrictive license, to any programmer interested in harvesting the results of Bloglines' engine. The fact that Bloglines checks for updates on hundreds of thousands of blogs and sites approximately every hour means that someone using the Web services API can be assured of getting the most recent Weblog content.
Last time [LJ, January 2005], we looked at the Notifier API, which provides access to a particular user's available-but-unread feeds. We also discussed the Blogroll API, which allows users to determine and use programmatically, if they wish, a list of people who are pointing to a feed. As we saw, these APIs made it easy for us to find out that new Weblog entries were available or to create our own custom aggregation page listing Weblogs of interest.
Something was missing from the functionality we exposed in that article, however. It's nice to know that unread Weblog entries are waiting among my Bloglines subscriptions, but it would be even nicer to know which blogs have been updated. And, it's nice to get a list of my current subscriptions, but I would be much happier to find out which of them have been updated—and to find out when they were most recently updated, how many new entries are in each Weblog and what those entries contain. In other words, I want to be able to replace the current Bloglines interface with one of my own, displaying new Weblog entries in a format that isn't dictated by the Bloglines.com Web site.
Luckily, the Web services developers at Bloglines have made it possible to do exactly this by way of the sync API. This month, we continue our exploration of Bloglines Web services, looking in detail at the sync API it provides. We also are going to create a simple news aggregator of our own, providing some of the same features as the Bloglines interface.
At the end of the day, a news aggregator such as Bloglines is simply a list of URLs. Indeed, the Python-based news aggregator we created two months ago using the Universal Feed Parser was precisely such a program—it looked at a set of URLs in a file and retrieved the most recent items associated with those URLs. Each individual Weblog posting is associated with one of the URLs on that list; removing a URL from the subscription list makes its associated postings irrelevant, and invisible, to that user.
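To make that idea concrete, here is a minimal single-user sketch of the same approach in Perl, rather than the Python program from that earlier column. It assumes the CPAN modules XML::Feed and URI are installed; the feeds.txt file name and its contents are placeholders of my own.

#!/usr/bin/perl

# A minimal single-user aggregator sketch: read feed URLs from a file
# and print the current items in each feed. The feeds.txt file name is
# a placeholder; XML::Feed and URI come from CPAN.
use strict;
use warnings;

use URI;
use XML::Feed;

# One feed URL per line
open my $fh, '<', 'feeds.txt' or die "Cannot open feeds.txt: $!";

while (my $url = <$fh>) {
    chomp $url;
    next unless $url;

    # Retrieve and parse the feed; parse() returns undef on failure
    my $feed = XML::Feed->parse(URI->new($url));

    unless ($feed) {
        warn "Could not parse '$url': ", XML::Feed->errstr, "\n";
        next;
    }

    print $feed->title, "\n";

    # Show each item's title and link
    foreach my $entry ($feed->entries) {
        print "\t", $entry->title, " (", $entry->link, ")\n";
    }
}

close $fh;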
The fact that Bloglines has multiple users rather than a single user means it must keep track not only of a set of different URLs, but also of which URLs are associated with each user. Although this complicates things somewhat, modern high-level languages make the step from one data structure to the other easy to understand. Rather than simply storing a list of URLs, we must create a hash table in which the key is a user ID and the value is the list of subscriptions associated with that particular user. Once we have a user's unique ID, we easily can keep track of that user's subscriptions.
Of course, Bloglines keeps track of subscriptions not for a few thousand users, but for many tens or hundreds of thousands of users. Thus, it is safe to assume it is not using such a naive implementation, which would suffice for a small experiment or an aggregator designed for a small number of people. Things get a bit trickier when you approach Bloglines' user load. Each entry in a user's subscription list is unlikely to be a raw URL; it is more likely an ID number (or primary key, in database jargon) associated with a URL. Such a system lets multiple users share a single entry for a site's syndication feed and allows Bloglines to suggest new Weblogs that they might enjoy, based on their current subscriptions.
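A toy sketch of that bookkeeping in Perl might look like the following; the user names, feed IDs and URLs are invented purely for illustration.

#!/usr/bin/perl

# A toy sketch of multi-user subscription bookkeeping. All user names,
# feed IDs and URLs here are invented for illustration.
use strict;
use warnings;

# Each feed URL is stored once, keyed by a numeric ID (a "primary key")
my %feed_url = (
    101 => 'http://example.com/index.rss',
    102 => 'http://example.org/weblog/atom.xml',
);

# Each user is mapped to a list of feed IDs, not to the URLs themselves
my %subscriptions = (
    alice => [101, 102],
    bob   => [102],
);

# Finding one user's subscriptions is a hash lookup plus an ID-to-URL lookup
foreach my $feed_id (@{ $subscriptions{alice} }) {
    print "alice subscribes to $feed_url{$feed_id}\n";
}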
It thus should come as no surprise to learn that retrieving new Weblog postings from Bloglines is a two-step process, with the first step requiring us to retrieve a list of subscriptions. That is, we first ask Bloglines for a list of subscription IDs associated with a user. We then ask Bloglines to send us all of the new items for this user and this subscription ID.
Implementations of the Bloglines Web services API are available in several different languages. Because Perl is my default language for creating new applications, I am going to use the WebService::Bloglines module that has been uploaded to CPAN, the Comprehensive Perl Archive Network, a worldwide collection of Web and FTP servers from which Perl and its modules can be retrieved. For example, Listing 1 contains a simple program (bloglines-listsubs.pl) that displays the title, subscription ID and URL for each of a user's subscriptions. A number of additional values are available for each of the subscriptions; the documentation for WebService::Bloglines, as well as the Bloglines API documentation, lists these in detail.
Listing 1. Display a User's Subscriptions
#!/usr/bin/perl

use strict;
use diagnostics;
use warnings;

use WebService::Bloglines;

my $username = 'reuven@lerner.co.il';
my $password = 'MYPASS';

my $bloglines =
    WebService::Bloglines->new(username => $username,
                               password => $password);

# Do we want to mark them as read?
my $mark_unread = 0;

# From what date do we want to download items?
# (This should be in Unix "time" format.)

my $subscriptions = $bloglines->listsubs();

if ($subscriptions) {
    # list all feeds
    my @feeds = $subscriptions->feeds();

    # Get each feed's title and URL
    foreach my $feed (@feeds) {
        my $title = $feed->{title};
        my $url   = $feed->{htmlUrl};
        my $subId = $feed->{BloglinesSubId};

        print "Subscribed to '$title', " .
              "subId '$subId' at '$url'\n";
    }
}
else {
    print "No subscriptions.\n";
}
If you are interested in preserving the subscription hierarchy the Bloglines.com interface gives users, you might want to examine the folders function, rather than the feeds function used in Listing 1. Although feeds returns a flat list of subscriptions, folders keeps things organized as they are on the Bloglines site.
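A rough sketch of that folder-oriented approach might look like the following. Only the folders function is named above; the feeds_in_folder call and the title and BloglinesSubId keys on each folder are my assumptions about the WebService::Bloglines interface, so check the module's documentation before relying on them.

# A hedged sketch of walking the folder hierarchy instead of the flat list.
# folders() is named in the text above; feeds_in_folder() and the
# {title}/{BloglinesSubId} keys on each folder are assumptions about the
# WebService::Bloglines interface rather than something shown in Listing 1.
my $subscriptions = $bloglines->listsubs();

foreach my $folder ($subscriptions->folders()) {
    print "Folder: $folder->{title}\n";

    foreach my $feed ($subscriptions->feeds_in_folder($folder->{BloglinesSubId})) {
        print "\t$feed->{title} ($feed->{htmlUrl})\n";
    }
}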
Now that we know how to retrieve the subscription IDs associated with a particular Bloglines user, we can retrieve the individual items associated with a particular subscription ID. For example, Listing 2 is a short program that retrieves all of a user's subscriptions and then displays all of the newly updated items for each one. The output is in plain-text format, not in HTML, which means the displayed links are not clickable. But, it would not be particularly difficult to run such a program in a cron job and dump its output into an HTML file, thereby producing an up-to-the-minute personalized list of feeds. Of course, Bloglines already provides exactly such a view, at no cost, whenever you visit its Web site. So, although this program is an interesting demonstration of the Bloglines Web services, on its own it doesn't offer anything the Bloglines site doesn't already provide.
Listing 2. bloglines-getitems.pl
#!/usr/bin/perl

use strict;
use diagnostics;
use warnings;

use WebService::Bloglines;

my $username = 'reuven@lerner.co.il';
my $password = 'MYPASS';

my $bloglines =
    WebService::Bloglines->new(username => $username,
                               password => $password);

# Do we want to mark them as read?
my $mark_unread = 0;

# From what date do we want to download items?
# (This should be in Unix "time" format.)

my $subscriptions = $bloglines->listsubs();

if ($subscriptions) {
    # list all feeds
    my @feeds = $subscriptions->feeds();

    foreach my $feed (@feeds) {
        my $title = $feed->{title};
        my $url   = $feed->{htmlUrl};
        my $subId = $feed->{BloglinesSubId};

        print "Subscribed to '$title', " .
              "subId '$subId' at '$url'\n";

        my $update;

        # Trap errors!
        eval { $update = $bloglines->getitems($subId); };

        # Keep track of errors, showing "no change"
        if ($@) {
            if ($@ =~ /^304 No Change/) {
                print "\t No change\n";
            }
            else {
                print "\t Error code '$@' " .
                      "retrieving updates.\n";
            }
        }
        # No errors? Show some basics about the items.
        else {
            foreach my $item ($update->items) {
                my $title   = $item->{title};
                my $creator = $item->{dc}->{creator};
                my $link    = $item->{link};
                my $pubDate = $item->{pubDate};

                print "\t$title by $creator " .
                      "on $pubDate ($link)\n";
            }
        }
    }
}
else {
    print "No subscriptions.\n";
}
One of the clever things that Bloglines has done in its Web services definition is to use HTTP return codes to indicate errors and unusual circumstances. For example, the 200 (OK) response code indicates that new items may be read and that getitems($subId) contains one or more such data structures. The 304 (unchanged) response code, which normally indicates a page of HTML has not changed since it last was requested, here has a slightly different function; it indicates that a particular subscriber already has seen all of the available items for this subscription. Other response codes (401, 403 and 410) indicate authentication errors and probably mean that the requesting user has made a mistake in typing the Bloglines user name, password or both.
Unfortunately, Perl's handling of such response codes is less than optimal. In order to handle them, we must invoke $bloglines->getitems() inside of an eval block and check for a non-empty value of $@ immediately after the eval. If $@ is empty, we can assume that we received a 200 (OK) HTTP response code and there are new items to read. But if it contains a value, we then can rewrite the output message, as we did in Listing 2. If we fail to trap this method call within an eval block, however, our program will die with a fatal runtime error the first time we receive anything other than a 200 response code.
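In isolation, the pattern looks like the following, assuming $bloglines and $subId are set up as in Listing 2. The 304 test follows the "304 No Change" string we matched there; the exact wording of the 401, 403 and 410 messages is an assumption, so I match only on the leading status code.

# The eval/$@ pattern in isolation, assuming $bloglines and $subId are
# set up as in Listing 2. Matching on the leading status code follows the
# "304 No Change" string seen there; the exact text accompanying 401, 403
# and 410 is an assumption, so only the numeric code is checked.
my $update;
eval { $update = $bloglines->getitems($subId); };

if ($@) {
    if ($@ =~ /^304/) {
        print "\tNo new items for this subscription.\n";
    }
    elsif ($@ =~ /^(401|403|410)/) {
        print "\tAuthentication problem ($1): check your user name and password.\n";
    }
    else {
        print "\tUnexpected error: $@\n";
    }
}
else {
    # A 200 (OK) response means $update now contains the new items
    my @items = $update->items;
    print "\tRetrieved ", scalar(@items), " new items.\n";
}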
Finally, two optional parameters round out the Bloglines functionality. The first, known as n, is a simple true-or-false (1 or 0) value that tells Bloglines whether it should update the already-seen bit for the articles it sends you. Normally, when a user views Weblog postings with the Bloglines.com Web interface, this is set to 1, so the user does not see any already-seen articles a second time. Perhaps because its developers knew the Web services API currently supplements other news aggregation applications, Bloglines wisely made 0 the default in this API.
The second optional parameter, known as d, tells Bloglines the first date from which you would like to download a particular site's postings. The value is in UNIX time format, meaning that you send the number of seconds since January 1, 1970. This number is readily available with the time function in most major languages, and it allows you to indicate with great precision exactly how far back you want to delve into a particular site's history, as stored by Bloglines.
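Putting the two parameters together, the call from Listing 2 might be extended along the following lines, again assuming $bloglines and $subId are set up as in that listing. Passing n and d as the second and third arguments to getitems() is my reading of the WebService::Bloglines interface, so verify the signature against the module's documentation.

# A hedged sketch of getitems() with both optional parameters, assuming
# $bloglines and $subId are set up as in Listing 2. Passing the mark-as-read
# flag (n) and the Unix timestamp (d) as the second and third arguments is
# my reading of the WebService::Bloglines interface; verify it against the
# module's documentation.
my $mark_read = 0;                      # n: leave the items marked as unread
my $since     = time() - 24 * 60 * 60;  # d: only items from the last 24 hours

my $update;
eval { $update = $bloglines->getitems($subId, $mark_read, $since); };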
To be honest, I am an enthusiastic Bloglines user without being sure exactly where the site and company are headed. I cannot imagine that it will continue to be free of charge and of any advertising indefinitely, unless its investors are highly charitable or extremely naive. I enjoy its fine interface, the fact that I easily can access the Weblogs on which I have depended for political insight—or screaming, depending on how you interpret such punditry—and its speedy, robust functionality.
But as Amazon, eBay and Google have demonstrated over the last few years, providing a Web services interface to your core data opens the door to many new creative applications that a company's internal developers never think to create. Bloglines is only beginning to expose its functionality with Web services, and although it has taken only an initial and tentative step in this direction, what I have seen appears to be promising. I look forward to seeing applications that will be built on top of this API, as well as the additional APIs that Bloglines and its competitors will offer in an attempt to make Bloglines the central site for Weblogs, readers and developers alike.
Resources for this article: www.linuxjournal.com/article/7961.
Reuven M. Lerner, a longtime Web/database consultant and developer, now is a graduate student in the Learning Sciences program at Northwestern University. His Weblog is at altneuland.lerner.co.il, and you can reach him at reuven@lerner.co.il.