Using Perl to Check Web Links
One of the first things I did when I got my first Internet account was put together my own set of web pages. The one I get the most comments about is called “Weirichs on the Web”, where I link to other Weirichs I have found on the Web. Although the page is a lot of fun, keeping its links up to date can be very tedious. As the web pages I reference are moved or deleted, links to them become stale. Without constant checking, it is difficult to keep my links current.
So, I began to wonder, is there a way to automatically find the outdated links in a web page? What I needed was a script that would scan all of my web pages and report every bad HTML link along with the web page on which it was used.
There are several parts to this problem. Our script must be able to:
fetch a web document from the Web
extract a list of URLs from a web document
test a URL to see if it is valid
We could write code by hand to extract URLs and validate them, but there is a much easier way. LWP is a Perl library (available from any CPAN archive site) designed to make accessing the World Wide Web very easy in Perl. LWP uses Perl objects to provide Web-related services to a client. Perl objects are a recent addition to the Perl language and many people might not be familiar with them.
Perl objects are references to “things” that know what class they belong to. These “things” are usually anonymous hashes but you don't need to know this to use an object. Classes are packages that provide the methods the object uses to implement its behavior. And finally, a method is a function (in the class package) that expects an object reference (or sometimes a package name) as its first argument.
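To make these three terms concrete, here is a minimal sketch of a class, an object and a method. The Counter package and its methods are made up for illustration and have nothing to do with LWP.

```perl
use strict;
use warnings;

# A class is just a package. "Counter" is a hypothetical name.
package Counter;

# Constructor: called as Counter->new(5), so the first argument
# is the package name.
sub new {
    my ($class, $start) = @_;
    my $self = { count => defined $start ? $start : 0 };  # the anonymous hash
    return bless $self, $class;   # now the hash knows which class it belongs to
}

# Method: the first argument is the object reference.
sub increment {
    my ($self) = @_;
    return ++$self->{count};
}

sub value {
    my ($self) = @_;
    return $self->{count};
}

package main;

my $counter = Counter->new(5);
$counter->increment;
print "Class: ", ref($counter), ", value: ", $counter->value, "\n";
```

The call to bless is what turns an ordinary hash reference into an object; everything else is plain Perl.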
If this sounds confusing, don't worry. Using objects is very easy. LWP defines a class called HTTP::Request that represents a request to be sent on the Web. The request to GET a document at URL http://w3.one.net/~jweirich can be created with the statement:
$req = new HTTP::Request 'GET', 'http://w3.one.net/~jweirich';
new creates a new Request object initialized with the GET and http://w3.one.net/~jweirich parameters. This new object is assigned to the $req variable.
Calling a method of an object is equally straightforward. For example, if you want to examine the URL for this request, you can invoke the url method on this object.
print "The URL of this request is: ", $req->url, "\n";
Notice that methods are invoked using the -> syntax. C++ programmers should feel comfortable with this.
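The constructor can be written with the arrow syntax as well. The following sketch builds the same request that way and inspects it through the method and uri accessors, which are the names documented by current releases of HTTP::Request; no network access is involved.

```perl
use strict;
use warnings;
use HTTP::Request;

# Arrow-style constructor call; same arguments as "new HTTP::Request ...".
my $req = HTTP::Request->new(GET => 'http://w3.one.net/~jweirich');

# Accessors return the pieces the request was built from.
print "Method: ", $req->method, "\n";
print "URL:    ", $req->uri, "\n";
```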
All the knowledge about fetching a document across the Web is stored in a UserAgent object. The UserAgent object knows how long to wait for responses, how to handle errors, and what to do with the document when it arrives. It does all the hard work—we just need to give it the right information so that it can do its job.
use LWP::UserAgent;
use HTTP::Request;
$agent = new LWP::UserAgent;
$req = new HTTP::Request ('GET', 'http://w3.one.net/~jweirich/');
$agent->request ($req, \&callback);
This snippet of Perl code creates a UserAgent and a Request object. The request method of UserAgent issues the request and, each time a chunk of data from the arriving document is received, calls a subroutine named callback with that chunk. The callback subroutine may be called many times before the complete document has been received.
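Here is a sketch of what such a callback might look like. The fetch_in_chunks helper is a hypothetical name, and the demonstration fetches a small temporary file through a file: URL so that it runs without a network connection; an http: URL would be passed in the same way.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use LWP::UserAgent;
use HTTP::Request;
use URI::file;

# Hypothetical helper: fetch $url, collecting the document chunk by chunk.
sub fetch_in_chunks {
    my ($url) = @_;
    my ($document, $chunks) = ('', 0);

    my $agent = LWP::UserAgent->new;
    my $req   = HTTP::Request->new(GET => $url);

    # The callback receives each chunk of data plus the response object.
    my $res = $agent->request($req, sub {
        my ($data, $response) = @_;
        $document .= $data;
        $chunks++;
    });
    return ($res, $document, $chunks);
}

# Stand-in document written to a temporary file (made-up content).
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print $fh "<html><body><a href=\"other.html\">Other</a></body></html>\n";
close $fh;

my ($res, $document, $chunks) =
    fetch_in_chunks(URI::file->new($path)->as_string);
print "Status: ", $res->status_line, "\n";
print "Received ", length($document), " bytes in $chunks chunk(s)\n";
```

Note that when a callback is supplied, the document content is handed to the callback rather than stored in the response object.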
We could use regular expressions to parse the incoming document to determine the location of all the links, but when you begin to consider that HTML tags may be broken across several lines and all the little variations involved, it becomes a more difficult task. Fortunately, there is an HTML parsing object available in the LWP library, called HTML::LinkExtor, which extracts all the links from an HTML document.
The parser is created and then fed pieces of the document until it reaches the end of the document. Whenever the parser detects links buried in HTML tags, it calls another callback subroutine that we provide. Here is an example that extracts and prints all the links in a document.
use HTML::LinkExtor;
$parser = new HTML::LinkExtor (\&LinkCallback);
# call parse once for each chunk of the document as it arrives
$parser->parse ($chunk);
$parser->parse ($chunk);
$parser->parse ($chunk);
$parser->eof;
sub LinkCallback {
    my ($tag, %links) = @_;
    print join ("\n", values %links), "\n";
}
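To see the parser cope with tags that are split across chunks, the following self-contained sketch feeds it a made-up document in three pieces. The extract_links helper and the sample HTML are assumptions for illustration only.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: feed the chunks to a parser and return every link found.
sub extract_links {
    my @chunks = @_;
    my @found;

    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %links) = @_;       # e.g. ('a', href => '...')
        push @found, values %links;
    });

    $parser->parse($_) for @chunks;   # one call per chunk
    $parser->eof;                     # signal the end of the document
    return @found;
}

# The chunk boundaries deliberately fall in the middle of tags.
my @links = extract_links(
    '<html><body><a hr',
    'ef="http://www.perl.com/">Perl</a> <img s',
    'rc="logo.gif"></body></html>',
);
print "$_\n" for @links;
```

Because no base URL is given to the parser, relative links such as logo.gif are reported exactly as they appear in the document.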
We now have all the tools we need to build our checklinks script. We will define two operations for URLs. When we scan a URL, we will fetch the document (using a UserAgent) and scan it for internal HTML links. Every new link we find will be added to a list of URLs to be checked.
When we check a URL, we test whether it points to a valid web document. We could try retrieving the entire document to see if it exists, but HTTP defines a HEAD request that returns only the document's date, length and a few other attributes. Since a HEAD request can be much faster than a full GET for large documents, and since it tells us what we need to know, we will use the head() function of the LWP::Simple package to check a URL. If head() returns nothing (an empty list), the document specified by the URL cannot be fetched and we add the URL to a list of bad URLs. If head() returns a list of attributes, the URL is valid and it is added to the list of good URLs. Finally, if the valid URL points to a page in our local web space and ends with “.html” or “.htm”, we add the URL to a list of URLs to be scanned.
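The two decisions made during a check can be sketched as a pair of small subroutines. The names url_is_valid and should_scan, and the $LOCAL_PREFIX value that defines the “local web space”, are assumptions for this sketch and are not taken from Listing 1.

```perl
use strict;
use warnings;
use LWP::Simple qw(head);

# Assumed definition of "our local web space" (hypothetical value).
my $LOCAL_PREFIX = 'http://w3.one.net/~jweirich/';

# True if a HEAD request for $url succeeds.
sub url_is_valid {
    my ($url) = @_;
    my @info = head($url);   # empty list when the document cannot be fetched
    return @info ? 1 : 0;
}

# True if $url is a local page whose name ends in .html or .htm.
sub should_scan {
    my ($url) = @_;
    return 0 unless index($url, $LOCAL_PREFIX) == 0;
    return $url =~ /\.html?$/ ? 1 : 0;
}

# Check any URLs given on the command line.
for my $url (@ARGV) {
    my $status = url_is_valid($url) ? 'good' : 'BAD';
    my $scan   = should_scan($url)  ? ' (would be scanned)' : '';
    print "$status: $url$scan\n";
}
```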
The scanning process produces more URLs to be checked. Checking these URLs produces more URLs that need to be scanned. As URLs are checked, they are moved to the good or bad list. Since we restrict scanning to URLs in our local web space, eventually we will scan all local URLs that are reachable from our starting document.
When there are no more URLs to be scanned and all URLs have been checked, we can print the list of bad URLs and the list of files that contain them.
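The interplay between the two lists can be sketched as a simple work-queue loop. This is not the code from Listing 1: the check_links subroutine is a hypothetical outline, and the three operations (validity test, link extraction, scan decision) are passed in as code references so the loop can be tried against a made-up, in-memory “site” instead of the real Web.

```perl
use strict;
use warnings;

# Returns (\%bad, \%referrers), where $referrers{$url}{$page} is set
# for every $page that links to $url.
sub check_links {
    my ($start, $is_valid, $links_in, $should_scan) = @_;
    my (%good, %bad, %referrers);
    my @to_check = ($start);
    my @to_scan;

    while (@to_check || @to_scan) {
        # Check every pending URL, moving it to the good or bad list.
        while (defined(my $url = shift @to_check)) {
            next if $good{$url} || $bad{$url};   # already checked
            if ($is_valid->($url)) {
                $good{$url} = 1;
                push @to_scan, $url if $should_scan->($url);
            } else {
                $bad{$url} = 1;
            }
        }
        # Scan one document; its links become URLs to be checked.
        if (defined(my $page = shift @to_scan)) {
            for my $link ($links_in->($page)) {
                $referrers{$link}{$page} = 1;
                push @to_check, $link;
            }
        }
    }
    return (\%bad, \%referrers);
}

# A made-up site: page name => links found on that page.
my %site = (
    'index.html' => ['a.html', 'gone.html', 'http://elsewhere/x.html'],
    'a.html'     => ['index.html', 'missing.html'],
    'http://elsewhere/x.html' => ['never-scanned.html'],
);

my ($bad, $refs) = check_links(
    'index.html',
    sub { exists $site{$_[0]} },                           # is the URL valid?
    sub { @{ $site{$_[0]} || [] } },                       # links on a page
    sub { $_[0] !~ m{^http://} && $_[0] =~ /\.html?$/ },   # local HTML only
);

for my $url (sort keys %$bad) {
    print "BAD: $url (used on: ",
          join(', ', sort keys %{ $refs->{$url} }), ")\n";
}
```

With this sample data the loop reports gone.html and missing.html as bad, and it never looks inside the external page because that page is not in the local space.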
The complete code for checklinks is found in Listing 1. You will need Perl 5 to run it. You will also need a recent copy of the LWP library. When I installed LWP, I also had to update the IO and Net modules. You can find Perl, and the LWP, IO and Net modules at http://www.perl.com/perl.
You can invoke checklinks on a single URL with the command:
checklinks url
If you wish to scan all local URLs reachable from the main URL, add a -r option.
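For readers curious how a script might accept such a command line, here is one possible sketch using the standard Getopt::Std module. The parse_args helper is hypothetical, and Listing 1 may handle its arguments differently.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: returns (recursive flag, url) for a list of arguments.
sub parse_args {
    my @args = @_;
    local @ARGV = @args;          # getopts works on @ARGV
    my %opts;
    getopts('r', \%opts) or die "usage: checklinks [-r] url\n";
    die "usage: checklinks [-r] url\n" unless @ARGV == 1;
    return ($opts{r} ? 1 : 0, $ARGV[0]);
}

my ($recursive, $url) = parse_args('-r', 'http://w3.one.net/~jweirich/');
print "recursive=$recursive url=$url\n";
```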
Running checklinks on my home system against my entire set of web pages took about 13 minutes to complete. Most of that time was spent waiting for the bad URLs to time out. It scanned 76 pages, checked 289 URLs, and found 31 links that were no longer valid. Now all I have to do is find the time to clean up my web pages!
Jim Weirich is a software consultant for Compuware specializing in Unix and C++. When he is not working on his web pages, you can find him playing guitar, playing with his kids, or playing with Linux. Comments are welcome at jweirich@one.net or visit http://w3.one.net/~jweirich.