Working with LWP
Most of the time, this column discusses ways in which we can improve or customize the work done by web servers. Whether we are working with CGI programs or mod_perl modules, we are usually looking at things from the server's perspective.
This month, we will look at LWP, the “library for web programming” available for Perl, along with several associated modules. The programs we will write will be web clients, rather than web servers. Server-side programs receive HTTP requests and generate HTTP responses; our programs this month will generate the requests and wait for the responses generated by the server.
As we examine these modules, we will gain a better understanding of how HTTP works, as well as how to use the various modules of LWP to construct all sorts of programs that retrieve and sort information stored on the Web.
HTTP, the “hypertext transfer protocol”, makes the Web possible. HTTP is one of many protocols used on the Internet and is considered a high-level protocol, alongside SMTP (the simple mail transfer protocol) and FTP (file transfer protocol). These are considered high-level protocols because they sit on a foundation of lower-level protocols that handle the more mundane aspects of networking. HTTP messages don't have to worry about handling dropped packets and routing, because TCP and IP take care of such things for it. If there is a problem, it will be taken care of at a lower level.
Dividing problems up in this way allows you to concentrate on the important issues, without being distracted by the minute details. If you had to think about your car's internals every time you wanted to drive somewhere, you would quickly find yourself concentrating on too many things at once and unable to perform the task at hand. By the same token, HTTP and other high-level protocols can ignore the low-level details of how the network operates, and simply assume the connection between two computers will work as advertised.
HTTP operates on a client-server model, in which the computer making the request is known as the client, and the computer receiving the request and issuing a response is the server. In the world of HTTP, servers never speak before they are spoken to—and they always get the last word. This means a client's request can never depend on the server's response; a client interested in using a previous response to form a new request must open a new connection.
Given all of that theory, how does HTTP work in practice? You can experiment for yourself, using the simple telnet command. telnet is normally used to access another computer remotely, by typing:
telnet remotehost
That demonstrates the default behavior, in which telnet opens a connection to port 23, the standard port for such access. You can use telnet to connect to other ports as well, and if there is a server running there, you can even communicate with it.
Since HTTP servers typically run on port 80, I can connect to one with the command:
telnet www.lerner.co.il 80
I get the following response on my Linux box:
Trying 209.239.47.145...
Connected to www.lerner.co.il.
Escape character is '^]'.

Once we have established this connection, it is my turn to talk. I am the client in this context, which means I must issue a request before the server will issue any response. HTTP requests consist, at minimum, of a method, an object on which to apply that method, and an HTTP version number. For instance, we can retrieve the contents of the file at / by typing
GET / HTTP/1.0

This indicates we want the file at / to be returned to us, and that the highest-numbered version of HTTP we can handle is HTTP/1.0. If we were to indicate that we support HTTP/1.1, an advanced server would respond in kind, allowing us to perform all sorts of nifty tricks.
If you pressed return after issuing the above command, you are probably still waiting to receive a response. That's because HTTP/1.0 introduced the idea of “request headers”, additional pieces of information that a client can pass to a server as part of a request. These client headers can include cookies, language preferences, the previous URL this client visited (the “referer”) and many other pieces of information.
Because we will stick with a simple GET request, we press return twice after our one-line command: once to end the first line of our request, and once more to indicate we have nothing more to send. As with e-mail messages, a blank line separates the headers (information about the message) from the message itself.
After typing return a second time, you should see the contents of http://www.lerner.co.il/ returned to you. Once the document has been transferred to your terminal, the connection is terminated. If you want to contact the same server again, you may do so; however, you will have to open a new connection and issue a new request.
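If you would rather script this experiment than type it into telnet, here is a short sketch (not one of the column's listings) that sends the same request over a raw TCP socket, using Perl's standard IO::Socket::INET module:

#!/usr/bin/perl -w
use strict;
use IO::Socket::INET;

my $socket = IO::Socket::INET->new(
    PeerAddr => "www.lerner.co.il",
    PeerPort => 80,
    Proto    => "tcp",
) or die "Cannot connect: $!";

# Send the request line, then a blank line to say we are done
print $socket "GET / HTTP/1.0\r\n\r\n";

# Print everything the server sends back: the headers,
# a blank line, and then the document itself
print while <$socket>;

close $socket;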
Just as the client can send request headers before the request itself, the server can send response headers before the response. As in the case with request headers, there must be a blank line between the response headers and the body of the response.
Here are the headers I received after issuing the above GET request:
HTTP/1.1 200 OK
Date: Thu, 12 Aug 1999 19:36:44 GMT
Server: Apache/1.3.6 (UNIX) PHP/3.0.11 FrontPage/3.0.4.2 Rewrit/1.0a
Connection: close
Content-Type: text/html
The above lines are typical for a response.
The first line provides general information about the response, including an indication of what is yet to come. First, the server tells us it is capable of handling anything up to HTTP/1.1. If we ever want to send a request using HTTP/1.1, this server will allow it. After the HTTP version number comes a response code. This code can indicate a variety of possibilities, including whether everything went just fine (200), the file has moved permanently (301), the file was not found (404), or there was an error on the server side (500).
The numeric code is typically followed by a text message, which gives a human-readable indication of the meaning behind the number. Apache and other servers might allow us to customize the page displayed when an error occurs, but that customization does not extend to the response code itself, which is standard and fixed.
Following the response code comes the date on which the response was generated. This header is useful for proxies and caches, which can store the date of a document along with its contents. The next time your browser tries to retrieve a file, it can compare the Date: header from the previous response, retrieving the new version only if the server's copy is newer.
The server identifies itself in the Server: header. In this particular case, the server tells us not only that it is Apache 1.3.6 running under a form of UNIX (in this case, Linux), but also some modules that have been installed. My web-space provider has chosen to install PHP, FrontPage and Rewrit; as we have seen in previous months, mod_perl is another popular module for server-side programming, and one which advertises itself in this header.
As we have seen, an HTTP connection terminates after the server finishes sending its response. This can be extremely inefficient; consider a page of HTML that contains five IMG tags, indicating where images should be loaded. In order to download this page in its entirety, a web browser has to create six separate HTTP connections: one for the HTML and one for each of the images. To overcome this inefficiency, HTTP/1.1 allows for “persistent connections”, meaning that more than one document can be retrieved in a single HTTP transaction. This is signalled with the Connection header; in the example above, the server sent Connection: close, indicating that it was ready to close the connection after a single transaction.
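As an aside, and only as a sketch rather than something from this column's listings: if your installed version of LWP supports it, LWP::UserAgent accepts a keep_alive option that caches connections, so several requests to the same server can reuse one TCP connection. The two paths below are simply examples:

use strict;
use LWP::UserAgent;
use HTTP::Request;

# keep_alive asks the user agent to cache connections for reuse,
# assuming the installed LWP supports this option
my $ua = LWP::UserAgent->new(keep_alive => 1);

foreach my $path ("/", "/atf/") {
    my $response = $ua->request(
        HTTP::Request->new(GET => "http://www.lerner.co.il$path")
    );
    print $response->status_line, "\n";
}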
The final header in the above output is Content-type, well-known to CGI programmers. This header uses a MIME-style description to tell the browser what kind of content to expect. Should it expect HTML-formatted text (text/html)? Or a JPEG image (image/jpeg)? Or something that cannot be identified, which should be treated as binary data (application/octet-stream)? Without such a header, your browser will not know how to treat the data it receives, which is why servers often produce error messages when Content-type is missing.
HTTP/1.0 supports several request methods, but the main ones are GET, HEAD, and POST. GET, as its name implies, allows us to retrieve the contents of a link. This is the most common method, and is behind most of the simple retrievals your web browser performs. HEAD is the same as GET, except that the server sends back only the response headers, without the document itself. Sending a request of
HEAD / HTTP/1.0
is a good way to test your web server and see if it is running correctly.
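LWP makes the same check easy from a script. Here is a short sketch (not one of the column's listings) using the head function exported by LWP::Simple, which in list context returns a few of the response headers:

use strict;
use LWP::Simple;

# In list context, head returns several header values;
# in scalar context, it simply returns true on success.
my ($content_type, $document_length, $modified_time, $expires, $server) =
    head("http://www.lerner.co.il/");

if (defined $content_type) {
    print "The server ($server) is up, serving $content_type\n";
}
else {
    print "The HEAD request failed\n";
}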
POST not only names a path on the server's computer, but also sends input in name-value pairs. (GET can also submit information in name-value pairs, but it is considered less desirable in most situations.) POST is usually invoked when a user clicks on the “submit” button in an HTML form.
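To give a rough idea of how this looks from the client side, the HTTP::Request::Common module that comes with LWP can build a POST request for us. This is only a sketch; the URL and the form field names below are invented for the example:

use strict;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $ua = LWP::UserAgent->new;

# The URL and field names here are made up for illustration
my $request = POST "http://www.example.com/cgi-bin/register.pl",
    [ first_name => "Reuven", last_name => "Lerner" ];

my $response = $ua->request($request);
print $response->is_success ? $response->content : $response->status_line, "\n";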
Now that we have an understanding of the basics behind HTTP, let's see how we can handle requests and responses using Perl. Luckily, LWP contains objects for nearly everything we might want to do, with code tested by many people.
If we simply want to retrieve a document using HTTP, we can do so with the LWP::Simple module. Here, for instance, is a simple Perl program that retrieves the root document from my web site:
#!/usr/bin/perl -w

use strict;
use diagnostics;
use LWP::Simple;

# Get the contents
my $content = get "http://www.lerner.co.il/";

# Print the contents
print $content, "\n";
In this particular case, the startup and diagnostics code is longer than the program. Importing LWP::Simple into our program automatically brings the get function with it, which takes a URL, retrieves its contents with GET, and returns the body of the response. In this example, we print that output to the screen.
Once the document's contents are stored in $content, we can treat it as a normal Perl scalar, albeit one containing a fair amount of text. At this point, we could search for interesting text, perform search-and-replace operations on $content, remove any parts we find offensive, or even translate parts into Pig Latin. As an example, the following variation of this simple program turns the contents around, reversing every line so that the final line becomes the first line and vice versa; and every character on every line so that the final character becomes the first and vice versa:
#!/usr/bin/perl -w

use strict;
use diagnostics;
use LWP::Simple;

# Get the contents
my $content = get "http://www.lerner.co.il/";

# Print the contents, reversed
print scalar(reverse $content), "\n";
Note how we must put reverse in scalar context in order for it to do its job. Since print takes a list of arguments, we force scalar context with the scalar keyword.
There are times, however, when we will want to create more sophisticated applications, which in turn require more sophisticated means of contacting the server. Doing this will require a number of different objects, each with its own task.
First and foremost, we will have to create an HTTP::Request object. This object, as you can guess from its name, handles everything having to do with an HTTP request. We can create it most easily by saying:
use HTTP::Request;

my $request = new HTTP::Request("GET", "http://www.lerner.co.il");
where the first argument indicates the request method we wish to use, and the second argument indicates the target URL.
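If we also want to send some of the request headers mentioned earlier, HTTP::Request lets us set them before the request goes out. Here is a short sketch; the header values are examples only:

# These header values are examples only
$request->header("Accept-Language" => "en");
$request->header("Referer" => "http://www.lerner.co.il/atf/");

# as_string shows the request roughly as it will be sent to the server
print $request->as_string;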
Once we have created an HTTP request, we need to send it to the server. We do this with a “useragent” object, which acts as a go-between in this exchange. We have already looked at the LWP::Simple useragent in our example programs above.
Normally, a useragent takes an HTTP::Request object as an argument, and returns an HTTP::Response object. In other words, given $request as defined above, our next two steps would be the following:
use LWP::UserAgent;

my $ua = new LWP::UserAgent;
my $response = $ua->request($request);
After we have created an HTTP::Response and assigned it to $response, we can perform all sorts of tests and operations.
For starters, we probably want to know the response code we received as part of the response, to ensure that our request was successful. We can get the response code and the accompanying text message with the code and message methods:
my $response_code = $response->code;
my $response_message = $response->message;
If we then say:
print qq{Code: "$response_code"\n};
print qq{Message: "$response_message"\n};

we will get the output:
Code: "200"
Message: "OK"

This is well and good, but it presents a bit of a problem: how do we know how to react to different response codes? We know that 200 means everything was fine, but we must build up a table of values in order to know which response codes mean we can continue, and which mean the program should exit and signal an error.
The is_success method for HTTP::Response handles this for us. With it, we can easily check to see if our request went through and if we received a response:
if ($response->is_success) {
    print "Success.\n";
}
else {
    print "Error: ", $response->status_line, "\n";
}
The status_line method combines the output from code and message to produce a numeric response code and its printed description.
We can examine the response headers with the headers method. This returns an instance of HTTP::Headers, which offers many convenient methods that allow us to retrieve individual header values:
my $headers = $response->headers;

print "Content-type: ", $headers->content_type, "\n";
print "Content-length: ", $headers->content_length, "\n";
print "Date: ", $headers->date, "\n";
print "Server: ", $headers->server, "\n";
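If we would rather see every header than ask for each one by name, HTTP::Headers can hand us the whole set. A small sketch:

# as_string returns the headers in the same form the server sent them
print $headers->as_string;

# scan calls a subroutine once for each (name, value) pair
$headers->scan(sub {
    my ($name, $value) = @_;
    print "$name: $value\n";
});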
Of course, the Web is not very useful without the contents of the documents we retrieve. HTTP::Response has only one method for retrieving the content of the response, unsurprisingly named content. We can thus say:
my $content = $response->content;
At this point, we are back to where we were with our LWP::Simple example earlier: We have the content of the document inside of $content, which stores the text as a normal string.
If we were interested in using HTTP::Request and HTTP::Response to reverse a document, we could do it as shown in Listing 1. If you were really interested in producing a program like this one, you would probably stick with LWP::Simple and use the get function described there. There is no point in weighing down your program, as well as adding all sorts of method calls, if the goal is simply to retrieve the contents of a URL.
The advantage of using a more sophisticated user agent is the additional flexibility it offers. Whether that flexibility is worth the tradeoff in complexity will depend on your needs.
For example, many sites have a robots.txt file in their root directory. Such a file tells “robots”, or software-controlled web clients, which sections of the site should be considered out of bounds. These files carry no technical force, but honoring them is a long-standing convention that should be followed. Luckily, LWP includes the object LWP::RobotUA, which is a user agent that automatically checks the robots.txt file before retrieving a file from a web site. If a file is excluded by robots.txt, LWP::RobotUA will not retrieve it.
LWP::RobotUA also differs from LWP::UserAgent in its attempt to be gentle to servers, by sending only one request per minute. We can change this with the delay method, although doing so is advisable only when you are familiar with the site and its ability to handle a large number of automatically generated requests.
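Putting that together, here is a sketch (not one of the column's listings) of creating a robot user agent and adjusting its delay; the robot name and contact address are placeholders:

use strict;
use LWP::RobotUA;
use HTTP::Request;

# The robot name and contact address are placeholders
my $ua = LWP::RobotUA->new("my-robot/0.1", 'me@example.com');

# Wait half a minute between requests instead of the default full minute
$ua->delay(0.5);

my $response = $ua->request(
    HTTP::Request->new(GET => "http://www.lerner.co.il/")
);
print $response->status_line, "\n";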
Once we have retrieved the content from a web site, what can we do with it? As demonstrated above, we can print it out or play with the text. But many times, we want to analyze the tags in the document, picking out the images, the hyperlinks or even the headlines.
In order to do this, we could use regular expressions and m//, Perl's matching operator. But an easier way is to use HTML::LinkExtor, another object designed for exactly this purpose. Once we create an instance of HTML::LinkExtor, we can use its parse method to work through each of the link tags it finds.
HTML::LinkExtor works differently from many modules you might have used before, in that it uses a “callback”. In this case, a callback is a subroutine defined to take two arguments: a scalar containing the name of the tag and a hash containing the name-value pairs associated with that tag. The subroutine is invoked each time HTML::LinkExtor finds a tag.
For example, given the HTML
<input type="text" value="Reuven" name="first_name" size="5">
our callback would have to be prepared to handle a scalar of value input, with a hash that looks like
(type => "text", value => "Reuven", name => "first_name", size => "5")
If we are interested in printing the various HTML tags to the screen, we could write a simple callback that looks like Listing 2. How do we tell HTML::LinkExtor to invoke our callback subroutine each time it finds a match? The easiest way is to hand a reference to &callback to HTML::LinkExtor's constructor as an argument.
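Listing 2 itself is not reproduced in the text here; as a rough sketch only, and not the column's exact listing, a callback that prints each tag and its attributes might look like this:

sub callback {
    my ($tag, %attributes) = @_;

    print "Tag: $tag\n";
    foreach my $name (sort keys %attributes) {
        print "    $name = $attributes{$name}\n";
    }
}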
Perl allows us to pass subroutines and other blocks of code as if they were data by creating references to them. A reference looks and acts like a scalar, except that it can be dereferenced to recover the original item. Perl has scalar, array and hash references; subroutine references fit naturally into this picture as well. HTML::LinkExtor will dereference our subroutine and invoke it just as we have defined it.
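As a quick illustration of the idea (not taken from the column), here is a subroutine reference being created and then called:

sub greet {
    my ($name) = @_;
    print "Hello, $name!\n";
}

my $greet_ref = \&greet;    # the backslash creates the reference
$greet_ref->("world");      # calling through the reference prints "Hello, world!"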
We turn a subroutine into a subroutine reference by prefacing its name with \&. Perl 5 no longer requires that you put & before subroutine names, but it is required when you are passing a subroutine reference. The backslash tells Perl we want to turn the object in question into a reference. If &callback is defined as above, then we can print out all of the links in a document with the following:
use HTML::LinkExtor;

my $parser = HTML::LinkExtor->new(\&callback);
$parser->parse($response->content);
Note that our callback receives each link exactly as it appears in the HTML returned with the HTTP response. However, that response undoubtedly contained some relative URLs, which will not be interpreted correctly out of context. How can we accurately reconstruct each link?
HTML::LinkExtor takes that into account, and allows us to pass two arguments to its constructor (new), rather than just one. The second argument, which is optional, is the URL from which we received this content. Passing this URL ensures all URLs we extract will be complete. We must include the line
use URI::URL;
in our application if we want to use this feature. We can then say
my $parser = HTML::LinkExtor->new(\&callback, "http://www.lerner.co.il/");
$parser->parse($response->content);

and our callback will be invoked for each tag, with a full, absolute URL even if the document contains a relative one.
Our version of &callback above prints out all links, not just hyperlinks. By modifying &callback slightly, as shown in Listing 3, we can ignore all but “anchor” tags, the tags that create hyperlinks.
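Listing 3 is likewise not reproduced here; a sketch along those lines, which keeps only anchor tags and prints the hyperlink each one contains, might look like this:

sub callback {
    my ($tag, %attributes) = @_;

    return unless $tag eq "a";              # ignore everything but anchors
    print $attributes{href}, "\n" if exists $attributes{href};
}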
With all of this under our belts, we will write an application (Listing 4) that follows links recursively until the program is stopped. This sort of program can be useful for checking links on your site or harvesting information from documents.
Our program, download-recursively.pl, starts at the URL called $origin and collects the URLs contained within it, placing them in the hash %to_be_retrieved. It then goes through each of those URLs one by one, collecting any hyperlinks that might be contained within them. Each time it retrieves a URL, download-recursively.pl places it in %already_retrieved. This ensures we will not download the same URL twice.
We create $ua, our instance of LWP::RobotUA, outside of the “while” loop. After all, our HTTP requests and responses will be changing with each loop iteration, but our user agent can remain the same throughout the life of the program.
We go through each URL in %to_be_retrieved in a seemingly random order, taking the first item returned by keys. It is obviously possible to sort the keys before taking the first element from the resulting list or to do a depth-first or breadth-first search through the list of URLs.
Inside the loop, the code is as we might expect: we create a new instance of HTTP::Request and pass it to $ua, receiving a new instance of HTTP::Response in return. Then we parse the response content with HTML::LinkExtor, putting each new URL in %to_be_retrieved, but only on the condition that it is not already a key in %already_retrieved.
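Listing 4 itself is not included in the text here. As a rough sketch only, under the structure just described (and with a placeholder robot name and contact address), such a program might look like this:

#!/usr/bin/perl -w
use strict;
use LWP::RobotUA;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL;

my $origin = "http://www.lerner.co.il/";

my %already_retrieved = ();
my %to_be_retrieved   = ($origin => 1);

# One robot user agent for the life of the program;
# the name and contact address are placeholders
my $ua = LWP::RobotUA->new("my-robot/0.1", 'me@example.com');

while (my ($url) = keys %to_be_retrieved) {
    delete $to_be_retrieved{$url};
    next if $already_retrieved{$url};

    my $response = $ua->request(HTTP::Request->new(GET => $url));
    $already_retrieved{$url} = 1;
    next unless $response->is_success;

    print "Retrieved $url\n";

    # Collect hyperlinks, resolving relative URLs against $url
    my $parser = HTML::LinkExtor->new(
        sub {
            my ($tag, %attributes) = @_;
            return unless $tag eq "a" and $attributes{href};

            my $link = $attributes{href};
            return unless $link =~ m{^http}i;   # follow only HTTP links

            $to_be_retrieved{$link} = 1
                unless $already_retrieved{$link};
        },
        $url
    );
    $parser->parse($response->content);
}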
You may find it interesting to let this program go for a while, following links from one of your favorite sites. The Web is all about linking; see who is linking to whom. You might be surprised by what you find.
Our whirlwind tour of HTTP and LWP stops here. As you can see, there is not very much to learn about either of them; the trick is to use them to create interesting applications and avoid the pitfalls when working with them. LWP is a package I do not often need, but when I do need it, I find it indispensable.
Reuven M. Lerner is an Internet and Web consultant living in Haifa, Israel. His book Core Perl will be published by Prentice-Hall later this year. Reuven can be reached at reuven@lerner.co.il. The ATF home page is at http://www.lerner.co.il/atf/.