Perl and Sockets
Perl works well for writing prototypes or full-fledged applications because it is so complete. The language seldom needs to be extended to do the sorts of things you'd expect to do with C or C++ on a Linux system. One notable example is the Berkeley socket functions, which Perl included even back when the Internet was just a cool bit of technology rather than a global cultural phenomenon.
Sockets are a general-purpose inter-process communication (IPC) mechanism. When processes run on separate machines, they can employ sockets to communicate using Internet protocols. This is the basis for most Internet clients and servers. Many Internet protocols are based on exchanging simple text; so is much of the interesting content. Since Perl excels at processing text, it's ideal for writing applications like web servers or any type of client which parses or searches for text. In this article, we develop a very simple client that searches for regular expressions on specified web sites—a not-so-intelligent agent, you might say.
I assume the reader has no prior knowledge of sockets, but if you have used the socket functions in C, they'll look quite familiar in Perl. The basic functions include socket, connect, bind, listen, and accept. Perl also has versions of functions like gethostbyname and getprotobyname, which make socket communication much easier. These Perl functions, of course, eventually invoke the C versions, so the argument lists are quite similar. The only differences arise because Perl file handles aren't the same as C file descriptors (which are just integers) and the Perl versions don't need the additional lengthy arguments for strings or structures.
We'll discuss the details of the socket functions needed for an Internet client later, but let's first look briefly at the normal sequence of operations for Internet communication. The server first establishes a socket with the socket function, which returns a socket descriptor much like a file descriptor. The server next assigns the socket an address with bind, then tells the system that it is willing to receive connections with the listen function. The accept function can block until a client connects to the server. The client program also calls socket and gets a socket descriptor. The client connects to the address specified by the server's bind call using the connect function. If all goes well, the client can read and write to the socket descriptor just as if it were a file descriptor. Refer to Listing 2 to see how the socket and connect functions are used in a typical program.
As mentioned above, a client program must first call socket to get a socket descriptor or, in the case of Perl, a file handle. This function specifies a particular communications protocol to use and sets up an endpoint for communication—that is, a place to plug in a connection—a “socket”, for lack of a better term. The syntax of this function is:
socket SOCKET, DOMAIN, TYPE, PROTOCOL
SOCKET is the file handle. DOMAIN and TYPE are integers that specify the address domain (or family) and the socket type. In Perl 4, you had to set these numbers explicitly, but Perl 5 defines them in the Socket module. To access the Socket module, add the following line to the top of your program:
use Socket;For Internet applications, set DOMAIN to AF_INET (usually 2) and TYPE to SOCK_STREAM (usually 1). This basically means the address of the server will have the familiar Internet form (e.g., 192.42.55.55) and you'll read from and write to the socket like any I/O stream. You can set the PROTOCOL argument to 0 for most applications, but it's easy to get the correct value with the getprotobyname function.
Next, you need to connect to the server with the connect function. This can get a bit tricky in Perl if you don't have the most recent versions of the Socket module, primarily because it's hard to specify the server's address. The syntax of the connect function is:
connect SOCKET, NAME
SOCKET is the file handle created by the socket function, so that's easy. The NAME argument, however, is described as a “packed network address of the proper type for the socket”, which might leave you scratching your head if you're not already familiar with sockets. For Internet applications, the proper type of network address for the C version of the connect function is given by structures something like those in Listing 1 (from either <netinet/in.h> or <linux/in.h>).
With a bit of scrutiny, you can see you need to pack three pieces of information into a binary structure 16 bytes long. First you need the address family, which is AF_INET, the same as the DOMAIN argument to socket. The second piece is the port number of the server socket. Most common servers have what's called a “well-known” port number (in the case of HTTP servers, this is 80), but an application should have some method of indicating alternate port numbers. Finally, you need to know the Internet address of the server. From the structures above, you can tell this is a 32-bit value. Fortunately, if you know the Internet name of the server (e.g., www.linux.com) you can get the address with the gethostbyname function. Once you've assembled this information, you can create the NAME argument with the Perl pack function. The code might look something like this:
$sockaddr_in = 'S n a4 x8'; $in_addr = (gethostbyname("www.linux.com"))[4]; $server_addr = pack( $sockaddr_in, AF_INET, 80, $in_addr );
Recent versions of Perl (5.002 and later) greatly simplify this whole process with the sockaddr_in function from the Socket module. This function takes the port number and the Internet address of the server and returns the appropriate packed structure. I use this technique in our mini-client in Listing 2. If you need portability, or simply want readability, I strongly recommend using Perl version 5.002 or later.
So we've finally set up our socket and made a connection to the server. Now things get considerably easier because we can treat the socket like any other file handle. The only wrinkle is we want to make sure anything we write to the socket is not buffered, because it needs to get to the server before we can read the server's response. For this we use the Perl select function, which sets the file handle to use for standard output. Note in Listing 2 that the socket file handle is selected; then the special variable $| is set to 1 to force a buffer flush after every write; then STDOUT is re-selected.
Now our client can send a request to the server. This application just sends a GET command to the HTTP server so that it will return the page specified by the URL. Once the command is sent, we read anything arriving at the socket line-by-line and look for the patterns we've specified. You could do anything you wanted with the HTML returned from the server, even parsing it or looking for other hypertext links to follow.
It will come as no shock that there are many aspects of sockets we haven't covered. In particular, I haven't discussed writing servers (mainly to keep this article to a manageable length). If you want to learn more about writing Internet servers in Perl, I recommend reading Programming Perl by Wall, Christiansen, and Schwarz (commonly called “the Camel book”). Perl also contains several socket functions I haven't mentioned, including send and recv, that can be used like write and read calls, and sendto and recvfrom, which are used for so-called “connectionless” communications. Again, see the Camel book for details on these functions, and for network communication in general, I recommend Unix Network Programming by W. Richard Stevens. Also, don't forget that many Perl Internet applications live out there on the Internet already, so look to these for examples. I particularly recommend tinyhttpd, a very compact HTTP server, as a good way to learn how to construct servers (see http://www.inka.de/~bigred/sw/tinyhttpd.html).
Mike Mull writes software to simulate sub-microscopic objects. Stranger still, people pay him to do this. Mike thinks Linux is nifty. His favorite programming project is his 2-year-old son, Nathan. Mike can be reached at mwm@cts.com.