PHP as a General-Purpose Language

by Marco Tabini

If PHP is your scripting language of choice when it comes to developing dynamic Web sites, you probably have grown to love its immediacy and power. An estimated ten million Web sites use at least some PHP scripting to generate their pages.

Although most people use PHP primarily as a Web development scripting system, it possesses all the characteristics of a proper general-purpose language that can be useful in a variety of other environments. In this article, I illustrate how it's possible to use the command-line version of PHP to perform complex shell operations, such as manipulating data files, reading and parsing remote XML documents and scheduling important tasks through cron.

The contents of this article are based on the latest version of PHP at the time of this writing, 4.3.0, which was released at the end of 2002. However, you should be able to use older versions of PHP 4 without many problems. I explain the differences you may encounter as necessary.

PHP-CLI

With the release of PHP 4.3, a new version of the interpreter called command-line interface (or PHP-CLI) is available. PHP-CLI is not a shell as the name implies but, rather, a version of PHP designed to run from the shell. As far as software development is concerned, only a few differences exist between PHP-CLI and its CGI or server API (SAPI) counterparts. For one thing, traditional Apache server variables are not available, as Apache isn't even in the picture, and the HTTP headers are not output when a script is executed. Also, the engine does not use output buffering, because it would be of no benefit in a non-Web environment.
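A script that may run under either interface can check which one is in use. The following is a minimal sketch: php_sapi_name() is a standard PHP function, while the helper is_cli_sapi() is a hypothetical name introduced only for this illustration.

```php
<?php
// Hypothetical helper: returns true when the given SAPI name
// identifies the command-line build of PHP.
function is_cli_sapi($sapi_name)
{
  return $sapi_name == 'cli';
}

// php_sapi_name() reports which interface is running the script
if (is_cli_sapi(php_sapi_name()))
  echo "Running under PHP-CLI\n";
else
  echo "Running under the " . php_sapi_name() . " interface\n";
```

A check like this lets a script adapt its behavior, for example by skipping anything that depends on Apache server variables.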

PHP-CLI is created by default when you compile your version of PHP, unless you use the --disable-cli switch when you execute the configuration script. It is not, however, installed by default, but you can force make to compile and install it by using a special command:

make install-cli

To verify whether the CLI version of PHP is installed on your server, all you need to do is execute this command:

php -v

The resulting version information should specify whether the CLI or CGI version of PHP is being executed. If you have only the CGI version and don't want to install the CLI, you still can use PHP as a shell-scripting language. The differences between the two are mostly aesthetic, and their effect can be toned down somewhat by using the right command-line switches when invoking the interpreter.

Parsing an RSS Feed

Being a lover of weblogging, I routinely visit a certain number of blogs on the Net. This is a somewhat tedious process, because I don't like the idea of a news aggregator running on my machine on a continuous basis, and I do not see the need to pay for one. It seemed, though, that an RSS aggregator might be a great way to show how some of PHP's powerful features, such as the fopen() wrappers and the built-in XML parsing engine, could be used to create a script that runs from the command line.

An RSS feed is, essentially, a simple XML document that contains information about items published by a news source, such as Linux Journal. Its format consists of a channel container that includes several optional elements, such as a title and description, in addition to a set of item subcontainers. Each of these, in turn, contains a title, a description and a link to the news story it represents.

Typically, a news aggregator loads the information from an arbitrary number of news feeds and presents everything together in a given format, such as HTML. For users, a news aggregator represents a convenient way to create a single point of information for all the news sources of interest.

My PHP-based news aggregator, called Feeder and shown in Listing 1, presents its results in a plain-text e-mail that is sent to the user who executes the script. Feeder loads a list of RSS feeds from a file located in ~/.feeder.rc (Listing 2). The first line of this file contains the e-mail address to which the news feed data should be sent; each remaining line contains the location of one feed. The contents of the configuration file are loaded using a simple trick: the back-tick operator, which performs exactly the same function as it does in the shell, is used to call the cat command. The output is then split into an array of individual lines using the explode function.
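The way the configuration data is split up can be sketched in isolation. This is only an illustration: parse_feeder_config() is a hypothetical helper that does not appear in Feeder, and the address and feed locations are placeholders. In the real script, the raw text comes from the back-tick call to cat rather than from a string.

```php
<?php
// Hypothetical helper: splits the raw text of a Feeder-style
// configuration file into the recipient address (first line)
// and the feed locations (all remaining lines).
function parse_feeder_config($raw)
{
  $lines = explode("\n", trim($raw));
  $address = array_shift($lines);

  return array('address' => $address, 'feeds' => $lines);
}

// Placeholder contents; Feeder itself would obtain this text
// with the back-tick operator: `cat ~/.feeder.rc`
$raw = "user@example.com\n" .
       "http://example.com/news.rss\n" .
       "http://example.org/blog.rss\n";

$config = parse_feeder_config($raw);

echo "Mail to: {$config['address']}\n";
foreach ($config['feeds'] as $feed)
  echo "Feed: $feed\n";
```

The call to trim() matters here: without it, the newline at the end of the file would produce an empty final entry in the list of feeds.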

Listing 1. Feeder, an RSS Aggregator


<?php

// Classes used internally to parse the XML
// data

class CItem
{
  var $title;
  var $description;
  var $url;
}

class CFeed
{
  var $title;
  var $description;
  var $url;

  var $items;

  var $currentitem;
}

// XML handlers

function ElementStarter($parser, $name, $attrs)
{
  global $currentelement;
  global $elements;

  $elements[$currentelement ++] = $name;
}

function ElementEnder($parser, $name)
{
  global $elements;
  global $currentelement;
  global $currentfeed;

  if ($name == 'ITEM')
  {
    $currentfeed->items[] =
           $currentfeed->currentitem;
    $currentfeed->currentitem = new CItem;
  }

  $currentelement--;
}

function DataHandler ($parser, $data)
{
  global $elements;
  global $currentelement;
  global $currentfeed;

  switch ($elements[$currentelement - 1])
  {
  case  'TITLE' :

    // Character data may arrive in several pieces,
    // so always append
    if ($elements[$currentelement - 2] == 'ITEM')
      $currentfeed->currentitem->title .= $data;
    else
      $currentfeed->title .= $data;

    break;

  case  'LINK'  :

    if ($elements[$currentelement - 2] == 'ITEM')
      $currentfeed->currentitem->url .= $data;
    else
      $currentfeed->url .= $data;

    break;

  case 'DESCRIPTION'    :

    if ($elements[$currentelement - 2] == 'ITEM')
      $currentfeed->currentitem->description
                  .= $data;
    else
      $currentfeed->description .= $data;

    break;
  }
}

// Feed loading function

function get_feed ($location)
{
  global $elements;
  global $currentelement;
  global $currentfeed;

  $xml_parser = xml_parser_create();

  $elements = array();
  $currentelement = 0;
  $currentfeed = new CFeed;
  $currentfeed->currentitem = new CItem;

  xml_parser_set_option
    ($xml_parser, XML_OPTION_CASE_FOLDING, true);
  xml_set_element_handler
    ($xml_parser, "ElementStarter", "ElementEnder");
  xml_set_character_data_handler
    ($xml_parser, "DataHandler");

  if (!($fp = fopen($location, "r")))
  {
    xml_parser_free($xml_parser);
    return 'Unable to open location';
  }

  while ($data = fread($fp, 4096))
  {
    if (!xml_parse($xml_parser, $data, feof($fp)))
    {
      fclose($fp);
      xml_parser_free($xml_parser);
      return 'XML PARSE ERROR';
    }
  }
  fclose($fp);
  xml_parser_free($xml_parser);

  return $currentfeed;
}

// Feed formatting function

function format_feed ($feed, $url)
{

  if (!is_object ($feed))
  {
    $res = "Error loading feed at: $url.\n" .
           "$feed\n\n";
  }
  else
  {
    $res = "{$feed->title}\n[{$feed->url}]\n\n";

    foreach ($feed->items as $item)
    {
      $res .= "{$item->title}\n[{$item->url}]\n\n" .
        wordwrap ($item->description, 70) . "\n\n" .
        str_repeat ('-', 70) . "\n\n";
    }
  }

  return $res;
}

// Load up configuration file

$data = explode ("\n", trim (`cat ~/.feeder.rc`));

// The first line is the address, so the loop
// below starts at the second line

$result = '';

// Cycle through and get all the feeds

for ($i = 1; $i < count ($data); $i++)
  $result .= format_feed
    (get_feed ($data[$i]), $data[$i]);

// Mail them out to the user

mail ($data[0], 'Feeder update', $result);

?>


Listing 2. The Configuration File for Feeder




The parsing of the XML feed happens in two phases. First, the get_feed function uses the fopen() wrappers to download the feed in 4KB chunks. Second, each chunk is passed on to an instance of the built-in PHP XML parser, which interprets its contents and calls ElementStarter(), ElementEnder() and DataHandler(), as needed. These three functions, in turn, build a structure of CFeed and CItem instances that represents the feed itself. The script then calls the format_feed function, which scans a feed object and produces a textual version of its contents. Once all the feeds have been parsed and formatted, the resulting message is e-mailed to the intended recipient.
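The interplay between chunked input and the three handlers can be seen in a much smaller sketch. It assumes the XML extension is available, and every name beginning with sketch_, as well as the sample feed, is hypothetical and exists only for this illustration. The handlers keep a stack of open elements, just as Listing 1 does, but collect only the titles of the items.

```php
<?php
// Start handler: push the element name; if it opens an item
// title, begin a new empty title.
function sketch_start($parser, $name, $attrs)
{
  global $sketch_stack, $sketch_titles;

  $sketch_stack[] = $name;
  if (sketch_in_item_title())
    $sketch_titles[] = '';
}

// End handler: pop the element that has just been closed
function sketch_end($parser, $name)
{
  global $sketch_stack;
  array_pop($sketch_stack);
}

// Data handler: text may arrive in pieces, so always append
function sketch_data($parser, $data)
{
  global $sketch_titles;

  if (sketch_in_item_title())
    $sketch_titles[count($sketch_titles) - 1] .= $data;
}

// True while the innermost open element is a TITLE inside an ITEM
function sketch_in_item_title()
{
  global $sketch_stack;

  $depth = count($sketch_stack);
  return $depth >= 2
      && $sketch_stack[$depth - 1] == 'TITLE'
      && $sketch_stack[$depth - 2] == 'ITEM';
}

// Feeds $xml to the parser in pieces of $chunk_size bytes, the
// way get_feed() feeds it the pieces returned by fread().
function sketch_item_titles($xml, $chunk_size)
{
  global $sketch_stack, $sketch_titles;

  $sketch_stack = array();
  $sketch_titles = array();

  $parser = xml_parser_create();
  xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, true);
  xml_set_element_handler($parser, 'sketch_start', 'sketch_end');
  xml_set_character_data_handler($parser, 'sketch_data');

  $length = strlen($xml);
  for ($offset = 0; $offset < $length; $offset += $chunk_size)
  {
    $is_final = ($offset + $chunk_size >= $length);
    $chunk = substr($xml, $offset, $chunk_size);
    if (!xml_parse($parser, $chunk, $is_final))
      return false;
  }

  return $sketch_titles;
}

// Placeholder feed with two items
$sample = '<rss version="2.0"><channel><title>Example feed</title>' .
          '<item><title>First</title></item>' .
          '<item><title>Second</title></item>' .
          '</channel></rss>';

foreach (sketch_item_titles($sample, 16) as $title)
  echo "$title\n";
```

Because a chunk boundary can fall in the middle of a title, the data handler may be called more than once for the same element, which is why it appends to the current title instead of assigning to it.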

As a security note, format_feed() uses the wordwrap function to format the description of a news item so it doesn't span more than 70 columns. This helps enhance the readability of the news feed by presenting the user with a more compact look. Prior to PHP 4.3.0, the source code for wordwrap() included an unchecked data buffer that could, in theory, be exploited to execute arbitrary code, thus presenting a security issue. If you're not using the latest version of PHP, you probably should either avoid using wordwrap() or replace it with your home-grown version.
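For readers who choose the home-grown route, a replacement could look something like the following sketch. The name simple_wrap() is hypothetical, and the function is deliberately simpler than wordwrap(): it collapses runs of whitespace, including existing line breaks, and it never splits a word that is longer than the line width.

```php
<?php
// Hypothetical stand-in for wordwrap(): greedy wrapping on
// whitespace. Words longer than $width are left unbroken.
function simple_wrap($text, $width = 70)
{
  $words = preg_split('/\s+/', trim($text));
  $lines = array();
  $line = '';

  foreach ($words as $word)
  {
    if ($line === '')
      $line = $word;                       // first word on the line
    else if (strlen($line) + 1 + strlen($word) <= $width)
      $line .= ' ' . $word;                // word still fits
    else
    {
      $lines[] = $line;                    // line is full
      $line = $word;
    }
  }

  if ($line !== '')
    $lines[] = $line;

  return implode("\n", $lines);
}

echo simple_wrap('aaa bbb ccc', 7) . "\n";
```

In format_feed(), the call to wordwrap ($item->description, 70) would then become a call to simple_wrap ($item->description, 70).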

Executing the Script

The easiest way to execute a script from the shell is to invoke the PHP interpreter explicitly:

marcot ~# php feeder.php

If you have the CGI version of PHP, you may want to use the -q switch, which causes the interpreter to omit any HTTP headers that are normally required during a Web transaction.

This explicit method, however, is not very practical if you want your users to access the scripts you write conveniently. A better solution consists of making the scripts executable, so they can be invoked directly, as if they were autonomous programs. To do this, first determine the exact location of your PHP executable:

marcot ~# which php
/usr/local/bin/php

The next step consists of creating a shebang, an initial line that instructs the shell interpreter to pipe the remainder of an executable file through a specific application (the PHP engine in our case). The shebang must be the first line of your script, and there can't be any whitespace before it. It starts with the character # and the character !, followed by the name of the executable through which the remainder of the file must be piped. For example, if you're using the CLI version of PHP, your shebang may look like this:

#!/usr/local/bin/php

If you're using the CGI version of the PHP interpreter, you also can pass additional options to it in order to keep it quiet and prevent the usual HTTP headers from being printed out:

#!/usr/local/bin/php -q

The final step consists of making your script executable:

marcot ~# chmod a+x feeder.php

At this stage, you can run the script without explicitly invoking the PHP interpreter; the shell will take care of that for you:

marcot ~# ./feeder.php

As you may have noticed, I have not renamed the script to remove the .php extension. Even though the extension itself is not necessary when running scripts from the shell, its presence makes it easy for text editors such as vim to recognize the file and highlight the source's syntax.

Running PHP Scripts through Cron

A news aggregator that must be invoked explicitly every time you want to read your news page is not very useful. Therefore, you may want to have your system run it automatically on a specific schedule. The cron dæmon generally is used for this purpose. cron is a simple dæmon that runs in the background and, at fixed intervals, reads through a special file, called crontab, that contains schedule specifications for each of the users on the server. Based on the information contained in the crontab file, cron executes an arbitrary number of shell commands and, optionally, sends an e-mail notification of their results to the user. The crontab file contains entries in the following format:

minute hour day month weekday command

The first five fields indicate the time or times at which a command must be executed. For example:

5 9 13 9 1 /usr/bin/feeder.php

means that the command /usr/bin/feeder.php will be executed at 9:05 AM on September 13 and also at 9:05 AM on every Monday (weekday 1) in September. When both the day and the weekday fields are restricted, cron runs the command if either one matches. This may sound complicated, but it's an extreme example. Most likely, you want to execute commands on a simpler schedule, like the beginning of every hour. This is accomplished by using the * wild card, which means any. So, for once an hour, on the hour, you would enter:

0 * * * * /usr/bin/feeder.php

And for once a day, at midnight, enter:

0 0 * * * /usr/bin/feeder.php

The time fields allow for even more complex specifications. For example, you can create a list of specific times by separating them with a comma:

0,30 * * * * /usr/bin/feeder.php

This crontab specification causes the command /usr/bin/feeder.php to be run twice an hour, on the hour and at half past. Similarly, you can specify inclusive ranges of times by separating the two endpoints with a dash. For example, the following crontab command:

0 0 * * 1-3 /usr/bin/feeder.php

causes the script to be executed at midnight, Monday through Wednesday.
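To make the field syntax concrete, here is a small sketch of how a single time field could be matched against a value. It is not how cron itself is implemented, the function name cron_field_matches() is hypothetical, and it handles only the three forms described above: the * wild card, comma-separated lists and dash-separated ranges.

```php
<?php
// Hypothetical helper: does the crontab field $spec allow $value?
// Supports "*", lists such as "0,30" and ranges such as "1-3".
function cron_field_matches($spec, $value)
{
  if ($spec === '*')
    return true;

  foreach (explode(',', $spec) as $part)
  {
    // A single number is treated as a range of one value
    $range = explode('-', $part);
    $low = (int) $range[0];
    $high = isset($range[1]) ? (int) $range[1] : $low;

    if ($value >= $low && $value <= $high)
      return true;
  }

  return false;
}

// Minute 30 is allowed by "0,30"; weekday 4 is not allowed by "1-3"
echo cron_field_matches('0,30', 30) ? "match\n" : "no match\n";
echo cron_field_matches('1-3', 4) ? "match\n" : "no match\n";
```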

In order to change the contents of your crontab file, you need to use the crontab utility (for example, by running crontab -e), which opens the correct file for editing and makes sure the dæmon picks up the changes to your schedule. There aren't any special requirements to run a PHP script as a cron job, as long as it does not expect any input from a user.

Manipulating HTML Code

Even though your PHP-CLI scripts are not outputting HTML through a Web server, you still can use them to manipulate and produce HTML code. Because Feeder is written rather modularly, converting its output to HTML format involves changing only the format_feed function and modifying the call to mail(). The call to mail() must change so that the message carries a Content-Type header of text/html, which allows the user's e-mail application to recognize it as an HTML document.
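The change to the mail() call might look like the following sketch. The fourth parameter of mail() accepts additional headers; the helper names and the choice of character set are assumptions made for this illustration and should be adjusted to match the feeds you read.

```php
<?php
// Hypothetical helper: builds the extra headers that mark a
// message as HTML. The character set is an assumption.
function html_mail_headers($charset)
{
  return "MIME-Version: 1.0\r\n" .
         "Content-Type: text/html; charset=$charset";
}

// Hypothetical wrapper around mail() for the HTML version
function send_html_update($to, $subject, $html)
{
  return mail($to, $subject, $html,
              html_mail_headers('iso-8859-1'));
}

// In Feeder, the final call would then become:
// send_html_update ($data[0], 'Feeder update', $result);
```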

One of the greatest advantages of scripting Web pages with PHP is the ability to mix dynamic statements directly with the static HTML code. As you can see from Listing 3, which shows an updated version of format_feed, this concept still works perfectly even when the script is not outputting to a Web page.

Listing 3. A Version of the format_feed Function that Produces HTML


// Feed formatting function

function format_feed ($feed, $url)
{

  ob_start();

  if (!is_object ($feed))
  {
  ?>
    <p>
    <b>Unable to load feed at
    <a href="<?= htmlentities($url) ?>">
    <?= htmlentities($url) ?></a></b></p>
  <?php
  }
  else
  {
    ?>

    <h1><a href="<?= $feed->url ?>">
    <?= htmlentities ($feed->title) ?></a></h1>
    <p />

    <?php

    foreach ($feed->items as $item)
    {
    ?>
        <h2><a href="<?= $item->url ?>">
        <?= htmlentities ($item->title) ?></a></h2>
        <div width=500>
        <?= htmlentities ($item->description) ?>
        <hr>
        </div>
    <?php
    }
  }

  $res = ob_get_contents();
  ob_end_clean();

  return $res;
}


The trick that makes it possible to capture PHP's output in a variable essentially consists of engaging the interpreter's output buffer (disabled by default) by calling ob_start(). Once the appropriate information has been output, the script retrieves the contents of the buffer with ob_get_contents() and then erases it. To turn output buffering off as well, the buffer should be discarded with ob_end_clean(); ob_clean() alone empties the buffer but leaves buffering active.
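Stripped of the feed-specific markup, the capture technique reduces to a few lines. In this sketch, render_greeting() is a hypothetical function used only to show the sequence of output-buffering calls.

```php
<?php
// Hypothetical example: capture echoed output in a variable
function render_greeting($name)
{
  ob_start();                // start capturing output

  echo '<p>Hello, ' . htmlentities($name) . '</p>';

  $res = ob_get_contents();  // copy what has been captured
  ob_end_clean();            // discard it and stop buffering

  return $res;
}

$html = render_greeting('Feeder');
echo $html . "\n";
```

Nothing is printed while the buffer is active; the markup reaches the terminal only when the script echoes the returned string.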

Where to Go from Here

Although the news aggregator script I present in this article performs a rather complex set of functions, from grabbing content off the Web to parsing XML and formatting it in HTML, it requires only about 200 lines of code, including all the comments and blank lines. It is possible to write the same script in Perl or even as a shell script, with the help of some external applications such as wget, expat and sendmail. The shell-script approach, in my opinion, results in a complicated code base with plenty of opportunities for mistakes.

PHP-CLI rarely is installed by default on a machine running Linux, although you can count on Perl being readily available. Thus, if you have control over the make-up of the server on which you're running scripts and you're comfortable with PHP, there's no reason why you need to learn another language to write most of your shell applications. If, on the other hand, you're writing code to run on a separate machine over which you have no control, you may find PHP a slightly more problematic choice.

Marco Tabini is an author and software consultant based in Toronto, Canada. His company, Marco Tabini & Associates, Inc., specializes in the introduction of open-source software in enterprise environments. You can reach Marco through his weblog at blogs.phparch.com.
