Consumer Rankings
On-line booksellers have changed the way I decide which books to buy. Not only have they made it possible for me to get most books within two to three days, but they have also given me a means of comparison shopping previously unavailable to me. If I am thinking of buying a book, I immediately look to see what others have written about it. In most cases, these evaluations clinch the deal, convincing me to buy a certain book instead of its competition.
I have been thinking more about such systems since my recent move to Modi'in, a new Israeli city halfway between Jerusalem and Tel Aviv. Just before I moved to Modi'in, I was asked to take over a small mailing list for residents of Modi'in. The list gives people a chance to share local announcements and ideas related to life in Modi'in.
Soon after becoming the list administrator, I realized that subscribers often asked for recommendations, from doctors to lawn services to after-school activities. Normally, an e-mail list administrator who sees such constant repetitions will prepare a FAQ, a list of common questions and their answers. But recommendations are extremely subjective, and one person's favorite barber may be someone else's nightmare.
This month, we will look at a set of CGI programs I wrote to allow list subscribers to enter and rank their favorite products and services in town. Because my web space provider does not offer mod_perl, I had to use the CGI standard for writing my programs.
This “ranking” system, as I call it, consists of three CGI programs written in Perl, which use a relational database for back-end storage. (These listings are not printed here due to space considerations, but can be downloaded from the LJ FTP site. See Resources.) I have used MySQL, but there is no reason another relational database, such as PostgreSQL or Oracle, could not be substituted for it. Some of the SQL syntax might have to be changed in order to fit another database server, but the majority should remain the same.
While the ranking system presented this month is not as sophisticated as the one used by Amazon.com nor as flexible as that used by Epinions.com, it does serve a simple purpose. Moreover, it demonstrates how to produce a simple ranking system, which could easily be extended to produce a rough version of Epinions.com.
As always, the first step in creating a database/web application is to consider how we want to store the information, and then to create the tables in our relational database. In this particular case, we will keep things simple, dividing ranked items into categories, but without any hierarchy representing the categories. We will thus be able to keep all restaurants in the same category, but without any distinction between Italian restaurants and French restaurants. Alternatively, we can create two separate categories for Italian and French restaurants, but then the system will see them as unrelated as barbers and vacuum-cleaner salesmen.
We will also associate a name and e-mail address with each ranking. It might be more elegant to place user names in a separate table and refer to them with a numeric key. However, we are less interested in tracking users than in making it possible to find useful consumer information.
Given all this, I decided to implement the ranking system with three tables: RankCategories, RankItems and Rankings. RankCategories, as its name implies, contains the categories into which the items are sorted, and can be defined as follows:
CREATE TABLE RankCategories ( category_id MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, category_name VARCHAR(25) NOT NULL, category_description TEXT NULL, UNIQUE(category_name) );
Notice how each category will have a unique category_id, allowing us to refer to it by number alone. By using the AUTO_INCREMENT feature, MySQL can automatically set this number for us, ensuring that it is unique. For a small web site, a MEDIUMINT is large enough, ranging from 0 to 16,777,215. A large site might eventually have more than 17 million rankings, in which case a larger size, such as INT or BIGINT, might be a good idea.
We also want to ensure that no two categories can have the same name, and thus add a unique constraint to the category_name column as well. Each category can then have some descriptive text associated with it, which is placed in category_description.
The items to be ranked are placed in a similar table, RankItems:
CREATE TABLE RankItems ( item_id MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, category_id MEDIUMINT UNSIGNED NOT NULL, item_name VARCHAR(25) NOT NULL, item_description TEXT NULL, UNIQUE(item_name) );
Once again, each item has a unique value in item_id, and the item name is guaranteed to be unique, thanks to the UNIQUE constraint. However, RankItems adds a category_id column, identifying the category in which this item sits.
RankItems defines the items available to be ranked, but does not store the grades. That role is assigned to the Rankings table, defined as follows:
CREATE TABLE Rankings ( ranking_id MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, item_id MEDIUMINT UNSIGNED NOT NULL, ranker_name VARCHAR(30) NOT NULL, email VARCHAR(40) NOT NULL, entry_datetime TIMESTAMP(8) NOT NULL, comments TEXT NULL, rank TINYINT UNSIGNED NOT NULL, UNIQUE(item_id, email) );
Like the other tables, Rankings gives each ranking a unique primary key, ranking_id. None of our applications use ranking_id, and it is possible that its inclusion in the table definition wastes some space on disk and in memory. However, such a primary key will make it easier to refer to items in Rankings if and when we build additional applications for the ranking system.
Rankings then contains an item_id column, which refers back to the primary key of the same name in RankItems. Because each item is in a single category and RankItems contains a category_id column, there is no need to name the category in Rankings as well.
We ask each ranking user to provide his or her full name and e-mail address. This information is displayed next to a ranking, in order to give it a slightly more human touch. In addition, some people's opinions will carry more weight than others, especially in a small community of users, so it is worthwhile to identify opinions by name.
The ranking consists of a numeric rank between 0 and 10, stored in a TINYINT. It is accompanied by optional (NULL) comments, stored in a TEXT column, where the user can elaborate on his or her ideas.
To ensure each user can rank each product only once, we have MySQL require the combination of item ID and e-mail address to be unique in the Rankings table. By combining the two in this way, the database itself will reject any attempt to enter two rows in which the combination of e-mail address and item_id are identical.
Now that we have seen how the data will be stored, we will write several programs that allow users to enter information into the tables. The interface presented here might seem a bit primitive, but this doesn't really matter. Since the information is stored in a database, we can always write new programs to improve or change the interface.
Since each RankItem must be placed in a category, we must first write a program that adds new categories to the system. Listing 1, rank-category.pl (see Resources), is a simple example of such a program.
When invoked with the GET method, such as when a user enters a URL into a browser, rank-category.pl produces an HTML form into which details about a site can be entered. The form is submitted to rank-category.pl which handles the data submitted with POST, adding a new row into the RankCategories table. After the new row is added to the database, rank-category.pl displays the HTML form once again. This makes it relatively easy to add multiple categories.
rank-category.pl, as well as the other programs presented this month, use CGI.pm, the standard Perl module (written by Lincoln Stein) for working with CGI programs. We also import a number of subroutines into the current name space (grouped together with the :standard tag), making it easier to produce HTML output. The p function, for example, surrounds text with <P> and </P> tags. Thus the following:
print p("Hello");
is the same as
print "<P>Hello</P>";
but is more readable. In addition, CGI.pm's HTML subroutines can work on multiple strings and can be nested:
print p("This will be in", b("bold"), "type");
CGI.pm automatically inserts whitespace between arguments to these subroutines, so there is no need to put spaces at the end or beginning of quoted strings.
To set HTML tag attributes, pass a hash reference to the subroutine as the first argument. For example, the following code comes from rank-category.pl (Listing 1):
print p("Now go ", a({-href=>'/cgi-bin/rank-something.pl'}, "rank something"), "!");
The above code puts the text “rank-something” inside of a hyperlink pointing to /cgi-bin/rank-something.pl, producing the same HTML as the following, without having to worry about quotes or whitespace:
print qq{<P>Now go <a href="/cgi-bin/rank-something.pl"> rank something</a>!</P>By default, CGI.pm does not insert whitespace between HTML tags generated using these routines. This might be more efficient for the computers, but makes it difficult to examine and debug the HTML output. We thus use CGI::Pretty, a subclass of CGI.pm which inserts appropriate whitespace between the tags. (If your system does not support CGI::Pretty, use CGI.pm instead. The output will not be formatted as nicely, but will still work.)
Regardless of how it is invoked, rank-category.pl always produces an HTML form consisting of two text fields, new_category_name and new_category_description. These are submitted back to rank-category.pl, which inserts the information into the database.
The program's connection to the database is managed by DBI, the Perl database interface available from CPAN (the comprehensive Perl archive network, at http://www.cpan.org/). DBI provides a generic API to a relational database, in conjunction with a specific driver (DBD) for each type of database server. DBDs are available for most popular brands of database server and make it possible to port Perl programs across platforms with a minimum of difficulty.
Assuming that new_category_name and new_category_description contain valid data, we insert a new row into the RankCategories table by building up an SQL query. Here is the Perl code that performs this task:
my $sql = "INSERT INTO RankCategories "; $sql .= "(category_name, category_description) "; $sql .= "VALUES (?,?) "; my $sth = $dbh->prepare($sql) || die "Cannot prepare: $DBI::errstr"; my $success = $sth“>execute($new_category_name, $new_category_description);
Notice how $sql contains question marks (?) rather than actual data. These values are set in $sth->execute, with the first argument ($new_category_name) being assigned to the first place holder and the second argument ($new_category_description) assigned to the second.
rank-category.pl does not benefit in an obvious way from the use of place holders, which speed up multiple invocations of the same SQL query, with minor variations. However, using place holders also makes it possible to avoid problems that can arise when interpolated variable values contain ' and " characters. DBI handles and quotes them automatically, making the program more readable and eliminating the possibility of these sorts of errors.
Once one or more categories have been added to RankCategories, users can begin to rank individual items. This is accomplished with Listing 2, rank-something.pl (see Resources), which produces an HTML form so that users can rank an item. Users have the option of ranking something already in the database, or adding a new item inside an existing category.
The form must be created by a program, rather than written as a static document, because it displays pop-up menus of catalogs and items from the database tables. When it is first invoked, rank-something.pl retrieves the items and categories from RankItems and RankCategories, respectively, putting them into hashes for easy retrieval.
DBI makes it relatively easy to retrieve a table into a hash with its fetchrow_hashref method. However, the documentation explicitly states that fetchrow_hashref is not very efficient, so we use fetchrow_arrayref, retrieving each row as its own array reference and storing the information in a hash:
$sql = "SELECT category_id, category_name "; $sql .= "FROM RankCategories "; $sth = $dbh->prepare($sql) || die "Cannot prepare: $DBI::errstr"; $result = $sth->execute || die "Cannot execute: $DBI::errstr"; my %categories = (); while (my $row = $sth->fetchrow_arrayref) { my ($id, $name) = @$row; $categories{$id} = $name; }
The “existing item” and “new item” sections of the form both use the popup_menu function provided by CGI.pm for the creation of <select> lists. A pop-up menu has a name and a set of potential values, each of which can be optionally associated with descriptive text. For example:
<select name="number"> <option value="1">one <option value="2" selected> two <option value="3">three </select>The above HTML creates a three-element pop-up menu named “number”. This menu can pass one of three options (1, 2 or 3) to a CGI program. However, the user never sees the values; instead, the menu is labeled with English words corresponding to the values. While the user can select and submit any of the menu's values, the two element is selected by default.
Given that the existing items are placed in the %items hash (just as the above code places existing categories in the %categories hash), we can create a pop-up menu with the following code:
popup_menu(-name => "existing_item_id", -values => [0, (sort { $items{$a} cmp $items{$b}} keys %items)], -labels => {0 => "Choose one", map {($_, $items{$_})} keys %items}, -default => $existing_item_id, -override => 1),
popup_menu returns a text string, appropriate for passing to print or saving to disk (if the program is creating a file of HTML-formatted text). The name parameter sets the name attribute, and the default parameter indicates which attribute will be selected by default.
The values parameter takes an array reference as an argument. In this case, the values should be the keys of %items (i.e., the primary keys from RankItems), but sorted in the order of each key's item_name value. The solution is to sort keys %items by value, rather than by key. Adding a 0-value element to the front of the values parameter means the first option will always have a value of 0. MySQL does not use 0 in AUTO_INCREMENT columns, and Perl sees 0 as false—so setting the default value to 0 will never collide with an actual value of item_id, and can easily be identified when passed to our program via POST.
By default, popup_menu will set the descriptive text for each <option> to the value itself. However, the labels parameter makes it possible to assign custom labels to some or all values. The labels parameter takes a hash reference as input. Since a hash in list context turns into a comma-separated list of keys and values, we can stick %items into the hash reference, preceded by a mapping from 0 to the text “Choose one”.
Unfortunately, HTML does not support hierarchical pop-up menus. It would be easiest and best for the user if we could use popup_menu to provide a menu of categories that would lead to a list of associated items. Given the choice between requiring a second program invocation (as we did with view-ranking.pl, described below) and producing a flat list of items, I chose the latter. Another approach is to use the method pioneered by Yahoo! and used by Epinions, in which each category hyperlink is followed by links to the most popular individual items. Implementing such an approach would require some changes to rank-something.pl (and presumably view-ranking.pl), but not to the underlying database.
When rank-something.pl is invoked with the POST method, it looks for a new item that it will need to insert into RankItems. The actual insertion is rather straightforward, using DBI's prepare and execute methods, as we have seen before:
$sql = "INSERT INTO RankItems "; $sql .= "(category_id, item_name, item_description) "; $sql .= "VALUES (?,?,?) "; $sth = $dbh->prepare($sql) || die "Cannot prepare: $DBI::errstr"; $success = $sth->execute($item_category, $item_name, $item_description);
If the execution is unsuccessful, $success will be set to a false value. We can trap errors this way, producing an error message like the following:
unless ($success) { print h1("Error inserting new item"); print p( "There was an error inserting the item:". $DBI::errstr"); print p( "Perhaps this item already exists in the". database?"); exit; }We can now be sure the item to be ranked is in RankItems. If the item was already in RankItems, then we knew its primary key from the <select> list in the HTML form. However, how can we retrieve the primary key of the new item just inserted? The DBI driver for MySQL provides an attribute, called mysql_insertid, which returns the primary key of the most recently inserted row. We can retrieve this value and store it in $item_id, which otherwise would get its value from the pop-up menu:
$item_id = $dbh->{"mysql_insertid"};As we saw earlier, each ranking consists of an integer between 0 and 10 (from worst to best), as well as user comments about the item. In order to avoid potential formatting problems, I decided to remove all HTML tags from the comments. Using Perl's non-greedy regular expressions, this is an easy task:
$comments =~ s|<.*?>||g;The above substitutes all occurrences of <, followed by zero or more characters, followed by >, with the empty string. The |g modifier at the end of the s||| operator performs this operation globally.
Next, we handle paragraph separators. We want people to be able to enter more than one paragraph, but cannot let them use <P> and </P> tags to do so. The solution is to treat every occurrence of more than one \r (carriage return) or \n (newline) character as a paragraph separator. UNIX machines rarely produce \r characters in text input, but DOS/Windows systems end lines with a combination of \r\n, and Macintoshes use a single \r. The following turns any two (or more) of these into a <br> tag, followed by two newlines:
$comments =~ s|[\r\n]{2,}|<br>\n\n|g;
Finally, we insert the new ranking into the Rankings table:
$sql = "INSERT INTO Rankings "; $sql .= "(item_id, ranker_name, email, comments, rank) "; $sql .= "VALUES (?,?,?,?,?) "; $sth = $dbh->prepare($sql) || die "Cannot prepare: $DBI::errstr"; $success = $sth->execute($item_id, $ranker_name, $email, $comments, $rank);Once again, we use question marks (?) as place holders, potentially speeding up multiple queries and removing the need for us to explicitly quote the individual items.
If the INSERT is successful, the user is given a short message:
print p("Your ranking was successfully entered.");
Following this message, the user is given the chance to rank another item in the database or view the current rankings database by clicking on a hyperlink.
Finally, once data has been entered into Rankings, we can write Listing 3, view-ranking.pl (see Resources), a program that lets us look through the rankings and read them. There are many ways to present the information, and I took the easy way in this version of the program, forcing users to go through two menus (one of categories and a second of items in that category) before seeing the list of rankings for a particular item. Once again, we use the GET/POST trick to write a program that both creates a form and accepts its data.
If view-ranking.pl is invoked with GET, it creates a simple HTML form with the categories in a pop-up menu:
print $query->start_html(-title => "Choose a category"); print h1("Choose a category"); print startform(-method => "POST", -action => $query->url); print p("Select a category to view:", popup_menu(-name => 'category_id', -values => [sort {$categories{$a} cmp $categories{$b}} keys %categories], -override => 1, -labels => \%categories)); print submit(-value => 'View items in this category'); print endform;
If view-ranking.pl is invoked with POST, it checks to see whether category_id was set. If so, then it assumes the invoking form was the above, and displays a list of items within that category:
$sql = "SELECT C.category_name, I.item_name, "; $sql .= " I.item_description, AVG(R.rank)"; $sql .= "FROM RankItems I, RankCategories C, "; $sql .= " Rankings R "; $sql .= "WHERE I.category_id = C.category_id "; $sql .= "AND I.item_id = $item_id "; $sql .= "AND I.item_id = R.item_id "; $sql .= "GROUP BY I.item_id = R.item_id "; $sth = $dbh->prepare($sql) || die "Cannot prepare: $DBI::errstr"; $result = $sth->execute || die "Cannot execute: $DBI::errstr";This seemingly complex SQL query retrieves information from all three tables, getting the category name from RankCategories and information on this item from RankItems. The AVG function returns the average value from all returned rows, making it possible to get a sense of where the item should truly be ranked. Of course, there is no way to stop a malicious user (or the owner of a ranked business) from trying to skew the scales in a particular direction, so it is just as important for users to read the comments and the individual rankings as the average score.
Once we have retrieved general information on the item, we perform a second SELECT, requesting all rows for this item in chronological order:
$sql = "SELECT ranker_name, email, comments, rank "; $sql .= "FROM Rankings "; $sql .= "WHERE item_id = $item_id "; $sql .= "ORDER BY entry_datetime ";
The results of this query are then printed for the user:
while (my $row_ref = $sth->fetchrow_arrayref) { my ($name, $email, $comments, $rank) = @$row_ref; print p(dt(a({href => "mailto:$email"}, $name), "*" x $rank, " ($rank)"), dd($comments)); }HTML's <dt> and <dd> tags are perfect for formatting these sorts of comments, handling the indentation automatically. Perl's x operator, which multiplies a text string, makes it simple to produce the correct number of stars associated with a particular review.
The ranking software I presented this month is only in its earliest stages and will undoubtedly have improved by the time this issue of Linux Journal reaches your hands.
Along with improvements in the user interface and the possible addition of a hierarchical category system, there are several related applications I would like to add. One would return the highest-ranked items within a category, allowing a user to find the best choice without having to wade through dozens or hundreds of reviews. Hard-core users of this system might want to read only those reviews which have appeared since the last time they visited the site. Some additional personalization, including the use of cookies to remember the user's name and e-mail address, would reduce the amount of typing someone would have to do. An editing facility for system administrators will undoubtedly prove useful as the system grows. Finally, it is always nice to provide a search function, just in case an Italian restaurant is accidentally classified as a café.
The programs we examine this month demonstrate that it is not particularly difficult to produce a simple user ranking system. This is especially true if we take advantage of the power a relational database affords us in storing and retrieving data. Best of all, subscribers to my e-mail list can now spend their time trading information, leaving the recommendations to a set of CGI programs.
Reuven M. Lerner , an Internet and Web consultant, moved to Modi'in, Israel following his November marriage to Shira Friedman-Lerner. His book Core Perl will be published by Prentice-Hall in the spring. Reuven can be reached at reuven@lerner.co.il. The ATF home page, including archives, source code and discussion forums, is at http://www.lerner.co.il/atf/.