An Easy Way to Avoid Spam

by Cezary M. Kruk

Spam has become a real problem these days. The more people who know your e-mail address, the more unwanted mail you receive. If you have used your e-mail address to register on Web sites or have posted a few articles to newsgroups, you probably receive more unwanted messages than welcome ones. Fortunately, the users of Linux and other open-source systems have written a lot of good anti-spam filters. You can find dozens of these programs on GNU.org and Freshmeat. Anti-spam software may be sophisticated or simple, easy or difficult to use, and more or less effective. Among the larger, well-known projects are SpamAssassin, written by Justin Mason, and bogofilter, a Bayesian filter written by Eric S. Raymond.

Testmail, the filter discussed in this article, is a Perl filter of average size and moderate complexity. It checks e-mail messages available at the POP3 server, filters them according to defined rules and, depending on the selected method, sends messages to the local mailbox or removes them from the server. Testmail requires the Perl libnet, Net-Ping and Socket modules.

First Steps

To install testmail, run the install.sh script as a regular user. The script asks you a few questions about the address of your POP3 host, your user name at the host, your password and so on. If you use several mailboxes, run install.sh once for each of them, giving each configuration a unique name.

At the end of the installation, the script displays information about further steps. After the installation is complete, you should put the addresses of all your trusted senders in the ~/.testmail/rul_from_accept file. It also is a good idea, although not necessary, to run the testmail --initialize command to prepare some standard files that testmail uses.

Figure 1. The test-*.acce file lists information about all accepted messages.

Assuming you use the a_nowak mailbox at the wp.pl server and you prepared a configuration named wp, all you need to do to check your mailbox is run testmail --test - wp. In this case, the --test parameter means run for testing purposes, and the dash stands for all the messages, from the first to the last. If there are eight messages at the server, for example, the dash is an abbreviation of 1-8.
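
The range shorthand can be illustrated with a short sketch. It is written in Python rather than testmail's own Perl, and the function name expand_range is hypothetical; it only mirrors the forms of the range argument that appear in this article and is not taken from the testmail source.

```python
def expand_range(spec, count):
    """Expand a message-range argument into a list of message numbers.

    spec is the range as typed on the command line ("-", "2", "-2", "1-8");
    count is the number of messages currently at the server.
    """
    if "-" not in spec:
        # A bare number selects a single message.
        first = last = int(spec)
    else:
        left, right = spec.split("-", 1)
        first = int(left) if left else 1       # missing start: first message
        last = int(right) if right else count  # missing end: last message
    if not 1 <= first <= last <= count:
        raise ValueError(f"range {spec!r} does not fit {count} messages")
    return list(range(first, last + 1))


print(expand_range("-", 8))   # the dash alone covers messages 1 through 8
print(expand_range("2", 3))   # a single message
print(expand_range("-2", 2))  # from the first message up to the second
```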

Figure 2. The test-*.reje file lists information about all rejected messages at the server.

From here, testmail checks the mailbox, taking into consideration the rules listed in the files in the ~/.testmail directory. It assigns scores to the individual messages according to the table of scores kept in the ~/.testmail/cfg_wp file, which describes the wp configuration. All the information about the messages available at the server is stored in the ~/test-wp.acce (Figure 1), ~/test-wp.reje (Figure 2) and ~/test-wp.log (Figure 3) files. The first file describes the accepted messages, the second lists the rejected ones and the third includes detailed information about the rules used during the evaluation of the messages.

Figure 3. If you want to know more about the rules used, check test-*.log.

A sample entry in the test-*.reje file looks something like this:

                3. [-35]
Date:           Fri, 25 Jul 2003 13:37:16
Return-Path:    <nancibcol@mailclub.net>
From:           <Nancibcol@mailclub.net>
To:             <a.nowak@bigfoot.com>
CC:             <no entries>
Subject:        Feel young a.nowak
Content-Type:   text/html
Content-Length: 1271

The number at the beginning of the first line is the number of that message at the server, 3 in our example. The number in the square brackets is the score of the message, -35 in the example. The rest of the record is the description of some important header fields. The entry in the test-*.acce file looks similar, but the score is positive rather than negative.

The test-*.log file contains the following information about the rules used for that message:

3 -10 @mailclub.net       <rul_from_reject>
3 +10 a.nowak@bigfoot.com <rul_to_accept>
3 -10 a.?nowak            <rul_subj_reject>
3 -10 feel                <rul_subj_reject>
3 -10 young               <rul_subj_reject>
3  -5 text/html           <rul_type_reject>
3 -35 --------------------------------------

The number in the first column is the number of the message at the server. The second column holds the scores, either positive or negative. The third column shows the particular rule that matched, and the fourth column names the type of rule used.

In detail for our example, @mailclub.net is registered as a rejected domain in the rul_from_reject file, and a.nowak@bigfoot.com is registered as an accepted recipient's address in the rul_to_accept file. The pattern a.?nowak (the account name on the server) and the words feel and young are registered as rejected subject keywords in the rul_subj_reject file. Finally, text/html is registered as a rejected content type in the rul_type_reject file.
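
To make the subject rules more concrete, here is a rough sketch in Python of how keywords like these could be matched against a Subject header. It assumes the entries behave as case-insensitive regular expressions, which is only inferred from the sample log (feel matches "Feel", and a.?nowak looks like a pattern). The function name matching_rules is hypothetical, and testmail itself is a Perl program.

```python
import re

# Entries copied from the sample log above. Treating them as
# case-insensitive regular expressions is an assumption.
subject_reject = ["a.?nowak", "feel", "young"]


def matching_rules(text, patterns):
    """Return the patterns that occur anywhere in the given header text."""
    return [p for p in patterns if re.search(p, text, re.IGNORECASE)]


hits = matching_rules("Feel young a.nowak", subject_reject)
print(hits)  # all three keywords match the sample subject
```

Under this reading, a.?nowak also would match variants such as anowak or a_nowak, because the ".?" part allows zero or one arbitrary character between the two pieces of the name.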

The score for an accepted recipient's address is +10, the score for rejected senders and subject keywords is -10 and the score for rejected content types is -5. Next to the dashed line is the total of all the subscores, -35 in the example. An evaluation like this means this e-mail message is spam, beyond all doubt.
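
The total can be checked by hand or with a few lines of code. The following Python sketch simply adds up the subscores from the sample log lines shown earlier; the function name total_score is hypothetical, and the parsing assumes the four-column layout of that sample.

```python
# The six rule lines from the sample test-*.log entry for message 3.
log_lines = """\
3 -10 @mailclub.net       <rul_from_reject>
3 +10 a.nowak@bigfoot.com <rul_to_accept>
3 -10 a.?nowak            <rul_subj_reject>
3 -10 feel                <rul_subj_reject>
3 -10 young               <rul_subj_reject>
3  -5 text/html           <rul_type_reject>
""".splitlines()


def total_score(lines, message_number):
    """Sum the subscores logged for one message."""
    total = 0
    for line in lines:
        # Columns: message number, score, matched rule, rule file.
        number, score, _rule, _rule_file = line.split()
        if int(number) == message_number:
            total += int(score)
    return total


print(total_score(log_lines, 3))  # -35, the same total as in the log
```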

In Practice

Testmail uses scores to divide messages into positively and negatively scored ones. It then distinguishes between high-scored and low-scored messages. By default, positively scored messages are expected to be retrieved, while negatively scored or zero-scored ones are expected to be deleted. You can change these defaults by using the appropriate commands.

If you have created the configuration named wp, the command testmail --extdget - wp pulls all the positively scored messages and deletes the low-scored ones, that is, all the messages scored -15 and below. Then, the command testmail --test - wp displays information about the messages remaining at the server. If three negatively scored messages are on the server, and you want the second one but not the others, use the command testmail --getall 2 wp to get the second message. Then, use the command testmail --del -2 wp to delete the two remaining messages.

All retrieved messages for the wp configuration are stored in the ~/mbox-wp mailbox. The information about the retrieved messages is stored in the ~/done-wp.acce file, while information about the deleted ones is in the ~/done-wp.reje file.

If you don't want to deal with those sophisticated commands and files at all, you can run the testmail program using a simple testmail --force - wp command. As a result, the program gets all the positively scored messages as well as the negatively scored ones, up to the value determined by the AUTODEL variable listed in the configuration file (-15, by default). It then removes the rest, that is, all the low-scored messages. You end up seeing more spam this way than in the former example, but the entire process is automated.
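
The difference between the two modes can be summarized in a small decision function. This Python sketch follows the description in this section only. The function name action and the "keep" label are hypothetical, and two details are assumptions rather than documented behavior: that --force uses the same boundary as --extdget (AUTODEL and below is deleted), and that zero-scored messages are treated like mildly negative ones.

```python
AUTODEL = -15  # default threshold quoted in the article


def action(score, mode, autodel=AUTODEL):
    """Return 'get', 'keep' (leave at the server) or 'delete' for a message."""
    if mode not in ("extdget", "force"):
        raise ValueError(f"unknown mode: {mode!r}")
    if score <= autodel:
        return "delete"  # low-scored: removed in both modes
    if score > 0:
        return "get"     # positively scored: retrieved in both modes
    # Scores above autodel but not positive: the two modes differ here.
    return "get" if mode == "force" else "keep"


for score in (20, -10, -15, -35):
    print(score, action(score, "extdget"), action(score, "force"))
```

In other words, --extdget leaves the doubtful messages at the server for you to inspect with --test, while --force downloads them so that nothing is left behind.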

Figure 4. Testmail displays help, including all the parameters and information about the configuration.

Advanced Use

Testmail has a lot of features, and you can add your own rules to the configuration files to make the program more effective. You also can switch some variables to make testmail more verbose. The additional shell scripts can help you in processing log files. You also can test new sets of rules across log files of already-received and deleted messages. You even can record some statistics about those messages. Sample statistics, as displayed by the program, include the following:

-------------+-------------
 categories  | messages
-------------+-------------
 total       |      946
 accepted    |      773
 rejected    |      173
 released    |        6
 stopped     |       14
 e-mail      |       82.6%
 spam        |       17.4%
-------------+-------------
 hit         |       97.9%
 missed      |        2.1%
-------------+-------------

You see here the total number of messages included in the log files, divided into the number of messages accepted and rejected by testmail. Released means the messages that the program classified as accepted but that we actually did not want, while stopped means the messages classified as rejected that we wanted anyway. E-mail and spam are the percentages of messages in each of these categories. The summary shows the percentage of hit and missed messages. A well-configured testmail achieves about 95% hits and 5% misses.
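
The percentages in the sample table can be reproduced from the counts. The arithmetic below is a reconstruction, not testmail's own code: it assumes that e-mail counts the messages that really were wanted (accepted, minus released, plus stopped), that spam counts the rest, and that a miss is any released or stopped message. With those assumptions, the results agree with the table.

```python
# Counts taken from the sample statistics above.
total, accepted, rejected = 946, 773, 173
released, stopped = 6, 14

# Correct the program's verdicts by the messages it got wrong.
wanted = accepted - released + stopped    # real e-mail
unwanted = rejected - stopped + released  # real spam
misses = released + stopped               # wrong verdicts
hits = total - misses                     # right verdicts


def pct(part):
    """Percentage of all messages, rounded to one decimal place."""
    return round(100 * part / total, 1)


print(f"e-mail {pct(wanted)}%  spam {pct(unwanted)}%")
print(f"hit {pct(hits)}%  missed {pct(misses)}%")
```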

For more testmail help, run testmail wp, assuming you have the wp configuration (see Figure 4). If you want to know more, read the program's full documentation: the readme.txt text file and the testmail.1 man page.

Resources

Testmail

Bogofilter

SpamAssassin

Anti-Spam Software at GNU.org

Freshmeat.org

Cezary M. Kruk lives in Wroclaw, Poland. He is an editor for the Polish quarterly magazine CHIP Special Linux.
