Spam Filtering with Bogofilter

Note: This document describes how to set up automatic e-mail filtering with a software package called "bogofilter." Our primary suggested method for spam filtering is SpamAssassin(TM), which is described at http://pubpages.unh.edu/notes/spamassassin.html. I suggest you try SpamAssassin-based filtering before attempting to use bogofilter. Bogofilter-based filtering is very effective (especially in combination with SpamAssassin), but it requires somewhat more effort on your part than does using SpamAssassin alone.

General Filtering Guidelines

People often ask for a method to handle unwanted e-mail (aka ``spam'') sent to their CIS Unix accounts. Spam is a waste of time and computing resources, and people who send spam (aka ``spammers'') are often sleazy types peddling shoddy products and services, often of dubious legality.

Worse, high volumes of spam can make it difficult to deal with your non-spam mail. (Following convention, we'll call your non-spam mail ``ham'' in the following.)

As a matter of policy, CIS Unix users are considered to be responsible for making their own decisions on mail they do and don't want to read. CIS doesn't want to be in the business of reading your incoming messages and deciding whether you would find them uninteresting or offensive. But we do want to give you the tools that minimize the hassle of dealing with such messages.

The procmail program on the CIS Unix systems is used to deliver mail to CIS Unix users. Its normal operation is to put all incoming messages in your default INBOX. But you can also have a .procmailrc file in your home directory that will alter this operation based on the content of the incoming messages. This process is called ``filtering;'' the .procmailrc file contains one or more ``rules'' to automatically divert certain messages to other mailboxes, forward them to other addresses, or delete them entirely.

No automatic process can (yet) substitute for your personal judgment on whether any given message is spam or not. Inevitably, even carefully designed filters will have ``false positives'' (messages that the filter thinks look like spam, but aren't), and ``false negatives'' (messages that the filter thinks don't look like spam, but are).

So please note: CIS will not be responsible for incorrect classification of incoming messages, either false positives or false negatives.

For this reason, this method doesn't throw away suspicious mail without giving you a chance to read it (although an easy change will do that). Instead, suspicious mail is diverted to mailboxes other than your normal INBOX. The idea is that you zip through these other mailboxes at lower priority, with the expectation that the messages in them are almost certainly all junk.

About Bogofilter

Bogofilter relies on the fact that the spam you receive ``looks'' very different from your ham. It checks the words in each message against your personal database of ``spam words'' and ``ham words'', and uses a statistical calculation to compute the likelihood that the message is spam or ham.

Bogofilter needs to be initialized, trained, and tuned using terminal sessions, which means you probably need to use a terminal-based mail client like pine or Mutt. And you will need to be able to use a Unix editor program, like pico, vi, or emacs. Once bogofilter has been tuned, however, it doesn't need a lot of further tweaking, and so any mail software (WebMail, for example) should be fine.

In the step-by-step descriptions below, we'll assume that the mailer you use is pine, and that the editor you prefer for editing files is pico. Substitute appropriately for your own case.

Setting Up Filtering

You begin by telling procmail to show incoming messages to bogofilter, and to deliver to different mail folders based on the results.

Create (or edit) your .procmailrc file with any Unix editor; we'll use pico in this example:

    % pico ~/.procmailrc
    

And put the following magic lines therein:

    :0fw
    | bogofilter -u -e -p

    :0e
    { EXITCODE=75 HOST }

    :0:
    * ^X-Bogosity: Spam, tests=bogofilter
    mail/IN.spam

    :0:
    * ^X-Bogosity: Unsure, tests=bogofilter
    mail/IN.unsure

Please note that punctuation and spacing are extremely important in a .procmailrc file. All lines should start in the leftmost column. Save the file and exit from the editor.

Note: if you already have a .procmailrc file, the above lines would typically go at the end, to be applied after the existing rules have been worked through. (Specifically, if you've already set up SpamAssassin filtering, then the above lines would go after the lines that test for SpamAssassin results.)

In English, what these lines do is:
  1. Run bogofilter (with appropriate options) on each incoming message.
  2. If there's an error, the mail is requeued for later delivery.
  3. If bogofilter has marked the message as spam, it will be delivered to the IN.spam folder in your mail/ directory.
  4. If bogofilter is uncertain whether the message was spam or ham, it will be delivered to the IN.unsure folder in your mail/ directory.
  5. Otherwise, the mail (assumed to be ham) will be delivered to your normal INBOX.
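
If you'd like to see what bogofilter thinks of a single saved message before trusting the rules above, you can run it by hand. The following is a minimal sketch, not part of the setup itself: the function name bogo_verdict is ours, and it assumes the exit codes given in the bogofilter manual page when the -e option is not used (0 for spam, 1 for ham, 2 for unsure, anything else for an error). Because -u is left out, nothing is added to your word database.

```shell
# Hypothetical helper (the name is ours, not part of bogofilter):
# report bogofilter's verdict for one message read from standard input.
bogo_verdict() {
    rc=0
    # -v prints the X-Bogosity line; without -u nothing is registered.
    bogofilter -v || rc=$?
    case "$rc" in
        0) echo 'spam: would go to mail/IN.spam' ;;
        1) echo 'ham: would go to INBOX' ;;
        2) echo 'unsure: would go to mail/IN.unsure' ;;
        *) echo 'error: bogofilter could not classify the message' ;;
    esac
}

# Usage (the file name is a placeholder for a message saved one per file):
# bogo_verdict < some-saved-message.txt
```

The three non-error cases correspond to steps 3 through 5 in the list above.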

Training

The above steps cause your incoming mail to be processed by bogofilter. What you now need to do is ``train'' bogofilter to recognize what you consider to be spam messages, and what you consider to be ham messages.

You need to be able to ``tell'' bogofilter that particular messages are spam or not; you can do this from within pine, while you're reading the messages or looking at the message index. You set this functionality up by enabling Unix pipe commands in your pine configuration file. From pine's main menu:

  1. Type 'S' (Setup)
  2. Type 'C' (Config)
  3. Use the down-arrow key to highlight 'enable-unix-pipe-cmd' (which is in a list under Advanced Command Preferences).
  4. If necessary, ``set'' this preference by typing 'X'. (There should then be an X in the box to the left of the preference.)
  5. Type 'E' (Exit Setup)
  6. Type 'Y' (Save Changes)

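Optionally, you can do initial training of your personal bogofilter database by classifying messages you've already received as spam or ham. If you'd like to do that, open your INBOX (or any folder where you've saved incoming messages) and, for each message:

  1. Type '|' (Pipe) while the message is highlighted in the index or open for reading.
  2. At the prompt, enter bogofilter -s if the message is spam, or bogofilter -n if it is ham.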

Note that the only difference is whether you type -s for spam, or -n for ham. Do this until you run out of messages or get bored or tired. You can do this anytime, but you probably shouldn't classify any message more than once.
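
The same registration can be done from the shell for messages you have saved one per file. This is only a sketch: the function names train_spam and train_ham are ours, and the file names in the usage lines are placeholders. The only bogofilter options used are the -s and -n options mentioned above.

```shell
# Hypothetical wrappers (the names are ours, not part of bogofilter).
# Each reads one mail message on standard input.

# Register the message as spam in your personal word database.
train_spam() {
    bogofilter -s
}

# Register the message as ham (non-spam).
train_ham() {
    bogofilter -n
}

# Usage (placeholder file names):
# train_spam < saved-spam-message.txt
# train_ham  < saved-ham-message.txt
```

As with training from within pine, avoid feeding the same message through these more than once.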

Tuning

At this point, based on your training (if any), incoming messages will either go into (a) your normal INBOX; (b) the IN.spam folder in your mail/ subdirectory; or (c) the IN.unsure folder in your mail/ subdirectory. (Folders in your mail/ subdirectory appear in your pine folder list.)

You'll want to handle these three mailboxes a little differently:
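
One thing to keep in mind is that the -u option in the .procmailrc rules registers each incoming message the way bogofilter classified it. A misfiled message is therefore normally corrected by moving it from one count to the other, rather than by registering it a second time. Here is a minimal sketch of such corrections, assuming the -Ns and -Sn option combinations behave as described in the bogofilter manual page; the function names are ours.

```shell
# Hypothetical wrappers (the names are ours, not part of bogofilter).
# Each reads one mail message on standard input.

# Spam that slipped into your INBOX: -u registered it as ham,
# so unregister it as ham (-N) and register it as spam (-s).
fix_missed_spam() {
    bogofilter -Ns
}

# Ham that was diverted to IN.spam: -u registered it as spam,
# so unregister it as spam (-S) and register it as ham (-n).
fix_false_positive() {
    bogofilter -Sn
}

# Usage (placeholder file names):
# fix_missed_spam    < spam-found-in-inbox.txt
# fix_false_positive < ham-found-in-spam-folder.txt
```

From within pine, the same option strings can be given to bogofilter at the pipe prompt. If we read the manual page correctly, messages that land in IN.unsure are not registered by -u at all, so for those a plain bogofilter -s or bogofilter -n is the appropriate command.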

Advanced Use and Resources

Replacing mail/IN.spam with /dev/null in your .procmailrc file will cause incoming mail that would normally be placed in your IN.spam folder to be simply deleted instead. We don't recommend you do that due to the theoretical possibility of false positives, but we won't stop you.

There's more than one way to configure bogofilter. If you're interested, see the bogofilter home page. The shell command

    % man bogofilter

will, of course, show you the full command description. The idea behind bogofilter is described here.


Last modified: Friday, 12-Jun-2009 15:35:54 EDT
Paul A. Sand
pas@unh.edu