How to train/fix the Bayes Database in Mac OS X 10.5 Leopard Server?

Mac OS X Server 10.5, like 10.4.x, comes with SpamAssassin pre-installed. Postfix interfaces with SpamAssassin through amavisd-new.

SpamAssassin uses its Bayes database to improve accuracy. Besides its auto-learning capability, SpamAssassin also relies on users feeding spam and false positives to the Bayes database to further increase detection accuracy.

On Mac OS X Server, this is normally done by redirecting spam to a special mailbox called "junkmail" and ham (legitimate mail) to a mailbox called "notjunkmail". Apple provides a script called "learn_junk_mail", which runs once every 24 hours, reads the contents of the junkmail and notjunkmail mailboxes and feeds them to the Bayes database.
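
In outline, such a training run amounts to piping each message file in a mailbox directory to sa-learn. The following is a minimal sketch of that loop, not Apple's actual script. The names learn_mailbox and SA_LEARN are hypothetical, and the sketch assumes the mail store keeps one message per file with Cyrus-style names such as "1." and "2.". It would have to run as the user that owns the Bayes database.

```shell
#!/bin/sh
# Illustrative sketch only; this is NOT Apple's learn_junk_mail script.
# SA_LEARN and learn_mailbox are hypothetical names introduced here.
SA_LEARN="${SA_LEARN:-sa-learn}"

# learn_mailbox MODE DIR
#   MODE is "spam" or "ham"; DIR is a mailbox directory.
#   Assumes one message per file, named "<number>." (Cyrus-style).
#   Prints the number of messages handed to sa-learn.
learn_mailbox() {
    mode="$1"
    dir="$2"
    count=0
    for file in "$dir"/[0-9]*.; do
        # Skip the unexpanded pattern (empty mailbox) and non-files.
        [ -f "$file" ] || continue
        # --no-sync defers the database sync until all messages are read.
        "$SA_LEARN" "--$mode" --no-sync < "$file" > /dev/null || return 1
        count=$((count + 1))
    done
    echo "$count"
}
```

After both mailboxes have been processed, a single "sa-learn --sync" run writes the accumulated changes to the database.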

So far so good. Except… that it won't work because the setup is broken. For those of you who have used SpamAssassin under 10.4.x this may sound familiar. This time however, the cause is a different one. Please do not try to correct this using the same steps (symlink) as in 10.4.x.

In 10.5, the users and directories related to SpamAssassin, amavisd-new and ClamAV have been shuffled around a bit. Most processes now run under the system user _amavisd, and all system users needed by mail services (_amavisd, _clamav, _cyrus, _postfix) have no shell set. _clamav still has its home directory set to /var/clamav, but the home directory of _amavisd is now /var/virusmails, while amavisd.conf points to /var/amavis.
 
All these changes are basically very good, both for logical structure and for security. One side effect, though, is that the "learn_junk_mail" script no longer works correctly in 10.5.0-10.5.1 (Apple fixed it as of 10.5.2). As written, it runs as root and relies on switching to user _amavisd and its environment with "su -". But, as mentioned before, user _amavisd has no shell and its home directory is set to the "wrong" place. In other words, the script will run every 24 hours, but no training of the Bayes database will take place.

How can this be fixed? The simplest way is to leave everything as is and use spamtrainer instead of learn_junk_mail. It has been modified to run on Leopard and to circumvent the issues outlined above. It also has many more features, like the ability to delete the spam/ham it has learned from. It is available for free.

If you prefer to keep all things as Apple delivered them with Mac OS X Server, your best bet is to modify /etc/mail/spamassassin/learn_junk_mail. Alternatively you could obviously correct user _amavisd's home directory and give it shell access, but I wouldn't recommend it.

To fix this, make a backup of /etc/mail/spamassassin/learn_junk_mail, then edit the file as follows:

change:

cat $file | su - _amavisd -c "sa-learn --spam --no-sync >> /dev/null";

to:

cat $file | sudo -u _amavisd sa-learn --dbpath /var/amavis/.spamassassin --spam --no-sync >> /dev/null;

change:

cat $file | su - _amavisd -c "sa-learn --ham --no-sync >> /dev/null";

to:

cat $file | sudo -u _amavisd sa-learn --dbpath /var/amavis/.spamassassin --ham --no-sync >> /dev/null;

change:

su - _amavisd  -c "sa-learn --sync >> /dev/null"

to:

sudo -u _amavisd sa-learn --dbpath /var/amavis/.spamassassin --sync >> /dev/null
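
The three changes above can also be applied in one pass with sed. This is only a sketch: it assumes the script contains the "su - _amavisd -c" lines exactly as quoted above, and patch_learn_script is a hypothetical helper name. It keeps the untouched original next to the script with an ".orig" suffix.

```shell
#!/bin/sh
# Sketch: rewrite the 'su - _amavisd -c "sa-learn ..."' calls into
# 'sudo -u _amavisd sa-learn --dbpath ...' calls.
# patch_learn_script is a hypothetical helper, not part of Mac OS X Server.
#
# patch_learn_script SCRIPT [DBPATH]
#   SCRIPT  path to learn_junk_mail
#   DBPATH  Bayes database directory (default: /var/amavis/.spamassassin)
patch_learn_script() {
    script="$1"
    dbpath="${2:-/var/amavis/.spamassassin}"

    # Keep a backup of the original, preserving mode and timestamps.
    cp -p "$script" "$script.orig" || return 1

    # Read from the backup and write over the script in place, so the
    # script keeps its permissions. The pattern accepts one or more
    # spaces before -c and drops the surrounding double quotes.
    sed -e 's|su - _amavisd  *-c "sa-learn \(.*\) >> /dev/null"|sudo -u _amavisd sa-learn --dbpath '"$dbpath"' \1 >> /dev/null|' \
        "$script.orig" > "$script"
}
```

Run as root, the call would be: patch_learn_script /etc/mail/spamassassin/learn_junk_mail. Comparing the result with the ".orig" copy using diff shows whether all three lines were changed.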

When done, save and enjoy. From now on your Bayes database will be trained correctly.
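
To check that training actually reaches the database, the message counters can be read with sa-learn's "--dump magic" option. The helper below is a small sketch that pulls the spam and ham counts out of that dump. The name bayes_counts is hypothetical, and the sketch assumes the usual dump layout, in which the third column holds the value and the line ends in "nspam" or "nham".

```shell
#!/bin/sh
# Sketch: summarize the output of "sa-learn --dump magic" read from stdin.
# bayes_counts is a hypothetical helper name.
# Assumes lines of the form:
#   0.000          0        123          0  non-token data: nspam
bayes_counts() {
    awk '$NF == "nspam" { print "spam messages learned: " $3 }
         $NF == "nham"  { print "ham messages learned: " $3 }'
}
```

A possible invocation, run as root: sudo -u _amavisd sa-learn --dbpath /var/amavis/.spamassassin --dump magic | bayes_counts. If the counts grow from one daily run to the next, the script is training the database.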