Automating SpamAssassin's learning process on Plesk
Update [22/11/07]: I have updated the script to make it more generic.
Among other techniques, SpamAssassin uses a Bayesian filter to estimate the probability that a mail is spam. The filter scores a mail by how strongly the words it contains are associated with the spam or non-spam mail it has seen before. For this to be effective, the filter needs to be taught, and the teaching can be automated to a degree.
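As a simplified illustration of the idea (not SpamAssassin's exact formula, which combines many word scores in a more elaborate way), Bayes' rule gives the probability that a mail is spam given that it contains a word w. The 40% and 1% figures below are made-up numbers for a hypothetical word, and equal prior probabilities for spam and ham are assumed:

```latex
\[
P(\text{spam} \mid w) =
  \frac{P(w \mid \text{spam})\,P(\text{spam})}
       {P(w \mid \text{spam})\,P(\text{spam}) + P(w \mid \text{ham})\,P(\text{ham})}
\]

% Hypothetical numbers: w appears in 40% of spam and 1% of ham, equal priors.
\[
P(\text{spam} \mid w) =
  \frac{0.4 \times 0.5}{0.4 \times 0.5 + 0.01 \times 0.5}
  = \frac{0.2}{0.205} \approx 0.976
\]
```

The per-word frequencies are exactly what the filter has to learn from examples, which is why it needs to be fed both spam and non-spam.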
It took a fair amount of Googling to work out this solution, so hopefully it will save you some time if you have a similar setup to mine (Linux, Plesk, qmail, SpamAssassin). You may need to substitute your Spam, Learn and Trash folder names if they differ from mine:
1. Inside your Spam mail folder, create a folder named Learn.
2. If a spam mail is not caught by SpamAssassin, get in the habit of moving it manually to your Spam/Learn folder (do this in your mail client).
3. Create a script /var/scripts/dailyMailJobs as follows:
#!/bin/bash
MAILNAMES_PATH="/var/qmail/mailnames"
SPAM_LIFETIME_DAYS=2
TRASH_LIFETIME_DAYS=4
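# learnAndFlush takes four arguments:
#   $1 mailbox directory under MAILNAMES_PATH (domain/user)
#   $2 Spam folder, $3 Spam/Learn folder, $4 Trash folder (Maildir names, e.g. .Spam)
function learnAndFlush {
    local sadb="${MAILNAMES_PATH}/$1/.spamassassin"
    echo -e "\n\nLearning new Bayesian data from spam for $1 on $(date)"
    sa-learn --dbpath "$sadb" --spam "${MAILNAMES_PATH}/$1/Maildir/$3/cur/"
    # Flush Spam.Learn (no age limit: everything in it has just been learnt)
    flush "$1" "$3"
    # Flush Spam
    flush "$1" "$2" "${SPAM_LIFETIME_DAYS}"
    # Flush Trash
    flush "$1" "$4" "${TRASH_LIFETIME_DAYS}"
    echo -e "\nLearning new Bayesian data from the last 24hrs of non-spam for $1"
    # Learn as ham every folder's cur directory modified in the last 24 hours,
    # skipping Spam, Trash and the inbox itself. With the folder names used
    # below, the "*$2*" pattern also covers the Spam/Learn folder.
    find "${MAILNAMES_PATH}/$1/Maildir" -mtime -1 -type d -name cur -not -path "*$2*" -not -path "*$4*" -not -path "*/Maildir/cur" -print -exec sa-learn --dbpath "$sadb" --ham {} \;
}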
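function flush {
    # Refuse to run without a folder name: an empty $2 would point find at the inbox.
    if [ -z "$1" ] || [ -z "$2" ]
    then
        echo "flush: missing mailbox or folder argument, nothing deleted" >&2
        return 1
    fi
    echo -e "\nCleaning $2 from ${MAILNAMES_PATH}/$1"
    mtimeArg=""
    if [ "$3" ]
    then
        echo "Only deleting mail older than $3 days"
        mtimeArg="-mtime +$3"
    fi
    # $mtimeArg is left unquoted on purpose so it expands to two words (or none)
    find "${MAILNAMES_PATH}/$1/Maildir/$2/cur" $mtimeArg -type f -exec rm {} \;
}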
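# A bare "su popuser" would start a separate shell and leave the lines below
# running as the original user, so re-run this script as popuser instead.
# (-s supplies a shell in case popuser's login shell is disabled; the script
# must be readable and executable by popuser for this to work.)
if [ "$(id -un)" != "popuser" ]
then
    exec su -s /bin/bash popuser -c "$0"
fi
# Substitute your mail directory here:
learnAndFlush chrisbeach.co.uk/chris .Spam .Spam.Learn .Trash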
4. Place this in your crontab so it runs every day (e.g. at 00:15):
15 0 * * * /var/scripts/dailyMailJobs >> /var/cronjobs/logs/dailyMailJobs.log
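Once the job has run at least once, it can be reassuring to check that the Bayes database is actually growing. A minimal sketch, assuming the same directory layout as the script: sa-learn's --dump magic option prints summary counters for the database, including the number of spam and ham mails learnt so far. The function name bayesStats and the SA_LEARN override variable are my own illustrative additions, not part of the script or of SpamAssassin.

```shell
#!/bin/bash
# Same base path as the daily script; override it if your layout differs.
MAILNAMES_PATH="${MAILNAMES_PATH:-/var/qmail/mailnames}"
# Hypothetical override hook: set SA_LEARN=echo for a dry run that only
# prints the arguments instead of calling sa-learn.
SA_LEARN="${SA_LEARN:-sa-learn}"

# Print the Bayes database summary counters for one mailbox.
# Usage: bayesStats DOMAIN/USER
bayesStats() {
    "$SA_LEARN" --dbpath "${MAILNAMES_PATH}/$1/.spamassassin" --dump magic
}

# Example (the mailbox from the script above):
# bayesStats chrisbeach.co.uk/chris
```

Run it as a user that can read the mailbox's .spamassassin directory. If the spam and ham counts stay the same from one day to the next, check the cron log for errors.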
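This script teaches SpamAssassin that the mails in the Spam/Learn folder are spam, and that mail in your other folders (apart from the inbox, Spam and Trash) that changed in the last 24 hours is non-spam. It also performs some housekeeping: it deletes the Spam/Learn mails once they have been learnt, and deletes old mails from Trash (5 days or older) and Spam (3 days or older). The lifetimes of 4 and 2 days set in the script give these ages because find's -mtime +N only matches files that are more than N whole days old.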
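If you want to try out a lifetime setting before letting the script delete anything, the following sketch shows which files find's -mtime +N test selects. It works on throwaway files in a temporary directory, never on real mail. The function name mtime_demo and the sample file names are made up for the demonstration, and the touch -d syntax assumes GNU coreutils.

```shell
#!/bin/bash
# List which sample files "find -mtime +N" would select.
# Usage: mtime_demo N
mtime_demo() {
    local days=$1 dir age
    dir=$(mktemp -d) || return 1
    for age in 1 2 3 4 5; do
        # Backdate each file by $age days plus one hour (GNU touch syntax).
        touch -d "$(( age * 24 + 1 )) hours ago" "$dir/${age}-days-old"
    done
    # -mtime +N matches files whose age, rounded down to whole days, exceeds N.
    find "$dir" -type f -mtime "+$days" -exec basename {} \; | sort
    rm -rf "$dir"
}

# With the Spam lifetime of 2 from the script:
mtime_demo 2
```

With an argument of 2 this lists 3-days-old, 4-days-old and 5-days-old, and with 4 (the Trash lifetime) only 5-days-old.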
01/02/07 08:12pm



Excellent! This really does work
09:46pm