False Positive Spam Identification with Python
My email system typically sees hundreds of spam messages per day. SpamAssassin catches most of these, and Apple Mail generally gets the rest. However, I do worry about false positives, even though they seem to be very rare these days. I've set my SpamAssassin threshold relatively low, and procmail moves all spam with scores from 20 down to my threshold into a junk folder. To make sure I don't miss any false positives, I run the Python script below to scan the junk folder for potentially interesting emails.
What's nice about this script is how quick it is to use: I just type junk at the command prompt and see a list of all the messages with scores below 10 (by default), nicely lined up for a very quick scan. If something looks like it might not be junk, I can type junk -p 5, for example, to have a quick look at message number five. If it's not junk, I'll use either the command line or Apple Mail to move the message out of the junk folder. The output ignores any human-readable From name (which is usually forged anyway) and just prints the raw email address. Here's an example:
$ junk
Mail message with spam scores below 10
Msg SL From                              Subject
--- -- --------------------------------- -------
  3  8 PCVZJ@djkjkkkd.com                ****Cheap Cialis****
  0  9 zjhsiyfsx@cashette.com            what will your kids do when you die
  1  9 miachung_rz@worldonline.de        =?ISO-8859-1?B?WW91ciBwYXltZW50cyBjYW4gY
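The SL and From columns in a listing like this come straight from message headers. As a minimal sketch of that extraction, the standard email package is enough. The function names and the sample message below are made up for illustration, and decoding an encoded subject (like the last row above) is an extra that the script itself does not do:

```python
from email import message_from_string
from email.header import decode_header, make_header
from email.utils import parseaddr


def spam_level(msg):
    """SpamAssassin puts one character per point in X-Spam-Level; 0 if absent."""
    return len(msg.get('x-spam-level', ''))


def raw_address(msg):
    """Drop the display name (often forged) and keep only the address."""
    _, addr = parseaddr(msg.get('from', ''))
    return addr


def readable_subject(msg):
    """Decode RFC 2047 encoded words such as =?ISO-8859-1?B?...?=."""
    return str(make_header(decode_header(msg.get('subject', ''))))


# Invented sample message, not taken from a real mailbox.
sample = message_from_string(
    'From: "Your Bank" <someone@example.com>\n'
    'Subject: =?ISO-8859-1?B?SGVsbG8gd29ybGQ=?=\n'
    'X-Spam-Level: ********\n'
    '\n'
    'body\n'
)

print(spam_level(sample), raw_address(sample), readable_subject(sample))
```

Counting the characters of X-Spam-Level rather than parsing the numeric score keeps the comparison against the threshold a plain integer test.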
The script is set up to use Maildirs, but it should be easy to use other mailbox formats as well. Perhaps someday I can add logic to see if the 'Received: from' headers make sense with respect to the From address, but for now this works well enough.
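To give a rough idea of what supporting other mailbox formats could look like: the script below opens its junk folder with mailbox.Maildir, and the standard mailbox module offers classes with the same constructor shape for mbox, MH, Babyl and MMDF. The MAILBOX_CLASSES table and the open_mailbox helper are hypothetical names for this sketch, not part of the script:

```python
import mailbox

# Hypothetical lookup table: format name -> standard-library mailbox class.
MAILBOX_CLASSES = {
    'maildir': mailbox.Maildir,
    'mbox': mailbox.mbox,
    'mh': mailbox.MH,
    'babyl': mailbox.Babyl,
    'mmdf': mailbox.MMDF,
}


def open_mailbox(path, fmt='maildir'):
    """Open an existing mailbox of the given format.

    create=False makes a mistyped path raise mailbox.NoSuchMailboxError
    instead of silently creating an empty mailbox there.
    """
    return MAILBOX_CLASSES[fmt](path, factory=None, create=False)
```

Every one of these classes can be iterated message by message, so the loop that reads the headers would not need to change; only the call that opens the folder would.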
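#!/usr/bin/env python3
import sys, getopt
import mailbox
from email.utils import parseaddr

#
# settings you will need to change
#
maildir = '/home/username/.maildir/.junkmail'
spamtag = '***SPAM***'
defaultSpamLevel = 10
###

spamtaglen = len(spamtag)

def buildList( tlevel ):
    # Collect (number, spam level, from address, subject, message) for
    # every message whose spam level is at or below tlevel.
    tlist = []
    i = 0
    jf = mailbox.Maildir( maildir, factory=None, create=False )
    for msg in jf:
        # SpamAssassin writes one character per point into X-Spam-Level
        l = 0
        if 'x-spam-level' in msg:
            l = len(msg['x-spam-level'])
        if l <= tlevel:
            # keep only the raw address, not the display name
            (tmp,f) = parseaddr(str(msg.get('from', '')))
            s = str(msg.get('subject', ''))
            if s.startswith( spamtag ):
                s = s[spamtaglen+1:]
            tlist.append( (i, l, f, s, msg) )
            i += 1
    return tlist

def printList( tlist, tlevel ):
    # printList is not shown in this post; this version is reconstructed
    # from the sample output above (lowest score first, subject cut at
    # 40 characters), so treat the exact layout as an assumption.
    print("Mail message with spam scores below %d" % tlevel)
    print("%-3s %-2s %-33s %s" % ("Msg", "SL", "From", "Subject"))
    print("%s %s %s %s" % ("-"*3, "-"*2, "-"*33, "-"*7))
    for (i,l,f,s,msg) in sorted(tlist, key=lambda t: t[1]):
        print("%3d %2d %-33s %s" % (i, l, f, s[:40]))

def printMessage( tlist, i ):
    # i is the number shown in the Msg column
    if i < 0 or i >= len(tlist):
        print("no message number %d in the list" % i)
        sys.exit(1)
    (i,l,f,s,msg) = tlist[i]
    print(msg)

def usage():
    print("junk [-s | --spam LEVEL] [-p | --print MSGNUM]")

def main():
    try:
        # a trailing '=' marks long options that take an argument
        opts, args = getopt.getopt(sys.argv[1:],
                                   "s:p:h",
                                   ["spam=", "print=", "help"])
    except getopt.GetoptError:
        usage()
        sys.exit(2)
    printMsg = None
    tlevel = defaultSpamLevel
    for o, v in opts:
        if o in ("-h", "--help"):
            usage()
            sys.exit()
        if o in ("-s", "--spam"):
            print("option: spam level=", v)
            tlevel = int(v)
        if o in ("-p", "--print"):
            printMsg = int(v)
    l = buildList( tlevel )
    if printMsg is not None:
        # use the parsed option value, not a fixed sys.argv position
        printMessage( l, printMsg )
    else:
        printList( l, tlevel )

if __name__ == "__main__":
    main()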
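As for the idea of checking the Received headers against the From address, here is a very rough sketch of one way it might be approached. It is only a heuristic under loud assumptions: it looks at the host name after "from" in each Received header, assumes that host shares a domain suffix with the sender's address, and will misjudge legitimate mail relayed through third-party servers. The function names and the sample messages are invented for illustration:

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Host name that follows "from" at the start of a Received header value.
_RECEIVED_FROM = re.compile(r'^\s*from\s+(\S+)', re.IGNORECASE)


def received_hosts(msg):
    """Return the lower-cased 'from' host of every Received header."""
    hosts = []
    for value in msg.get_all('received', []):
        match = _RECEIVED_FROM.match(str(value))
        if match:
            hosts.append(match.group(1).lower())
    return hosts


def sender_domain_seen(msg):
    """True if some Received host is in the From address's domain."""
    _, addr = parseaddr(str(msg.get('from', '')))
    if '@' not in addr:
        return False
    domain = addr.rpartition('@')[2].lower()
    return any(host == domain or host.endswith('.' + domain)
               for host in received_hosts(msg))


# Invented messages: one relayed by the sender's own domain, one not.
plausible = message_from_string(
    'Received: from mail.example.org (mail.example.org [192.0.2.10])\n'
    '\tby mx.example.net with ESMTP; Mon, 15 Mar 2010 10:00:00 +0000\n'
    'From: Alice <alice@example.org>\n'
    'Subject: hello\n'
    '\n'
    'body\n'
)
suspicious = message_from_string(
    'Received: from unknown.example.com (unknown [198.51.100.7])\n'
    '\tby mx.example.net with ESMTP; Mon, 15 Mar 2010 10:00:00 +0000\n'
    'From: Alice <alice@example.org>\n'
    'Subject: hello\n'
    '\n'
    'body\n'
)

print(sender_domain_seen(plausible), sender_domain_seen(suspicious))
```

A result like this would probably work best as one more column in the listing rather than as a filter, since a mismatch is a hint and not proof of forgery.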
Categories: technology python