Building a Content-Based Spam Filter with Naive Bayes

Getting and Preparing the Data

Download the files

In [1]:
from os import path
spam_url = "http://spamassassin.apache.org/old/publiccorpus/20050311_spam_2.tar.bz2"
ham_url = "http://spamassassin.apache.org/old/publiccorpus/20030228_easy_ham_2.tar.bz2"

spam_archive = path.basename(spam_url)
ham_archive = path.basename(ham_url)
In [2]:
#!rm $ham_archive $spam_archive
In [3]:
### Download ham and spam archives if we don't have them yet
![ ! -f $spam_archive ] && wget $spam_url
![ ! -f $ham_archive ] && wget $ham_url
In [4]:
!ls *.bz2
20030228_easy_ham_2.tar.bz2  20050311_spam_2.tar.bz2
In [5]:
print(ham_archive, spam_archive)
20030228_easy_ham_2.tar.bz2 20050311_spam_2.tar.bz2

Exploring the Archives

In [6]:
!tar -tvf $spam_archive | head -10
drwxrwxr-x jm/jm             0 2005-03-11 23:46 spam_2/
-rw-rw-r-- jm/jm          4721 2003-02-28 10:58 spam_2/00001.317e78fa8ee2f54cd4890fdc09ba8176
-rw-rw-r-- jm/jm          6165 2003-02-28 10:58 spam_2/00002.9438920e9a55591b18e60d1ed37d992b
-rw-rw-r-- jm/jm          6942 2003-02-28 10:58 spam_2/00003.590eff932f8704d8b0fcbe69d023b54d
-rw-rw-r-- jm/jm          7120 2003-02-28 10:58 spam_2/00004.bdcc075fa4beb5157b5dd6cd41d8887b
-rw-rw-r-- jm/jm          4527 2003-02-28 10:58 spam_2/00005.ed0aba4d386c5e62bc737cf3f0ed9589
-rw------- jm/jm         22348 2003-02-28 10:58 spam_2/00006.3ca1f399ccda5d897fecb8c57669a283
-rw------- jm/jm          9360 2003-02-28 10:58 spam_2/00151.6abbf42bc1bfb6c36b749372da0cffae
-rw-rw-r-- jm/jm         12702 2003-02-28 10:58 spam_2/00008.ccf927a6aec028f5472ca7b9db9eee20
-rw-rw-r-- jm/jm          6565 2003-02-28 10:58 spam_2/00009.1e1a8cb4b57532ab38aa23287523659d
tar: write error

Each email is in a separate file (with a cryptic name ;-)

Extract Email Messages from Archives

In [7]:
## .tar(.bz2, etc.) support in Python standard library:
import tarfile

## email parsing, also Python standard library:
import email

def iterate_emails(tar_path):
    """Yield every regular file in the tar archive as a parsed email message."""
    ## the with-blocks close the archive and each member file when done:
    with tarfile.open(tar_path) as tar:
        for info in tar:
            if not info.isfile():
                continue
            with tar.extractfile(info) as f:
                ## parse contents of compressed file into an Email-object:
                yield email.message_from_binary_file(f)

Aside: Python generators

In [8]:
def generator():
    for i in range(4):
        yield i

generator()
Out[8]:
<generator object generator at 0x7f2ce027cfc0>
In [9]:
next(generator())
Out[9]:
0
In [10]:
list(generator())
Out[10]:
[0, 1, 2, 3]
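
A generator function returns a new generator object each time it is called, which is why `next(generator())` always gives 0. The following minimal sketch repeats the `generator` function from above and shows how keeping a reference preserves the position between calls. The use of `itertools.islice` to pick a single item lazily is an addition for illustration, not part of the original notebook.

```python
from itertools import islice

def generator():
    for i in range(4):
        yield i

# Every call creates a fresh generator that starts from the beginning.
first_a = next(generator())
first_b = next(generator())

# Keeping a reference lets us continue where we left off.
gen = generator()
seen = [next(gen), next(gen)]
rest = list(gen)  # consumes whatever is left

# islice skips ahead lazily instead of building the whole list first.
third = next(islice(generator(), 2, 3))

print(first_a, first_b, seen, rest, third)
```

The same `islice` pattern should also work for picking a single message out of a generator of emails without loading the entire archive into a list.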

Back to our Emails:

In [11]:
iterate_emails(spam_archive)
Out[11]:
<generator object iterate_emails at 0x7f2ce027c5c8>
In [12]:
next(iterate_emails(spam_archive))
Out[12]:
<email.message.Message at 0x7f2cd1bb6588>

Headers

In [13]:
msg = list(iterate_emails(spam_archive))[32]
print(msg.as_string()[:1000])
Return-Path: <LifeQuotes104@yahoo.com>
Delivered-To: yyyy@netnoteinc.com
Received: from mail (unknown [211.234.63.154]) by mail.netnoteinc.com
    (Postfix) with ESMTP id E4E57130028 for <jm@netnoteinc.com>;
    Wed, 27 Jun 2001 04:08:46 +0100 (IST)
Received: from sdn-ar-002riprovP318.dialsprint.net_[168.191.126.224]
    (sdn-ar-002riprovp318.dialsprint.net [168.191.126.224]) by mail
    (8.10.1/8.10.1) with SMTP id f5R3juR08320; Wed, 27 Jun 2001 12:46:04 +0900
Received: from Life 300(113.2.2.1) Life1 by
    sdn-ar-002riprovP318.dialsprint.net with ESMTP; Tue, 26 Jun 2001 23:09:24
    -0400
Message-Id: <000034e1158c$00001e19$000071e3@Life 300(113.2.2.1) Life1>
To: <LifeQuotes104@yahoo.com>
From: LifeQuotes104@yahoo.com
Subject: Double Your Life Insurance at NO EXTRA COST! 29155
Date: Tue, 26 Jun 2001 23:09:10 -0400
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Priority: 1
X-Msmail-Priority: High
Reply-To: No_Quotes102@yahoo.com


<html>

<head>
<meta http-equiv=3D"Con

Body

In [14]:
def print_highlight(s):
    """Print email with some Quoted-Printable escape characters highlighted."""
    C = "\x1b["
    HLC = "48;2;252;227;40m"
    s = s.replace("=\n", C + HLC + "=\n" + C + "0m")
    s = s.replace("=3D", C + HLC + "=3D" + C + "0m")
    print(s)
In [15]:
print_highlight(msg.as_string()[900:1300])
 1
X-Msmail-Priority: High
Reply-To: No_Quotes102@yahoo.com


<html>

<head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Diso-8859=
-1">
<title>Lowest Life Insurance Quotes</title>
<style>td {font-family: arial}</style>
</head>

<body bgcolor=3D"#ffffff">
<div STYLE=3D"FONT-FAMILY:TIMES"><b><font face=3D"ARIAL" color=3D"blue" si=
ze=3D"6">

<p align=3D"center">The Lowest Life 

The emails use the RFC 2045 "Quoted-Printable" encoding: among other things, encoded lines are limited to 76 characters, with a trailing `=` marking a soft line break, and literal equals signs are escaped as `=3D`. Python's `email` library can rejoin the wrapped lines and do other MIME decoding.
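
To see the two escapes in isolation, here is a small sketch using the standard library's `quopri` module on a hand-written byte string modelled on the `<meta>` tag above. The sample string is made up for illustration; for real messages, `get_payload(decode=True)` takes care of this step.

```python
import quopri

# "=3D" is an escaped equals sign; "=" directly before the newline is a soft line break.
raw = b'<meta content=3D"text/html; charset=3Diso-8859=\n-1">'

decoded = quopri.decodestring(raw)
print(decoded.decode("ascii"))
```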

In [16]:
print(msg.get_payload(decode=True).decode('utf-8')[:500])
<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Lowest Life Insurance Quotes</title>
<style>td {font-family: arial}</style>
</head>

<body bgcolor="#ffffff">
<div STYLE="FONT-FAMILY:TIMES"><b><font face="ARIAL" color="blue" size="6">

<p align="center">The Lowest Life Insurance Quotes</font></b> <br>
<b><i><font face="arial" color="red" size="5">Without the Hassle!</font></i></b> </p>

<p align="center"><font face="ARIAL" color="black" size="4">Com

Parsing Emails

The email data contains several inconsistencies and edge cases, such as multipart messages, header values that are not plain strings, and missing or incorrect charset declarations. The following parsing function handles them:

In [17]:
def mail_text(msg):
    """Return the headers and all text parts of a message as one string."""
    headers = []
    for k, v in msg.items():
        headers.append(k)
        ## non-str header values would break the join below, so skip them
        if isinstance(v, str):
            headers.append(v)
    text_parts = (p for p in msg.walk()
                  if p.get_content_type().startswith('text'))
    contents = []
    for txt in text_parts:
        charset = txt.get_content_charset()
        try:
            ## decode MIME encoding
            payload = txt.get_payload(decode=True)
            try:
                payload = payload.decode(charset)
            except Exception:
                ## if the charset is missing, unknown or wrong, force UTF-8
                payload = payload.decode('utf-8', 'replace')
            contents.append(payload)
        except Exception:
            ## last resort: keep the payload undecoded
            contents.append(txt.get_payload())
    return " ".join(headers + contents)
In [18]:
msg = next(iterate_emails(spam_archive))
print(mail_text(msg)[:1500])
Return-Path <ilug-admin@linux.ie> Delivered-To yyyy@localhost.netnoteinc.com Received from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD
	for <jm@localhost>; Tue,  6 Aug 2002 06:48:09 -0400 (EDT) Received from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Tue, 06 Aug 2002 11:48:09 +0100 (IST) Received from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g72LqWv13294 for
    <jm-ilug@jmason.org>; Fri, 2 Aug 2002 22:52:32 +0100 Received from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org
    (8.9.3/8.9.3) with ESMTP id WAA31224; Fri, 2 Aug 2002 22:50:17 +0100 Received from bettyjagessar.com (w142.z064000057.nyc-ny.dsl.cnc.net
    [64.0.57.142]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id WAA31201 for
    <ilug@linux.ie>; Fri, 2 Aug 2002 22:50:11 +0100 X-Authentication-Warning lugh.tuatha.org: Host w142.z064000057.nyc-ny.dsl.cnc.net
    [64.0.57.142] claimed to be bettyjagessar.com Received from 64.0.57.142 [202.63.165.34] by bettyjagessar.com
    (SMTPD32-7.06 EVAL) id A42A7FC01F2; Fri, 02 Aug 2002 02:18:18 -0400 Message-Id <1028311679.886@0.57.142> Date Fri, 02 Aug 2002 23:37:59 0530 To ilug@linux.ie From "Start Now" <startnow2002@hotmail.com> MIME-Version 1.0 Content-Type text/plain; charset="US-ASCII"; format=flowed Subject [ILUG] STOP THE MLM INSANITY Sender ilug-admin@linux.ie Errors-To ilug-admin@li
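
The reason `mail_text` relies on `msg.walk()` is that a message can consist of several nested parts. The sketch below builds a small two-part message with `email.message.EmailMessage` and collects its text parts. The message contents are invented for illustration, and the fallback to UTF-8 when no charset is declared is an assumption that mirrors the function above.

```python
from email.message import EmailMessage

# Build a toy multipart message: one plain-text part and one HTML alternative.
msg = EmailMessage()
msg["Subject"] = "Special offer"
msg.set_content("Plain-text body")
msg.add_alternative("<p>HTML body</p>", subtype="html")

# walk() yields the container first, then each nested part.
content_types = [part.get_content_type() for part in msg.walk()]

texts = []
for part in msg.walk():
    if part.get_content_maintype() == "text":
        payload = part.get_payload(decode=True)  # bytes after MIME decoding
        charset = part.get_content_charset() or "utf-8"
        texts.append(payload.decode(charset, "replace").strip())

print(content_types)
print(texts)
```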

Collecting all the Ham and Spam emails

In [19]:
spam = [mail_text(msg) for msg in iterate_emails(spam_archive)]
ham = [mail_text(msg) for msg in iterate_emails(ham_archive)]
In [20]:
print(len(spam), len(ham))
1397 1401

Next Tasks

  • Counting words in Spam and Ham messages
  • Turning words into probabilities
  • Training and testing the classifier
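
As a rough orientation, the three tasks above can be sketched on a toy corpus. This is one possible approach, not a reference solution: the function names, the four toy messages, the add-one (Laplace) smoothing and the equal class prior `p_spam=0.5` are all assumptions made for this example. Log-probabilities are summed instead of multiplying raw probabilities to avoid numerical underflow on long messages.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[\w'$-]+")

def tokenize(text):
    return TOKEN.findall(text.lower())

def train(messages):
    """Count how often each word occurs across a list of message strings."""
    counts = Counter()
    for text in messages:
        counts.update(tokenize(text))
    return counts

def log_likelihood(tokens, counts, vocab_size):
    """Sum of log P(word | class), using add-one smoothing for unseen words."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

def classify(text, spam_counts, ham_counts, p_spam=0.5):
    """Return 'spam' or 'ham', whichever class has the higher log-score."""
    vocab_size = len(set(spam_counts) | set(ham_counts))
    tokens = tokenize(text)
    spam_score = math.log(p_spam) + log_likelihood(tokens, spam_counts, vocab_size)
    ham_score = math.log(1 - p_spam) + log_likelihood(tokens, ham_counts, vocab_size)
    return "spam" if spam_score > ham_score else "ham"

# Toy training data, invented for illustration only.
toy_spam = ["Buy cheap pills today", "Special offer, buy now"]
toy_ham = ["Meeting notes attached", "Lunch today at noon?"]

spam_counts = train(toy_spam)
ham_counts = train(toy_ham)

result_spam = classify("buy this special offer", spam_counts, ham_counts)
result_ham = classify("notes from the meeting", spam_counts, ham_counts)
print(result_spam, result_ham)
```

With the real data, the `spam` and `ham` lists from above would take the place of the toy lists, and a portion of each should be held back for testing rather than used for training.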

Final hints:

  • consider using regular expressions
In [21]:
test_string = """Enjoy this special offer! 
Buy the super-viagra today for only $99!!
The offer is only available today!
"""
In [22]:
import re ## Python standard regex library

token = re.compile(r"[\w'$-]+")
tokens = token.findall(test_string.lower())
print(tokens)
['enjoy', 'this', 'special', 'offer', 'buy', 'the', 'super-viagra', 'today', 'for', 'only', '$99', 'the', 'offer', 'is', 'only', 'available', 'today']
  • the "Counter" class might be useful
In [23]:
from collections import Counter

c = Counter()
print(tokens)
['enjoy', 'this', 'special', 'offer', 'buy', 'the', 'super-viagra', 'today', 'for', 'only', '$99', 'the', 'offer', 'is', 'only', 'available', 'today']
In [24]:
c.update(tokens)
In [25]:
print(c)
Counter({'offer': 2, 'the': 2, 'today': 2, 'only': 2, 'enjoy': 1, 'this': 1, 'special': 1, 'buy': 1, 'super-viagra': 1, 'for': 1, '$99': 1, 'is': 1, 'available': 1})

From here, you're on your own.