Introduction to Machine Learning (11): Cross-Validation with Naive Bayes - 萤火

The code for this series is hosted at: https://github.com/chintsan-code/machine-learning-tutorials
The project used in this post is: bayes_email

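In the previous post we used the improved Bayes classifier to classify text. This post keeps using that classifier, applies it to spam filtering, and introduces hold-out cross-validation.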

First, let's look at our samples:
./email/spam contains 25 spam emails
./email/ham contains 25 normal (ham) emails
Their contents all look something like this:

(Figure: a sample email from the dataset.)

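Splitting these texts on spaces would mostly work, but it would leave punctuation attached to the words. So this time we use the regular-expression module to tokenize the text, and then filter the resulting tokens by length to drop words that are too short to be useful. The code is as follows: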

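import re

def textParse(bigString):
    """Split a big string into a list of lowercase tokens longer than two characters."""
    # Split on runs of non-word characters. Older versions of this code use r'\W*';
    # on Python 3.7+ that pattern also splits on empty matches, so keep r'\W+'.
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]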

For the email shown in the figure above, tokenization gives: ['codeine', '15mg', 'for', '203', 'visa', 'only', 'codeine', 'methylmorphine', 'narcotic', 'opioid', 'pain', 'reliever', 'have', '15mg', '30mg', 'pills', '15mg', 'for', '203', '15mg', 'for', '385', '15mg', 'for', '562', 'visa', 'only']
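
To make the difference between the two splitting strategies concrete, here is a minimal sketch that runs both on a short made-up sentence. The sample string and the variable names are assumptions for illustration only; they are not taken from the dataset or the project.

```python
import re

# Hypothetical sample text, not taken from the email dataset.
sample = "Hello, world! Buy 15mg pills now."

# Splitting on whitespace keeps punctuation attached to the words.
by_space = sample.split()

# Splitting on runs of non-word characters drops the punctuation; the length
# filter then removes the empty string left at the end and any short tokens.
by_regex = [tok.lower() for tok in re.split(r'\W+', sample) if len(tok) > 2]

print(by_space)  # ['Hello,', 'world!', 'Buy', '15mg', 'pills', 'now.']
print(by_regex)  # ['hello', 'world', 'buy', '15mg', 'pills', 'now']
```

Note that the length filter keeps three-character tokens such as 'buy' and 'now'; only tokens of one or two characters are dropped.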

With the dataset ready, we can classify it using the classifier we built earlier. The code is as follows:

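import random
from numpy import array

# createVocabList, trainNB0mprovement and classifyNB are the functions built in the earlier posts of this series
def spamTest():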
    docList = []
    classList = []
    fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i, encoding="unicode_escape").read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i, encoding="unicode_escape").read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)  # create vocabulary
    # randomly pick 10 of the 50 emails as the test set; the remaining 40 form the training set
    trainingSet = list(range(50))
    testSet = []  # create test set
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del (trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:  # train the classifier (get probs) trainNB0
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0mprovement(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:  # classify the remaining items
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))
    # return vocabList,fullText

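In this code, out of the 50 samples in total (25 positive and 25 negative), we randomly pick 10 as the test set and use the remaining 40 as the training set. Randomly selecting one part of the data for training and using the rest for testing is called hold-out cross-validation.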
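
The split itself can be isolated into a small helper. The following is only a sketch of the idea, not the code from the project: the function name `holdout_split`, its parameters, and the use of `random.sample` (instead of the delete-in-a-loop approach in spamTest()) are all assumptions made for illustration.

```python
import random

def holdout_split(n_samples, n_test, seed=None):
    """Return (train_indices, test_indices) for a random hold-out split."""
    rng = random.Random(seed)             # a seed makes the split repeatable
    indices = list(range(n_samples))
    test = rng.sample(indices, n_test)    # n_test distinct indices, no repeats
    test_lookup = set(test)
    train = [i for i in indices if i not in test_lookup]
    return train, test

train_idx, test_idx = holdout_split(50, 10, seed=0)
print(len(train_idx), len(test_idx))  # 40 10
```

Because the test indices are removed from the training indices, no email is ever used for both training and testing, which is the property the hold-out estimate relies on.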

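Previously we treated the presence or absence of each word as a feature, which is known as the set-of-words model. If a word appears more than once in a document, that may carry information that mere presence cannot express; the approach that records the number of occurrences is called the bag-of-words model. In a bag of words each word can appear multiple times, whereas in a set of words each word appears at most once. To support the bag-of-words model, the function setOfWords2Vec() needs a small change; the modified function is called bagOfWords2VecMN().
The set-of-words model corresponds to the Bernoulli model mentioned earlier, and the bag-of-words model corresponds to the multinomial model. In code, the way the vector is updated differs in only one place: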

def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec

def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  # the only difference: count occurrences instead of setting a flag
    return returnVec
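
To see how the two models differ on the same document, here is a small sketch. It deliberately uses `collections.Counter` rather than the loop above, and the helper name `to_vectors`, the vocabulary, and the document are made up for illustration.

```python
from collections import Counter

def to_vectors(vocab, doc):
    """Return (set_of_words_vector, bag_of_words_vector) for one document."""
    counts = Counter(doc)                       # word -> number of occurrences
    bag = [counts[word] for word in vocab]      # bag-of-words: raw counts
    set_vec = [1 if c > 0 else 0 for c in bag]  # set-of-words: presence only
    return set_vec, bag

# Hypothetical vocabulary and tokenized document.
vocab = ['codeine', '15mg', 'visa', 'pills']
doc = ['codeine', '15mg', 'visa', 'codeine', '15mg', '15mg']

set_vec, bag_vec = to_vectors(vocab, doc)
print(set_vec)  # [1, 1, 1, 0]
print(bag_vec)  # [2, 3, 1, 0]
```

The set-of-words vector cannot tell that '15mg' appears three times, while the bag-of-words vector keeps that count.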

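The function spamTest() prints the classification error rate on 10 randomly selected emails. Because these emails are chosen at random, the output may differ from run to run. When a misclassification occurs, the function prints the word list of the misclassified document, so you can see exactly which document went wrong. For a better estimate of the error rate, repeat the whole process several times, say 10, and take the average. For example, one of my runs produced this output: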

classification error ['home', 'based', 'business', 'opportunity', 'knocking', 'your', 'door', 'don', 'rude', 'and', 'let', 'this', 'chance', 'you', 'can', 'earn', 'great', 'income', 'and', 'find', 'your', 'financial', 'life', 'transformed', 'learn', 'more', 'here', 'your', 'success', 'work', 'from', 'home', 'finder', 'experts']
the error rate is:  0.1

That is, 1 of the 10 test emails was misclassified.
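
Averaging over repeated runs can be sketched as follows. As written above, spamTest() only prints its error rate, so this sketch assumes a variant that returns the rate instead. The names `average_error_rate` and `fake_spam_test` are hypothetical, and `fake_spam_test` is only a simulated stand-in so that the example runs on its own; it does not classify any emails.

```python
import random

def average_error_rate(run_once, n_runs=10):
    """Call run_once() n_runs times and return the mean of the returned error rates."""
    rates = [run_once() for _ in range(n_runs)]
    return sum(rates) / len(rates)

# Simulated stand-in for a spamTest() variant that returns its error rate.
# It pretends each of 10 test emails is misclassified with probability 0.1.
_rng = random.Random(0)

def fake_spam_test():
    errors = sum(_rng.random() < 0.1 for _ in range(10))
    return errors / 10

print('average error rate over 10 runs:', average_error_rate(fake_spam_test, n_runs=10))
```

With a real spamTest() that ends in `return float(errorCount) / len(testSet)`, the call would simply be `average_error_rate(spamTest)`.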