This article explains how to use Naive Bayes. It works through a small document-classification example by hand, then shows the corresponding Python code and applies it to spam filtering.
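1. Overview
Advantages: still effective when little data is available; can handle multi-class problems.
Disadvantages: sensitive to how the input data is prepared.
Suitable data type: nominal data.
2. Principle
3. Document classification
A, B, C, D, ... are the words in a document. Assume the whole vocabulary consists of only four words, A, B, C and D, and that there are 5 training samples.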
Document | A | B | C | D | Class |
Document 1 | 0 | 0 | 1 | 1 | 0 |
Document 2 | 0 | 1 | 1 | 1 | 0 |
Document 3 | 1 | 0 | 0 | 1 | 1 |
Document 4 | 1 | 1 | 0 | 0 | 1 |
Document 5 | 1 | 1 | 1 | 0 | 1 |
Test document | 1 | 0 | 1 | 0 | ? |
Classes: C0, C1
Test document: W
Goal: max{P(C0|W), P(C1|W)} ===> max{log[P(C0|W)], log[P(C1|W)]}
P(C0|W) = P(W|C0) * P(C0) / P(W)
P(C0) = 2 / 5 ==> 2 documents of class 0 and 3 documents of class 1
P(W|C0) = P(A*B*C*D|C0) ==> Naive Bayes ==> P(A|C0) * P(B|C0) * P(C|C0) * P(D|C0)
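The factorization above can be written more compactly. As a sketch, assume x_i ∈ {0, 1} marks whether the i-th vocabulary word w_i occurs in W, and n is the vocabulary size (the symbols x_i, w_i and n are introduced here for illustration and do not appear in the original). Words that are absent from W then contribute a factor of 1, which is the form the test-sample calculation relies on:

```latex
P(c \mid W) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)^{x_i}
\quad\Longrightarrow\quad
\log P(c \mid W) \;=\; \log P(c) + \sum_{i=1}^{n} x_i \log P(w_i \mid c) - \log P(W)
```

Since the term log P(W) is the same for every class, it can be ignored when comparing C0 with C1.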
P(A|C0) = (0 + 0)/(0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 0/5 = 0 ==> occurrences of A in class-0 documents / total word count of class-0 documents
P(B|C0) = (0 + 1)/(0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 1/5 ==> occurrences of B in class-0 documents / total word count of class-0 documents
P(C|C0) = (1 + 1)/(0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 2/5 ==> occurrences of C in class-0 documents / total word count of class-0 documents
P(D|C0) = (1 + 1)/(0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 2/5 ==> occurrences of D in class-0 documents / total word count of class-0 documents
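The four ratios above can be reproduced mechanically from the training table. The following is a minimal sketch in plain Python; the names `docs`, `labels` and `word_given_class` are chosen here for illustration and are not part of the original code.

```python
# Training table from the text: columns are the words A, B, C, D.
docs = [
    [0, 0, 1, 1],  # document 1
    [0, 1, 1, 1],  # document 2
    [1, 0, 0, 1],  # document 3
    [1, 1, 0, 0],  # document 4
    [1, 1, 1, 0],  # document 5
]
labels = [0, 0, 1, 1, 1]  # class of each document


def word_given_class(docs, labels, label):
    """Return P(word | class): word count / total word count in that class."""
    rows = [d for d, c in zip(docs, labels) if c == label]
    counts = [sum(col) for col in zip(*rows)]  # per-word counts within the class
    total = sum(counts)                        # all word occurrences in the class
    return [n / total for n in counts]


p_w_c0 = word_given_class(docs, labels, 0)  # 0/5, 1/5, 2/5, 2/5
p_w_c1 = word_given_class(docs, labels, 1)  # 3/7, 2/7, 1/7, 1/7
prior_c0 = labels.count(0) / len(labels)    # 2/5

print(p_w_c0, p_w_c1, prior_c0)
```

The zero in the first position of `p_w_c0` is the value that causes trouble once logarithms are taken.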
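Because multiplying the factors can drive the product to 0 (0 * ... ==> 0), take the log and add instead of multiplying:
log[P(W|C0) * P(C0)] = log[P(A|C0) * P(B|C0) * P(C|C0) * P(D|C0) * P(C0)]
= log[P(A|C0)] + log[P(B|C0)] + log[P(C|C0)] + log[P(D|C0)] + log[P(C0)]
log[P(W|C1) * P(C1)] is computed in the same way.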
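One practical reason for summing logs rather than multiplying probabilities is floating-point underflow. The sketch below uses 1000 made-up likelihood values of 0.01 (an assumption for illustration only, not data from the article) to show the product collapsing to zero while the log sum stays finite.

```python
import math

# 1000 hypothetical word likelihoods, each equal to 0.01.
probs = [0.01] * 1000

product = 1.0
for p in probs:
    product *= p  # underflows to 0.0 well before the loop ends

log_sum = sum(math.log(p) for p in probs)  # remains a finite negative number

print(product)
print(log_sum)
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the original products would.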
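Test sample (A=1, B=0, C=1, D=0; natural logs, ignoring the common term log[P(W)]):
log[P(C0|W)] = 1 * log(0/5) + 0 * log(1/5) + 1 * log(2/5) + 0 * log(2/5) + log(2/5) = -∞, because log(0) is undefined; this is why the counts are later initialized to 1 instead of 0
log[P(C1|W)] = 1 * log(3/7) + 0 * log(2/7) + 1 * log(1/7) + 0 * log(1/7) + log(1 - 2/5) ≈ -3.304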
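The same comparison can be carried out in code. This is a sketch under the smoothing used later in the article (every word count starts at 1 and every denominator at 2.0), so no log(0) occurs; the helper name `log_score` and the variable names are illustrative assumptions.

```python
import math

# Training table from the text: columns are the words A, B, C, D.
docs = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
labels = [0, 0, 1, 1, 1]
test_doc = [1, 0, 1, 0]  # the test document W


def log_score(docs, labels, label, x):
    """log P(c) + sum_i x_i * log P(w_i | c), using smoothed counts."""
    rows = [d for d, c in zip(docs, labels) if c == label]
    counts = [1 + sum(col) for col in zip(*rows)]  # counts start at 1
    denom = 2.0 + sum(sum(r) for r in rows)        # denominator starts at 2.0
    prior = len(rows) / len(docs)
    return math.log(prior) + sum(
        xi * math.log(n / denom) for xi, n in zip(x, counts)
    )


score0 = log_score(docs, labels, 0, test_doc)
score1 = log_score(docs, labels, 1, test_doc)
predicted = 1 if score1 > score0 else 0

print(score0, score1, predicted)
```

With these smoothed estimates the test document is assigned to class 1, the same class favored by the unsmoothed hand calculation.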
# -*- coding: UTF-8 -*-
from numpy import *

'''
1. Bernoulli model ==> ignores how many times a word occurs in a document and only
   records whether it occurs; all words are assumed to carry equal weight
2. Multinomial model
'''

def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # class label of each post (pAbusive below is P(class 1))
    return postingList, classVec

def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # union with the words of this document
    return list(vocabSet)

'''
vocabList = ['', '', ...]
inputSet = ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']
'''
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  # mark presence only, not the count
        else:
            print('the word: %s is not in my vocabulary!' % word)
    return returnVec

'''
P(c|w) = P(w|c) * P(c) / P(w)
1. P(c)
2. P(w|c)

trainMatrix:
A B C D category
0 0 1 1 0
0 1 1 1 0
1 0 0 1 1
1 1 0 0 1
1 1 1 0 1
1 0 1 0 ?

trainCategory ===> [0,0,1,1,1]   vector of class labels
numTrainDocs = 5 => 5 documents
numWords = 4 => 4 features
pAbusive = (0 + 0 + 1 + 1 + 1) / 5 = 3/5 ==> prior probability P(C1); P(C0) = 1 - 3/5 = 2/5

p0Num = [0,0,0,0]   p1Num = [0,0,0,0]
p0Denom = 0.0       p1Denom = 0.0
0 0 1 1 0 ===> p0Num=[0,0,1,1] p0Denom=2
0 1 1 1 0 ===> p0Num=[0,1,2,2] p0Denom=5
1 0 0 1 1 ===> p1Num=[1,0,0,1] p1Denom=2
1 1 0 0 1 ===> p1Num=[2,1,0,1] p1Denom=4
1 1 1 0 1 ===> p1Num=[3,2,1,1] p1Denom=7

P(C0|W) = P(W|C0) * P(C0) / P(W) = P(A*B*C*D|C0) * P(C0) / P(W)
        = P(A|C0) * P(B|C0) * P(C|C0) * P(D|C0) * P(C0) / P(W)
P(C1|W) = P(W|C1) * P(C1) / P(W) = P(A*B*C*D|C1) * P(C1) / P(W)
        = P(A|C1) * P(B|C1) * P(C|C1) * P(D|C1) * P(C1) / P(W)
P(W) ==> identical for both classes, so it does not need to be computed
max{P(C0|W),P(C1|W)} ===> max{Log[P(C0|W)],Log[P(C1|W)]}

Log[P(C0|W)] = Log[P(A|C0)] + Log[P(B|C0)] + Log[P(C|C0)] + Log[P(D|C0)] + Log[P(C0)]
P(A|C0) = 0/(0+1+2+2) = 0/5
P(B|C0) = 1/(0+1+2+2) = 1/5
P(C|C0) = 2/(0+1+2+2) = 2/5
P(D|C0) = 2/(0+1+2+2) = 2/5
Log[P(C1|W)] = Log[P(A|C1)] + Log[P(B|C1)] + Log[P(C|C1)] + Log[P(D|C1)] + Log[P(C1)]
P(A|C1) = 3/(3+2+1+1) = 3/7
P(B|C1) = 2/(3+2+1+1) = 2/7
P(C|C1) = 1/(3+2+1+1) = 1/7
P(D|C1) = 1/(3+2+1+1) = 1/7

Test sample: 1 0 1 0 ?
Log[P(C0|W)] = 1 * Log[0/5] + 0 * Log[1/5] + 1 * Log[2/5] + 0 * Log[2/5] + Log[2/5]
Log[P(C1|W)] = 1 * Log[3/7] + 0 * Log[2/7] + 1 * Log[1/7] + 0 * Log[1/7] + Log[1 - 2/5]

Note the Log[0] term ==> so the counters are initialized as
p0Num = [1,1,1,1]   p1Num = [1,1,1,1]
p0Denom = 2.0       p1Denom = 2.0
'''
def trainNB0(trainMatrix, trainCategory):
    # Unsmoothed version: yields log(0) = -inf for words never seen in a class.
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior P(class 1)
    p0Num = zeros(numWords)
    p1Num = zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vec = log(p1Num / p1Denom)
    p0Vec = log(p0Num / p0Denom)
    return p0Vec, p1Vec, pAbusive

def trainNB1(trainMatrix, trainCategory):
    # Smoothed version: counts start at 1 and denominators at 2.0, so log(0) never occurs.
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # prior P(class 1)
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vec = log(p1Num / p1Denom)
    p0Vec = log(p0Num / p0Denom)
    return p0Vec, p1Vec, pAbusive

def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Sum the log-likelihoods of the words that are present, then add the log prior.
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    for postingDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postingDoc))
    # Use the smoothed trainer; trainNB0 would produce -inf entries and nan scores.
    p0V, p1V, pAb = trainNB1(trainMat, listClasses)
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, ' classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))
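The opening comment of the code names a multinomial model, but the functions shown only record whether a word is present. As a rough sketch of what a count-based vector could look like, the function below tallies occurrences instead of setting a flag; `bag_of_words_to_vec` is a hypothetical name introduced here and is not part of the original code.

```python
def bag_of_words_to_vec(vocab_list, input_words):
    """Count how many times each vocabulary word occurs in input_words."""
    index = {word: i for i, word in enumerate(vocab_list)}  # word -> position
    vec = [0] * len(vocab_list)
    for word in input_words:
        if word in index:
            vec[index[word]] += 1  # accumulate counts instead of setting 1
    return vec


vocab = ['dog', 'my', 'stupid']
print(bag_of_words_to_vec(vocab, ['my', 'dog', 'my', 'cat']))  # [1, 2, 0]
```

Words outside the vocabulary, such as 'cat' here, are simply skipped.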
4. Filtering spam email
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)  # split on runs of non-word characters
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]  # drop tokens of length <= 2

def spamTest():
    import random
    docList = []
    classList = []
    #fullText = []
    for i in range(1, 26):  # read every email and collect its words
        wordList = textParse(open('emial/spam/%d.txt' % i).read())
        docList.append(wordList)
        #fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('emial/ham/%d.txt' % i).read())
        docList.append(wordList)
        #fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    trainSet = list(range(50))  # a list, so that indices can be deleted
    testSet = []
    for i in range(10):  # hold out 10 randomly chosen documents for testing
        randIndex = random.randrange(len(trainSet))
        testSet.append(trainSet[randIndex])
        del(trainSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    # Smoothed trainer, so unseen words do not produce log(0).
    p0V, p1V, pSpam = trainNB1(trainMat, trainClasses)
    errorCount = 0
    for docIndex in testSet:
        wordVector = array(setOfWords2Vec(vocabList, docList[docIndex]))
        if classifyNB(wordVector, p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print('classification error', docList[docIndex])
    print('the error rate is: ', float(errorCount) / len(testSet))
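To see what the tokenization step produces, here is a small standalone sketch. The function name `tokenize` and the sample sentence are made up for illustration. It splits with r'\W+'; on Python 3.7 and later, a pattern that can match the empty string, such as r'\W*', also splits between individual characters, which would leave no tokens longer than two characters.

```python
import re


def tokenize(text):
    """Lower-case tokens longer than two characters, split on non-word runs."""
    tokens = re.split(r'\W+', text)
    return [tok.lower() for tok in tokens if len(tok) > 2]


sample = 'Hi there, this is a FREE offer!!'
print(tokenize(sample))  # ['there', 'this', 'free', 'offer']
```

Short tokens such as 'Hi', 'is' and 'a' are dropped by the length filter, along with the empty string left after the trailing punctuation.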