1 Background
1.1 Overview of the k-nearest neighbors algorithm
(1) Introduction to the k-nearest neighbors algorithm

Movie title                   Fight scenes   Kiss scenes   Genre
California Man                3              104           Romance
He's Not Really into Dudes    2              100           Romance
Beautiful Woman               1              81            Romance
Kevin Longblade               101            10            Action
Robo Slayer 3000              99             5             Action
(unknown movie)               18             90            Unknown
Amped II                      98             2             Action
Taking California Man as an example, its Euclidean distance to the unknown movie is:
>>> ((3-18)**2 + (104-90)**2)**(1/2)
20.518284528683193

Movie title                   Distance to the unknown movie
California Man                20.5
He's Not Really into Dudes    18.9
Beautiful Woman               19.2
Kevin Longblade               115.3
Robo Slayer 3000              117.4
Amped II                      118.9
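
The distance table above can be reproduced, and the unknown movie classified, with a few lines of standard-library Python. This is only a minimal sketch: the dictionary layout, the English genre labels and the choice of k = 3 are assumptions made for illustration, not part of the original code.

```python
import math
from collections import Counter

# Training data from the table above: (fight scenes, kiss scenes, genre).
movies = {
    "California Man": (3, 104, "Romance"),
    "He's Not Really into Dudes": (2, 100, "Romance"),
    "Beautiful Woman": (1, 81, "Romance"),
    "Kevin Longblade": (101, 10, "Action"),
    "Robo Slayer 3000": (99, 5, "Action"),
    "Amped II": (98, 2, "Action"),
}
unknown = (18, 90)  # the movie whose genre we want to predict
k = 3               # assumed neighbor count for this illustration

# Euclidean distance from the unknown movie to every labeled movie.
distances = {
    name: math.hypot(fights - unknown[0], kisses - unknown[1])
    for name, (fights, kisses, _) in movies.items()
}

# Keep the k closest movies and let their genres vote.
nearest = sorted(distances, key=distances.get)[:k]
votes = Counter(movies[name][2] for name in nearest)
predicted = votes.most_common(1)[0][0]

for name in nearest:
    print(f"{name}: {distances[name]:.1f}")
print("predicted genre:", predicted)
```

All three nearest neighbors are romance movies, so the vote is unanimous for this sample.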
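1.2 Implementing the k-nearest neighbors algorithm in Python
(1) Compute the distance between the current point and every point in the labeled dataset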
import operator

import numpy as np


def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance from inX to every row of dataSet
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    # Indices of the training points, nearest first
    sortedDistIndicies = distances.argsort()
    # Count the labels of the k nearest points
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Return the label with the most votes
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

>>> group = np.array([[1, 1.1],
...                   [1, 1],
...                   [0, 0],
...                   [0, 0.1]])
>>> labels = ['A', 'A', 'B', 'B']
>>> classify0([0, 0], group, labels, 3)
'B'

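1.3 How to test a classifier
To judge how well a classifier performs, we usually compute its error rate: the number of misclassified samples divided by the total number of samples classified. A perfect classifier has an error rate of 0, and the worst possible classifier has an error rate of 1.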
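
As a small illustration of this definition, the error rate can be computed as follows. The function name `error_rate` and the label lists are hypothetical and exist only for this sketch.

```python
def error_rate(predicted, actual):
    """Return the fraction of predictions that differ from the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one test sample is required")
    errors = sum(p != a for p, a in zip(predicted, actual))
    return errors / len(actual)


# Hypothetical labels, invented only for illustration.
actual = ["A", "A", "B", "B", "A"]
predicted = ["A", "B", "B", "B", "A"]
rate = error_rate(predicted, actual)
print(rate)
```

One of the five predictions is wrong here, so the rate is one fifth; comparing a label list with itself gives 0.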
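2 Using kNN to improve matches from a dating site
2.1 Case description
Our friend Helen uses a dating site to look for dates. The site recommends many different people, but she does not like all of them. The people she has met fall into three categories: people she does not like, people of average charm, and people of great charm. Even after noticing this pattern, Helen still cannot sort the site's recommendations into the right category, so she would like our classifier to help her assign each match to the correct class.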
2.2 Preparing the data
The dataset can be downloaded from either of the following two sources:
def file2matrix(filename):
    # Read the file once; each line holds three tab-separated features and a label
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)  # number of samples in the file
    returnMat = np.zeros((numberOfLines, 3))  # feature matrix to return
    classLabelVector = []  # class labels to return
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = [float(value) for value in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector

array([[4.0920000e+04, 8.3269760e+00, 9.5395200e-01],
       [1.4488000e+04, 7.1534690e+00, 1.6739040e+00],
       [2.6052000e+04, 1.4418710e+00, 8.0512400e-01],
       ...,
       [2.6575000e+04, 1.0650102e+01, 8.6662700e-01],
       [4.8111000e+04, 9.1345280e+00, 7.2804500e-01],
       [4.3757000e+04, 7.8826010e+00, 1.3324460e+00]])


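2.3 Analyzing the data: creating scatter plots with Matplotlib
(1) Scatter plot of the percentage of time spent playing video games against the liters of ice cream consumed per week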
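import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111)
# Columns 0 and 1 of datingDataMat are plotted here; judging by the array printed
# in section 2.2, the plot named in item (1) would use columns 1 and 2 instead.
# Marker size and color both scale with the class label (1, 2 or 3).
ax.scatter(datingDataMat[:, 0], datingDataMat[:, 1],
           s=15.0 * np.array(datingLabels), c=15.0 * np.array(datingLabels))
plt.show()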

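2.4 Preparing the data: normalizing numeric values
When the distance between samples is computed with the Euclidean distance, the frequent flyer miles feature takes very large values, so it carries far more weight in the result than the other two features. Yet it is only one of three equally weighted features and should not dominate the result in this way. For example: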

Sample   % of time playing video games   Frequent flyer miles   Liters of ice cream per week   Class
1        0.8                             400                    0.5                            1
2        12                              134000                 0.9                            3
3        0                               20000                  1.1                            2
4        67                              32000                  0.1                            2
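
To make the imbalance concrete, the following sketch computes the unnormalized distance between samples 3 and 4 from the table above and the share of the squared distance that each feature contributes. The variable names are chosen for illustration only.

```python
import math

# Samples 3 and 4 from the table above:
# (% time playing video games, frequent flyer miles, liters of ice cream)
sample3 = (0.0, 20000.0, 1.1)
sample4 = (67.0, 32000.0, 0.1)

squared_diffs = [(a - b) ** 2 for a, b in zip(sample3, sample4)]
distance = math.sqrt(sum(squared_diffs))

# Share of the squared distance contributed by each feature.
shares = [d / sum(squared_diffs) for d in squared_diffs]

print(f"distance = {distance:.2f}")
for name, share in zip(("games", "miles", "ice cream"), shares):
    print(f"{name}: {share:.6f}")
```

The distance is almost exactly the difference in miles (12000), even though the video game percentages of the two samples differ by 67 points.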
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # per-column minimum
    maxVals = dataSet.max(0)  # per-column maximum
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # newValue = (oldValue - min) / (max - min), applied element-wise
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))
    return normDataSet, ranges, minVals
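
The same min-max scaling, newValue = (oldValue - min) / (max - min), can also be written with NumPy broadcasting instead of `np.tile`. The function below is a sketch of that variant applied to the four samples from the table in this section; the name `auto_norm_broadcast` is an assumption made here and does not appear in the original code.

```python
import numpy as np


def auto_norm_broadcast(data_set):
    """Min-max scale each column to [0, 1] using NumPy broadcasting."""
    min_vals = data_set.min(axis=0)
    ranges = data_set.max(axis=0) - min_vals
    # Broadcasting stretches the 1-D min_vals and ranges across every row,
    # so no explicit np.tile copy is needed.
    return (data_set - min_vals) / ranges, ranges, min_vals


# The four samples from the table in this section.
samples = np.array([
    [0.8, 400.0, 0.5],
    [12.0, 134000.0, 0.9],
    [0.0, 20000.0, 1.1],
    [67.0, 32000.0, 0.1],
])
norm, ranges, min_vals = auto_norm_broadcast(samples)
print(norm)
```

After scaling, every column runs from 0 to 1, so the three features contribute on a comparable scale. Like the `np.tile` version, this sketch divides by zero if a column is constant.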

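2.5 Testing the algorithm: validating the classifier as a complete program
Evaluating accuracy is a very important step for any machine learning algorithm. We usually train the classifier on 90% of the available samples and use the remaining 10% to measure its accuracy. So that the evaluation is not biased by the order of the data, the 10% should be chosen at random.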
def datingClassTest():
    hoRatio = 0.10  # hold out 10% of the data for testing
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    for i in range(numTestVecs):
        # The first numTestVecs rows are test samples; the rest form the training set
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print(errorCount)
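
The function above always uses the first rows of the file as the test set. If the file might be ordered, a random split can be drawn instead. The helper below is a sketch of one way to do this; `split_hold_out`, its `seed` parameter and the sample count of 1000 are assumptions for illustration and are not part of the original program.

```python
import random


def split_hold_out(num_samples, ho_ratio=0.10, seed=0):
    """Return (test_indices, train_indices) chosen at random."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    num_test = int(num_samples * ho_ratio)
    return indices[:num_test], indices[num_test:]


# Hypothetical dataset size, used only to show the resulting split sizes.
test_idx, train_idx = split_hold_out(1000)
print(len(test_idx), len(train_idx))
```

With a NumPy feature matrix, the rows can then be selected as `normMat[test_idx]` and `normMat[train_idx]`; the label list needs a comprehension such as `[datingLabels[i] for i in train_idx]`.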

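2.6 Using the algorithm: building a complete, usable system
Building on the steps above, we can write a program for Helen: she enters the details of someone she found on the dating site, and the program predicts how much she will like that person, as one of three levels: dislike, average charm, or great charm.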
import operator

import numpy as np


def file2matrix(filename):
    # Read the file once; each line holds three tab-separated features and a label
    with open(filename) as fr:
        lines = fr.readlines()
    numberOfLines = len(lines)  # number of samples in the file
    returnMat = np.zeros((numberOfLines, 3))  # feature matrix to return
    classLabelVector = []  # class labels to return
    for index, line in enumerate(lines):
        listFromLine = line.strip().split('\t')
        returnMat[index, :] = [float(value) for value in listFromLine[0:3]]
        classLabelVector.append(int(listFromLine[-1]))
    return returnMat, classLabelVector


def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals


def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances**0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]


def classifyPerson():
    resultList = ["not at all", "in small doses", "in large doses"]
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('knn/datingTestSet2.txt')  # load data set from file
    normMat, ranges, minVals = autoNorm(datingDataMat)
    # Feature order must match the columns of the data file. Judging by the array
    # printed in section 2.2 (miles in column 0), this may need to be
    # [ffMiles, percentTats, iceCream].
    inArr = np.array([percentTats, ffMiles, iceCream])
    # Normalize the input with the training ranges before classifying
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    print("You will probably like this person:", resultList[classifierResult - 1])


if __name__ == "__main__":
    classifyPerson()  # sample inputs: 10, 10000, 0.5

percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person: not at all

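3 Building a handwriting recognition system with kNN
3.1 Case description
The following case uses the classification of the digits 0-9 to outline how the k-nearest neighbors algorithm can be used to recognize handwritten digits.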
python | [Machine Learning in Action, Task 1] Applications of the k-nearest neighbors (kNN) algorithm

1 1 1 1 1 1 1 1 1 1
1 1 1 1 0 0 0 1 1 1
1 1 1 1 0 0 0 1 1 1
1 1 1 1 0 0 1 1 1 1
1 1 1 1 0 0 1 1 1 1
1 1 1 1 0 0 1 1 1 1
1 1 1 1 0 0 1 1 1 1
1 1 1 1 0 0 1 1 1 1
1 1 1 0 0 0 0 1 1 1
1 1 1 1 1 1 1 1 1 1
3.2 Preparing the data: converting images into test vectors
The dataset can be downloaded from either of the following two sources:

def img2vector(filename):
    # Flatten a 32x32 text image of 0/1 characters into a 1x1024 row vector
    returnVect = np.zeros((1, 1024))
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
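
The index arithmetic in `img2vector` is a row-major flattening: the pixel in row i, column j lands at position 32 * i + j. The sketch below shows the same mapping on an in-memory image, without reading a file. The image content (a vertical bar) is invented purely for illustration.

```python
# A hypothetical 32x32 image: 32 strings of 32 '0'/'1' characters,
# with a vertical bar of ones in columns 14 to 17.
rows = ["0" * 14 + "1" * 4 + "0" * 14 for _ in range(32)]

# Row-major flattening: pixel (i, j) ends up at position 32 * i + j.
vector = [int(ch) for line in rows for ch in line]

print(len(vector))           # number of features per image
print(vector[32 * 5 + 14])   # pixel at row 5, column 14
```

Each image therefore becomes one point in a 1024-dimensional feature space, which is the form `classify0` expects.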

3.3 Testing the algorithm: recognizing handwritten digits with kNN
(1) Use listdir to read every file in the trainingDigits directory as training data
from os import listdir


def handwritingClassTest():
    hwLabels = []
    trainingFileList = listdir('trainingDigits')  # load the training set
    m = len(trainingFileList)
    trainingMat = np.zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])  # the digit is the part before '_'
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))

the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 0, the real answer is: 0
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 7, the real answer is: 7
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 5, the real answer is: 5
the classifier came back with: 4, the real answer is: 4
the classifier came back with: 3, the real answer is: 3
the classifier came back with: 3, the real answer is: 3

the total number of errors is: 11

the total error rate is: 0.011628

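4 Summary
4.1 Advantages and disadvantages of the k-nearest neighbors algorithm
(1) Advantages: high accuracy, insensitive to outliers, no assumptions about the input data
4.2 General workflow of the k-nearest neighbors algorithm
(1) Collect data: any method can be used
4.3 Points to note when using the k-nearest neighbors algorithm
(1) When the features are measured on different scales, the data must be normalized; otherwise features with large values will swamp those with small values
5 Reference
《机器学习实战》 (Machine Learning in Action)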
