hands-on-data-analysis Unit 3: Model Building and Evaluation

1. Model Building
1.1. Importing the Required Libraries

import pandas as pd
import numpy as np
# matplotlib.pyplot and seaborn are plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

# display figures inline
%matplotlib inline

plt.rcParams['font.sans-serif'] = ['SimHei']   # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False     # display minus signs correctly
plt.rcParams['figure.figsize'] = (10, 6)       # set the output figure size

1.2. Loading the Dataset
# read the original dataset
train = pd.read_csv('train.csv')
train.shape

Output:
(891, 12)

1.3. Dataset Analysis
train.head()

Output:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

As we can see, this data still needs cleaning. The cleaned dataset looks like this:
# read the cleaned dataset
data = pd.read_csv('clear_data.csv')

data.head()

PassengerId Pclass Age SibSp Parch Fare Sex_female Sex_male Embarked_C Embarked_Q Embarked_S
0 0 3 22.0 1 0 7.2500 0 1 0 0 1
1 1 1 38.0 1 0 71.2833 1 0 1 0 0
2 2 3 26.0 0 0 7.9250 1 0 0 0 1
3 3 1 35.0 1 0 53.1000 1 0 0 0 1
4 4 3 35.0 0 0 8.0500 0 1 0 0 1
data.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Pclass       891 non-null    int64
 2   Age          891 non-null    float64
 3   SibSp        891 non-null    int64
 4   Parch        891 non-null    int64
 5   Fare         891 non-null    float64
 6   Sex_female   891 non-null    int64
 7   Sex_male     891 non-null    int64
 8   Embarked_C   891 non-null    int64
 9   Embarked_Q   891 non-null    int64
 10  Embarked_S   891 non-null    int64
dtypes: float64(2), int64(9)
memory usage: 76.7 KB
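The course provides clear_data.csv ready-made, so the cleaning step itself is not shown here. As a rough, unofficial sketch, a table with these columns could be produced from train.csv along the following lines (median imputation for Age and one-hot encoding of Sex and Embarked are assumptions on my part; the official file may have been built differently, for example in how the two missing Embarked values were handled):

# hypothetical cleaning sketch, not the course's official script
raw = pd.read_csv('train.csv')
cleaned = raw.drop(columns=['Survived', 'Name', 'Ticket', 'Cabin'])  # drop the label and text-heavy columns
cleaned['Age'] = cleaned['Age'].fillna(cleaned['Age'].median())      # fill missing ages
cleaned = pd.get_dummies(cleaned, columns=['Sex', 'Embarked'])       # Sex_female/Sex_male, Embarked_C/Q/S
cleaned.to_csv('my_clear_data.csv', index=False)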

1.4. Model Building
The sklearn algorithm selection path (estimator selection cheat sheet):
[Figure: sklearn algorithm selection path cheat sheet]

Splitting the dataset
# train_test_split is the function used to split the dataset
from sklearn.model_selection import train_test_split

# Usually we extract X and y first and then split; sometimes the unsplit X and y are needed,
# in which case they can be used directly. X is the cleaned data, y is the Survived label we want to predict
X = data
y = train['Survived']

# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# check the shapes
X_train.shape, X_test.shape

Output:
((668, 11), (223, 11))
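Since stratify=y was passed, the proportion of survivors should be roughly the same in the two splits; a quick sanity check:

# survival rate in the full data and in each split (stratification keeps these close)
print(y.mean(), y_train.mean(), y_test.mean())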

X_train.info()

Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 668 entries, 671 to 80
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  668 non-null    int64
 1   Pclass       668 non-null    int64
 2   Age          668 non-null    float64
 3   SibSp        668 non-null    int64
 4   Parch        668 non-null    int64
 5   Fare         668 non-null    float64
 6   Sex_female   668 non-null    int64
 7   Sex_male     668 non-null    int64
 8   Embarked_C   668 non-null    int64
 9   Embarked_Q   668 non-null    int64
 10  Embarked_S   668 non-null    int64
dtypes: float64(2), int64(9)
memory usage: 82.6 KB

X_test.info()

Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 223 entries, 288 to 633
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  223 non-null    int64
 1   Pclass       223 non-null    int64
 2   Age          223 non-null    float64
 3   SibSp        223 non-null    int64
 4   Parch        223 non-null    int64
 5   Fare         223 non-null    float64
 6   Sex_female   223 non-null    int64
 7   Sex_male     223 non-null    int64
 8   Embarked_C   223 non-null    int64
 9   Embarked_Q   223 non-null    int64
 10  Embarked_S   223 non-null    int64
dtypes: float64(2), int64(9)
memory usage: 30.9 KB

1.5. Importing the Models
1.5.1. Logistic regression model with default parameters
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

lr = LogisticRegression()
lr.fit(X_train, y_train)

Output:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

# check the score on the training set and the test set
print("Training set score: {:.2f}".format(lr.score(X_train, y_train)))
print("Testing set score: {:.2f}".format(lr.score(X_test, y_test)))

Training set score: 0.80
Testing set score: 0.79

1.5.2. Logistic regression model with adjusted parameters
lr2 = LogisticRegression(C=100)
lr2.fit(X_train, y_train)

Output:
LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

print("Training set score: :.2f".format(lr2.score(X_train, y_train))) print("Testing set score: :.2f".format(lr2.score(X_test, y_test)))

Output:
Training set score: 0.79
Testing set score: 0.78
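Rather than trying individual values of C by hand, the regularization strength can also be tuned automatically. A minimal sketch using GridSearchCV (the grid of values below is an arbitrary choice, not part of the course code):

from sklearn.model_selection import GridSearchCV

# search a small grid of C values with 5-fold cross-validation on the training set
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)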

1.5.3. Random forest classification model with default parameters
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

Output:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)

print("Training set score: :.2f".format(rfc.score(X_train, y_train))) print("Testing set score: :.2f".format(rfc.score(X_test, y_test)))

Output:
Training set score: 1.00
Testing set score: 0.82

1.5.4. Random forest classification model with adjusted parameters
rfc2 = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc2.fit(X_train, y_train)

Output:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=5, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)

print("Training set score: :.2f".format(rfc2.score(X_train, y_train))) print("Testing set score: :.2f".format(rfc2.score(X_test, y_test)))

Output:
Training set score: 0.87
Testing set score: 0.81

1.6. Model Prediction
In sklearn, supervised models generally provide a predict method that outputs the predicted labels, and a predict_proba method that outputs the label probabilities.
# predicted labels
pred = lr.predict(X_train)

# this gives an array of 0s and 1s
pred[:10]

Output:
array([0, 1, 1, 1, 0, 0, 1, 0, 1, 1])

# predicted label probabilities
pred_proba = lr.predict_proba(X_train)

pred_proba[:10]

Output:
array([[0.60884602, 0.39115398],
       [0.17563455, 0.82436545],
       [0.40454114, 0.59545886],
       [0.1884778 , 0.8115222 ],
       [0.88013064, 0.11986936],
       [0.91411123, 0.08588877],
       [0.13260197, 0.86739803],
       [0.90571178, 0.09428822],
       [0.05273217, 0.94726783],
       [0.10924951, 0.89075049]])
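Each row of pred_proba sums to 1; the first column is the probability of class 0 (did not survive) and the second of class 1 (survived). For a binary problem, predict is equivalent to thresholding the second column at 0.5, which can be checked directly:

# recover the hard labels from the probabilities and compare with lr.predict
labels_from_proba = (pred_proba[:, 1] > 0.5).astype(int)
print((labels_from_proba == pred).all())   # expected: True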

2. Model Evaluation
2.1. Cross-Validation
There are many forms of cross-validation. The first is the simplest and the most obvious: split the data into two parts, a training set and a test set.
However, this simple approach has two drawbacks.
1. The final model and its parameters depend heavily on how you happened to split the data into training and test sets.
2. Only part of the data is used to train the model, so the dataset is not fully exploited.
To address these problems, various improvements were later developed; the one discussed next is K-fold cross-validation:
Each test set now contains more than one observation; how many depends on the choice of K. For example, with K = 5, the steps of 5-fold cross-validation are (a runnable sketch of these steps is given after the figure below):
1. Split the whole dataset into 5 parts.
2. In turn, and without repetition, take one part as the test set, train the model on the remaining 4 parts, and compute the model's MSE on the test set.
3. Average the 5 results to obtain the final MSE.
[Figure: illustration of K-fold cross-validation]
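Before using the built-in helper, here is a minimal sketch of the procedure itself written out with sklearn's KFold (accuracy is used as the score instead of MSE, since this is a classification task):

from sklearn.model_selection import KFold

# manually iterate over the folds: train on K-1 parts, score on the held-out part
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    model = LogisticRegression(C=100, max_iter=1000)
    model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    fold_scores.append(model.score(X_train.iloc[val_idx], y_train.iloc[val_idx]))
print(sum(fold_scores) / len(fold_scores))   # average score over the K folds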

from sklearn.model_selection import cross_val_score

lr = LogisticRegression(C=100)
scores = cross_val_score(lr, X_train, y_train, cv=10)

# k-fold cross-validation scores
scores

Output:
array([0.82089552, 0.74626866, 0.74626866, 0.7761194 , 0.88059701, 0.8358209 , 0.76119403, 0.8358209 , 0.74242424, 0.75757576])

# average cross-validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))

Output:
Average cross-validation score: 0.79

2.2. Confusion Matrix
A confusion matrix is a table that summarizes the results of a classifier. For k-class classification it is simply a k x k table recording the classifier's predictions against the true classes.
[Figure: confusion matrix layout]

The confusion matrix function is in sklearn's sklearn.metrics module.
It takes the true labels and the predicted labels as input.
Precision, recall, and the f-score can be obtained with classification_report.
In practice, the quality of the model can largely be judged from the main diagonal of the confusion matrix.
from sklearn.metrics import confusion_matrix

# train the model
lr = LogisticRegression(C=100)
lr.fit(X_train, y_train)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

# model predictions
pred = lr.predict(X_train)

# confusion matrix
confusion_matrix(y_train, pred)

array([[354,  58],
       [ 83, 173]])

# classification report
from sklearn.metrics import classification_report

# precision, recall, and f1-score
print(classification_report(y_train, pred))

              precision    recall  f1-score   support

           0       0.81      0.86      0.83       412
           1       0.75      0.68      0.71       256

    accuracy                           0.79       668
   macro avg       0.78      0.77      0.77       668
weighted avg       0.79      0.79      0.79       668
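The numbers in the report can be read directly off the confusion matrix above. Taking class 1 (survived) as the positive class, a quick check:

# confusion matrix layout: rows are the true classes, columns the predicted classes
tn, fp, fn, tp = 354, 58, 83, 173
print(tp / (tp + fp))                    # precision for class 1: 173/231 ≈ 0.75
print(tp / (tp + fn))                    # recall for class 1:    173/256 ≈ 0.68
print((tp + tn) / (tn + fp + fn + tp))   # accuracy:              527/668 ≈ 0.79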

2.3. ROC Curve
The ROC curve originated during World War II with radar operators judging radar signals. Each operator's job was to interpret the signals on the radar screen, but radar technology at the time was noisy, so every blip had to be deciphered. Cautious operators tended to interpret every signal as an enemy bomber, while more relaxed operators tended to interpret signals as flocks of birds. A set of evaluation metrics was therefore needed to summarize each operator's predictions and to assess the reliability of the radar, and so the earliest ROC curve analysis was born. Since then, the ROC curve has been widely used in medicine and in machine learning.
ROC stands for Receiver Operating Characteristic Curve.
In sklearn, the ROC curve lives in the sklearn.metrics module.
The larger the area enclosed under the ROC curve, the better the classifier.
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# find the threshold closest to zero
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)

[Figure: ROC curve with the zero-threshold point marked]
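Since the area under the curve summarizes the plot above in a single number, it can be computed directly with roc_auc_score from the same sklearn.metrics module:

from sklearn.metrics import roc_auc_score

# AUC of the ROC curve (1.0 = perfect ranking, 0.5 = random guessing)
auc = roc_auc_score(y_test, lr.decision_function(X_test))
print("AUC: {:.2f}".format(auc))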

3. References
【机器学习】Cross-Validation(交叉验证)详解 - 知乎 (zhihu.com)
https://www.jianshu.com/p/2ca96fce7e81
