数据挖掘实战（二手车交易价格预测）数据挖掘实战

赛题数据数据来自某交易平台的二手车交易记录，总数据量超过40w，包含31列变量信息，其中15列为匿名变量。从中抽取15万条作为训练集，5万条作为测试集A，5万条作为测试集B，同时会对name、model、brand和regionCode等信息进行脱敏。
字段表

Field	Description
SaleID	交易ID，唯一编码
name	汽车交易名称，已脱敏
regDate	汽车注册日期，例如20160101，2016年01月01日
model	车型编码，已脱敏
brand	汽车品牌，已脱敏
bodyType	车身类型：豪华轿车：0，微型车：1，厢型车：2，大巴车：3，敞篷车：4，双门汽车：5，商务车：6，搅拌车：7
fuelType	燃油类型：汽油：0，柴油：1，液化石油气：2，天然气：3，混合动力：4，其他：5，电动：6
gearbox	变速箱：手动：0，自动：1
power	发动机功率：范围 [ 0, 600 ]
kilometer	汽车已行驶公里，单位万km
notRepairedDamage	汽车有尚未修复的损坏：是：0，否：1
regionCode	地区编码，已脱敏
seller	销售方：个体：0，非个体：1
offerType	报价类型：提供：0，请求：1
creatDate	汽车上线时间，即开始售卖时间
price	二手车交易价格（预测目标）
v系列特征	匿名特征，包含v0-14在内15个匿名特征

评测标准 MAE(Mean Absolute Error)。
结果提交

SaleID,price 150000,687 150001,1250 150002,2580 150003,1178

1. 导入函数工具箱

import numpy as np import pandas as pd import warnings import matplotlib import matplotlib.pyplot as plt import seaborn as sns from scipy.special import jn from IPython.display import display, clear_output import time warnings.filterwarnings('ignore') %matplotlib inline ## 模型预测的 from sklearn import linear_model from sklearn import preprocessing from sklearn.svm import SVR from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor ## 数据降维处理的 from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCA import lightgbm as lgb import xgboost as xgb ## 参数搜索和评价的 from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split from sklearn.metrics import mean_squared_error, mean_absolute_error

2. 数据读取

## 通过Pandas对于数据进行读取 Train_data = https://www.it610.com/article/pd.read_csv('used_car_train_20200313.csv', sep=' ') TestA_data = https://www.it610.com/article/pd.read_csv('used_car_testA_20200313.csv', sep=' ') ## 输出数据的大小信息 print('Train data shape:',Train_data.shape) print('TestA data shape:',TestA_data.shape)

Train data shape: (150000, 31) TestA data shape: (50000, 30)

2.1 数据简要浏览

## 通过.head() 简要浏览读取数据的形式 Train_data.head()

	SaleID	name	regDate	model	brand	bodyType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
0	0	736	20040402	30.0	6	1.0	0.0	60	12.5	...	0.235676	0.101988	0.129549	0.022816	0.097462	-2.881803	2.804097	-2.420821	0.795292	0.914762
1	1	2262	20030301	40.0	1	2.0	0.0	0	15.0	...	0.264777	0.121004	0.135731	0.026597	0.020582	-4.900482	2.096338	-1.030483	-1.722674	0.245522
2	2	14874	20040403	115.0	15	1.0	0.0	163	12.5	...	0.251410	0.114912	0.165147	0.062173	0.027075	-4.846749	1.803559	1.565330	-0.832687	-0.229963
3	3	71865	19960908	109.0	10	0.0	1.0	193	15.0	...	0.274293	0.110300	0.121964	0.033395	0.000000	-4.509599	1.285940	-0.501868	-2.438353	-0.478699
4	4	111080	20120103	110.0	5	1.0	0.0	68	5.0	...	0.228036	0.073205	0.091880	0.078819	0.121534	-1.896240	0.910783	0.931110	2.834518	1.923482

5 rows × 31 columns
2.2 数据信息查看

## 通过 .info() 简要可以看到对应一些数据列名，以及NAN缺失信息 Train_data.info()

RangeIndex: 150000 entries, 0 to 149999 Data columns (total 31 columns): #ColumnNon-Null CountDtype ---------------------------- 0SaleID150000 non-nullint64 1name150000 non-nullint64 2regDate150000 non-nullint64 3model149999 non-nullfloat64 4brand150000 non-nullint64 5bodyType145494 non-nullfloat64 6fuelType141320 non-nullfloat64 7gearbox144019 non-nullfloat64 8power150000 non-nullint64 9kilometer150000 non-nullfloat64 10notRepairedDamage150000 non-nullobject 11regionCode150000 non-nullint64 12seller150000 non-nullint64 13offerType150000 non-nullint64 14creatDate150000 non-nullint64 15price150000 non-nullint64 16v_0150000 non-nullfloat64 17v_1150000 non-nullfloat64 18v_2150000 non-nullfloat64 19v_3150000 non-nullfloat64 20v_4150000 non-nullfloat64 21v_5150000 non-nullfloat64 22v_6150000 non-nullfloat64 23v_7150000 non-nullfloat64 24v_8150000 non-nullfloat64 25v_9150000 non-nullfloat64 26v_10150000 non-nullfloat64 27v_11150000 non-nullfloat64 28v_12150000 non-nullfloat64 29v_13150000 non-nullfloat64 30v_14150000 non-nullfloat64 dtypes: float64(20), int64(10), object(1) memory usage: 35.5+ MB

## 通过 .columns 查看列名 Train_data.columns

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'], dtype='object')

TestA_data.info() #查看每一列类型和缺失值情况

RangeIndex: 50000 entries, 0 to 49999 Data columns (total 30 columns): #ColumnNon-Null CountDtype ---------------------------- 0SaleID50000 non-nullint64 1name50000 non-nullint64 2regDate50000 non-nullint64 3model50000 non-nullfloat64 4brand50000 non-nullint64 5bodyType48587 non-nullfloat64 6fuelType47107 non-nullfloat64 7gearbox48090 non-nullfloat64 8power50000 non-nullint64 9kilometer50000 non-nullfloat64 10notRepairedDamage50000 non-nullobject 11regionCode50000 non-nullint64 12seller50000 non-nullint64 13offerType50000 non-nullint64 14creatDate50000 non-nullint64 15v_050000 non-nullfloat64 16v_150000 non-nullfloat64 17v_250000 non-nullfloat64 18v_350000 non-nullfloat64 19v_450000 non-nullfloat64 20v_550000 non-nullfloat64 21v_650000 non-nullfloat64 22v_750000 non-nullfloat64 23v_850000 non-nullfloat64 24v_950000 non-nullfloat64 25v_1050000 non-nullfloat64 26v_1150000 non-nullfloat64 27v_1250000 non-nullfloat64 28v_1350000 non-nullfloat64 29v_1450000 non-nullfloat64 dtypes: float64(20), int64(9), object(1) memory usage: 11.4+ MB

2.3 数据统计信息浏览

## 通过 .describe() 可以查看数值特征列的一些统计信息 Train_data.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	150000.000000	150000.000000	1.500000e+05	149999.000000	150000.000000	145494.000000	141320.000000	144019.000000	150000.000000	150000.000000	...	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000	150000.000000
mean	74999.500000	68349.172873	2.003417e+07	47.129021	8.052733	1.792369	0.375842	0.224943	119.316547	12.597160	...	0.248204	0.044923	0.124692	0.058144	0.061996	-0.001000	0.009035	0.004813	0.000313	-0.000688
std	43301.414527	61103.875095	5.364988e+04	49.536040	7.864956	1.760640	0.548677	0.417546	177.168419	3.919576	...	0.045804	0.051743	0.201410	0.029186	0.035692	3.772386	3.286071	2.517478	1.288988	1.038685
min	0.000000	0.000000	1.991000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	0.000000	0.000000	0.000000	-9.168192	-5.558207	-9.639552	-4.153899	-6.546556
25%	37499.750000	11156.000000	1.999091e+07	10.000000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	...	0.243615	0.000038	0.062474	0.035334	0.033930	-3.722303	-1.951543	-1.871846	-1.057789	-0.437034
50%	74999.500000	51638.000000	2.003091e+07	30.000000	6.000000	1.000000	0.000000	0.000000	110.000000	15.000000	...	0.257798	0.000812	0.095866	0.057014	0.058484	1.624076	-0.358053	-0.130753	-0.036245	0.141246
75%	112499.250000	118841.250000	2.007111e+07	66.000000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	...	0.265297	0.102009	0.125243	0.079382	0.087491	2.844357	1.255022	1.776933	0.942813	0.680378
max	149999.000000	196812.000000	2.015121e+07	247.000000	39.000000	7.000000	6.000000	1.000000	19312.000000	15.000000	...	0.291838	0.151420	1.404936	0.160791	0.222787	12.357011	18.819042	13.847792	11.147669	8.658418

8 rows × 30 columns

TestA_data.describe()

	SaleID	name	regDate	model	brand	bodyType	fuelType	gearbox	power	kilometer	...	v_5	v_6	v_7	v_8	v_9	v_10	v_11	v_12	v_13	v_14
count	50000.000000	50000.000000	5.000000e+04	50000.000000	50000.000000	48587.000000	47107.000000	48090.000000	50000.000000	50000.000000	...	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000	50000.000000
mean	174999.500000	68542.223280	2.003393e+07	46.844520	8.056240	1.782185	0.373405	0.224350	119.883620	12.595580	...	0.248669	0.045021	0.122744	0.057997	0.062000	-0.017855	-0.013742	-0.013554	-0.003147	0.001516
std	14433.901067	61052.808133	5.368870e+04	49.469548	7.819477	1.760736	0.546442	0.417158	185.097387	3.908979	...	0.044601	0.051766	0.195972	0.029211	0.035653	3.747985	3.231258	2.515962	1.286597	1.027360
min	150000.000000	0.000000	1.991000e+07	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.500000	...	0.000000	0.000000	0.000000	0.000000	0.000000	-9.160049	-5.411964	-8.916949	-4.123333	-6.112667
25%	162499.750000	11203.500000	1.999091e+07	10.000000	1.000000	0.000000	0.000000	0.000000	75.000000	12.500000	...	0.243762	0.000044	0.062644	0.035084	0.033714	-3.700121	-1.971325	-1.876703	-1.060428	-0.437920
50%	174999.500000	52248.500000	2.003091e+07	29.000000	6.000000	1.000000	0.000000	0.000000	109.000000	15.000000	...	0.257877	0.000815	0.095828	0.057084	0.058764	1.613212	-0.355843	-0.142779	-0.035956	0.138799
75%	187499.250000	118856.500000	2.007110e+07	65.000000	13.000000	3.000000	1.000000	0.000000	150.000000	15.000000	...	0.265328	0.102025	0.125438	0.079077	0.087489	2.832708	1.262914	1.764335	0.941469	0.681163
max	199999.000000	196805.000000	2.015121e+07	246.000000	39.000000	7.000000	6.000000	1.000000	20000.000000	15.000000	...	0.291618	0.153265	1.358813	0.156355	0.214775	12.338872	18.856218	12.950498	5.913273	2.624622

8 rows × 29 columns
3. 特征与标签构建 3.1 提取数值类型特征列名

numerical_cols = Train_data.select_dtypes(exclude = 'object').columns print(numerical_cols)

Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'], dtype='object')

3.2 构建训练和测试样本

## 选择特征列 feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate','price','model','brand','regionCode','seller']] feature_cols = [col for col in feature_cols if 'Type' not in col]## 提前特征列，标签列构造训练样本和测试样本 X_data = https://www.it610.com/article/Train_data[feature_cols] Y_data = Train_data['price']X_test= TestA_data[feature_cols]print('X train shape:',X_data.shape) print('X test shape:',X_test.shape)

X train shape: (150000, 18) X test shape: (50000, 18)

## 定义了一个统计函数，方便后续信息统计 def Sta_inf(data): print('_min',np.min(data)) print('_max:',np.max(data)) print('_mean',np.mean(data)) print('_ptp',np.ptp(data)) print('_std',np.std(data)) print('_var',np.var(data))

3.3 统计标签的基本分布信息

print('Sta of label:') Sta_inf(Y_data)

Sta of label: _min 11 _max: 99999 _mean 5923.327333333334 _ptp 99988 _std 7501.973469876635 _var 56279605.942732885

## 绘制标签的统计图，查看标签分布 plt.hist(Y_data) plt.show() plt.close()

文章图片

3.4 缺省值用-1填补

X_data = https://www.it610.com/article/X_data.fillna(-1) X_test = X_test.fillna(-1)

4. 模型训练与预测 4.1 利用xgb进行五折交叉验证查看模型的参数效果

## xgb-Model xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\ colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'scores_train = [] scores = []## 5折交叉验证方式 sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0) for train_ind,val_ind in sk.split(X_data,Y_data):train_x=X_data.iloc[train_ind].values train_y=Y_data.iloc[train_ind] val_x=X_data.iloc[val_ind].values val_y=Y_data.iloc[val_ind]xgr.fit(train_x,train_y) pred_train_xgb=xgr.predict(train_x) pred_xgb=xgr.predict(val_x)score_train = mean_absolute_error(train_y,pred_train_xgb) scores_train.append(score_train) score = mean_absolute_error(val_y,pred_xgb) scores.append(score)print('Train mae:',np.mean(score_train)) print('Val mae',np.mean(scores))

Train mae: 622.8365678300579 Val mae 714.0856745005866

4.2 定义xgb和lgb模型函数

def build_model_xgb(x_train,y_train): model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\ colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror' model.fit(x_train, y_train) return modeldef build_model_lgb(x_train,y_train): estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150) param_grid = { 'learning_rate': [0.01, 0.05, 0.1, 0.2], } gbm = GridSearchCV(estimator, param_grid) gbm.fit(x_train, y_train) return gbm

4.3 切分数据集（Train,Val）进行模型训练，评价和预测

## Split data with val x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)

print('Train lgb...') model_lgb = build_model_lgb(x_train,y_train) val_lgb = model_lgb.predict(x_val) MAE_lgb = mean_absolute_error(y_val,val_lgb) print('MAE of val with lgb:',MAE_lgb)print('Predict lgb...') model_lgb_pre = build_model_lgb(X_data,Y_data) subA_lgb = model_lgb_pre.predict(X_test) print('Sta of Predict lgb:') Sta_inf(subA_lgb)

Train lgb... MAE of val with lgb: 685.4365823513536 Predict lgb... Sta of Predict lgb: _min -519.1502598641224 _max: 88575.10877210615 _mean 5922.982425989068 _ptp 89094.25903197027 _std 7377.297141258001 _var 54424513.11041347

print('Train xgb...') model_xgb = build_model_xgb(x_train,y_train) val_xgb = model_xgb.predict(x_val) MAE_xgb = mean_absolute_error(y_val,val_xgb) print('MAE of val with xgb:',MAE_xgb)print('Predict xgb...') model_xgb_pre = build_model_xgb(X_data,Y_data) subA_xgb = model_xgb_pre.predict(X_test) print('Sta of Predict xgb:') Sta_inf(subA_xgb)

Train xgb... MAE of val with xgb: 705.3490605572383 Predict xgb... Sta of Predict xgb: _min -90.51186 _max: 88906.555 _mean 5925.287 _ptp 88997.07 _std 7369.0444 _var 54302816.0

4.4 进行两模型的结果加权融合

## 这里我们采取了简单的加权融合的方式 val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb val_Weighted[val_Weighted<0]=10 # 由于我们发现预测的最小值有负数，而真实情况下，price为负是不存在的，由此我们进行对应的后修正 print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))

MAE of val with Weighted ensemble: 680.3431035587981

sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb## 查看预测值的统计进行 plt.hist(Y_data) plt.show() plt.close()

文章图片

4.5 输出结果

sub = pd.DataFrame() sub['SaleID'] = X_test.index sub['price'] = sub_Weighted sub.to_csv('./sub_Weighted.csv',index=False)

sub.head()

	SaleID	price
0	0	39005.192073
1	1	384.560646
2	2	7818.745048
3	3	11812.956312
4	4	566.835466

继续思考与学习为什么删除掉某些列，画图观察某些列和最终价格的关系
价格与saleID（价格与交易名称也可仿照画出）：

# -*- coding: UTF-8 -*- plt.scatter(Train_data.SaleID, Train_data.price) plt.ylabel("price")# 设定纵坐标名称 plt.grid(b=True, which='major', axis='y') plt.title("saleid")

文章图片

可以看到没有明显规律
价格与注册时间（creatDate类似分析）：

# -*- coding: UTF-8 -*- plt.scatter(Train_data.regDate, Train_data.price) plt.ylabel("price")# 设定纵坐标名称 plt.grid(b=True, which='major', axis='y') plt.title("regDate")

文章图片

这里看到越新的车子价格有递增的趋势，是否要删掉该特征值得商榷
价格与销售个体

# -*- coding: UTF-8 -*- plt.scatter(Train_data.seller, Train_data.price) plt.ylabel("price")# 设定纵坐标名称 plt.grid(b=True, which='major', axis='y') plt.title("seller")

文章图片

可以看到个体销售基本占据了所有，而且价格分布也是比较广泛，所以不适合作为特征。
价格与车身类型（燃油类型类似）

# -*- coding: UTF-8 -*- plt.scatter(Train_data.bodyType, Train_data.price) plt.ylabel("price")# 设定纵坐标名称 plt.grid(b=True, which='major', axis='y') plt.title("bodyType")

文章图片

缺失值有没有更好的处理方法
从数据看到缺失值的列有：
model 149999 non-null float64
bodyType 145494 non-null float64
fuelType 141320 non-null float64
gearbox 144019 non-null float64
只有gearbox保留了，看下gearbox的分布：

# -*- coding: UTF-8 -*- plt.scatter(Train_data.gearbox, Train_data.price) plt.ylabel("price")# 设定纵坐标名称 plt.grid(b=True, which='major', axis='y') plt.title("gearbox")

文章图片

这里看到变速箱在（0,1）之间，而缺失值却用-1来补，会不会有问题？

Train_data.price[Train_data.gearbox == 0].plot(kind='kde') Train_data.price[Train_data.gearbox == 1].plot(kind='kde') plt.xlabel("price")# plots an axis lable plt.ylabel("proba") plt.legend(('auto:1', 'manual:0'),loc='best') # sets our legend for our graph.

这里是不同变速箱的价格分布

文章图片

xgb和lgb算法学习
xgboost 学习：提升树（boosting tree）（含公式推导）

交叉验证的使用，用于选择模型的参数
用其他回归方法尝试

Datawhale赛事专题学习资料零基础入门数据挖掘 - 二手车交易价格预测赛事专题分享
ps:供大家参考学习，具体还需要亲身实践
Baseline方案
基本方案介绍，提供方式notebook及视频讲解。
点此直达
【数据挖掘实战（二手车交易价格预测）】从0到1打比赛流程
赛题理解、数据分析、特征工程、模型训练等通用流程进行学习，提供方式notebook或视频。