Datawhale Intro to Data Mining: Used Car Price Prediction — Feature Engineering

  • 1. Feature engineering from the baseline
    • 1.1 Outlier handling
    • 1.2 Data cleaning
    • 1.3 The label does not follow a normal distribution, so take the log of price
    • 1.4 Feature construction: vehicle usage time, city information extracted from the postal code, etc. (everything the baseline uses; the code is not repeated here)
  • 2. Some ideas of my own (for reference only)
    • 2.1 Building on the baseline's brand-vs-price statistics, I also constructed regionCode-vs-price and model-vs-price statistics
    • 2.2 Building on the baseline's brand-vs-price statistics, I constructed statistics for (brand, regionCode) vs. price, (brand, model) vs. price, and (model, regionCode) vs. price
    • 2.3 For the anonymous features, correlation analysis against price shows that v_3 and v_8 are redundant, so keeping one of them is enough. Since we cannot know what the anonymous features mean in relation to price, I processed a few of the strongly related ones based on earlier feature-importance results. The processing: arithmetic operations between anonymous features.
  • (If you've made it in here, please take a look at the last paragraph!)
    • I hope these features give you some inspiration. I would also like to ask for a working MAE evaluation framework (for log-transformed price). If anyone can share one, I would be very grateful.
1. Feature engineering from the baseline

1.1 Outlier handling
1.2 Data cleaning
1.3 The label does not follow a normal distribution, so take the log of price.
1.4 Feature construction: vehicle usage time, city information extracted from the postal code, etc. (everything the baseline uses is applied; the code is not repeated here).

2. Some ideas of my own (for reference only)

2.1 Building on the baseline's brand-vs-price statistics, I also constructed regionCode-vs-price and model-vs-price statistics:
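The section-1 steps can be sketched briefly. This is a minimal illustration on a toy frame; the column names (`regDate`, `creatDate`, `regionCode`, `price`) follow the competition dataset, but the values here are made up:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data.
df = pd.DataFrame({
    'regDate':    [20040402, 20100301],   # registration date, yyyymmdd
    'creatDate':  [20160404, 20160309],   # listing date, yyyymmdd
    'regionCode': [1046, 4366],
    'price':      [1850, 6222],
})

# 1.3: price is right-skewed, so train on log1p(price)
df['price_log'] = np.log1p(df['price'])

# 1.4: usage time in days; unparseable dates become NaT -> NaN days
df['used_time'] = (pd.to_datetime(df['creatDate'], format='%Y%m%d', errors='coerce')
                   - pd.to_datetime(df['regDate'], format='%Y%m%d', errors='coerce')).dt.days

# 1.4: a coarse "city" feature taken from the leading digit of the region code
df['city'] = df['regionCode'].astype(str).str[:1]
```

The `errors='coerce'` option matters on this dataset because some registration dates are malformed and would otherwise raise.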
```python
# Per-brand price statistics (swap "brand" for "regionCode" or "model"
# to build the analogous feature sets).
train_gb = train_data.groupby("brand")
all_info = {}
for kind, kind_data in train_gb:
    info = {}
    kind_data = kind_data[kind_data['price'] > 0]
    info['brand_amount'] = len(kind_data)
    info['brand_price_max'] = kind_data.price.max()
    info['brand_price_median'] = kind_data.price.median()
    info['brand_price_min'] = kind_data.price.min()
    info['brand_price_sum'] = kind_data.price.sum()
    info['brand_price_std'] = kind_data.price.std()
    info['brand_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
    all_info[kind] = info
brand_fe = pd.DataFrame(all_info).T.reset_index().rename(columns={"index": "brand"})
df = df.merge(brand_fe, how='left', on='brand')
```
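The same statistics can be generated for `regionCode` and `model` without copy-pasting by parameterising the group column. This is a sketch of my own (the helper name is illustrative, and it uses the true mean rather than the baseline's `sum / (n + 1)` smoothing), shown on a toy frame:

```python
import pandas as pd

def group_price_stats(df, col):
    """Aggregate price statistics per value of `col` and merge them back."""
    grouped = df[df['price'] > 0].groupby(col)['price']
    stats = grouped.agg(**{
        f'{col}_amount': 'count',
        f'{col}_price_max': 'max',
        f'{col}_price_median': 'median',
        f'{col}_price_min': 'min',
        f'{col}_price_sum': 'sum',
        f'{col}_price_std': 'std',
        f'{col}_price_average': 'mean',
    }).reset_index()
    return df.merge(stats, how='left', on=col)

# toy example: two brands
df = pd.DataFrame({'brand': [0, 0, 1], 'price': [100, 300, 50]})
df = group_price_stats(df, 'brand')
```

Calling `group_price_stats(df, 'regionCode')` and `group_price_stats(df, 'model')` then yields the other two feature sets described above.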

2.2 Building on the baseline's brand-vs-price statistics, I constructed statistics for (brand, regionCode) vs. price, (brand, model) vs. price, and (model, regionCode) vs. price:
```python
# (regionCode, brand) price statistics
rb_avg_price = df.groupby(['regionCode', 'brand'])['price'].mean().rename('rb_avg_price').reset_index()
rb_max_price = df.groupby(['regionCode', 'brand'])['price'].max().rename('rb_max_price').reset_index()
rb_min_price = df.groupby(['regionCode', 'brand'])['price'].min().rename('rb_min_price').reset_index()
rb_std_price = df.groupby(['regionCode', 'brand'])['price'].std().rename('rb_std_price').reset_index()
rb_med_price = df.groupby(['regionCode', 'brand'])['price'].median().rename('rb_med_price').reset_index()

# (brand, model) price statistics
rm_avg_price = df.groupby(['brand', 'model'])['price'].mean().rename('rm_avg_price').reset_index()
rm_max_price = df.groupby(['brand', 'model'])['price'].max().rename('rm_max_price').reset_index()
rm_min_price = df.groupby(['brand', 'model'])['price'].min().rename('rm_min_price').reset_index()
rm_std_price = df.groupby(['brand', 'model'])['price'].std().rename('rm_std_price').reset_index()
rm_med_price = df.groupby(['brand', 'model'])['price'].median().rename('rm_med_price').reset_index()

# (model, regionCode) price statistics
mr_avg_price = df.groupby(['model', 'regionCode'])['price'].mean().rename('mr_avg_price').reset_index()
mr_max_price = df.groupby(['model', 'regionCode'])['price'].max().rename('mr_max_price').reset_index()
mr_min_price = df.groupby(['model', 'regionCode'])['price'].min().rename('mr_min_price').reset_index()
mr_std_price = df.groupby(['model', 'regionCode'])['price'].std().rename('mr_std_price').reset_index()
mr_med_price = df.groupby(['model', 'regionCode'])['price'].median().rename('mr_med_price').reset_index()

# merge all statistics back onto the main frame, keyed by the columns each was grouped on
df = df.merge(rb_avg_price, on=['regionCode', 'brand'], how='left')
df = df.merge(rb_max_price, on=['regionCode', 'brand'], how='left')
df = df.merge(rb_min_price, on=['regionCode', 'brand'], how='left')
df = df.merge(rb_std_price, on=['regionCode', 'brand'], how='left')
df = df.merge(rb_med_price, on=['regionCode', 'brand'], how='left')

df = df.merge(rm_avg_price, on=['brand', 'model'], how='left')
df = df.merge(rm_max_price, on=['brand', 'model'], how='left')
df = df.merge(rm_min_price, on=['brand', 'model'], how='left')
df = df.merge(rm_std_price, on=['brand', 'model'], how='left')
df = df.merge(rm_med_price, on=['brand', 'model'], how='left')

df = df.merge(mr_avg_price, on=['model', 'regionCode'], how='left')
df = df.merge(mr_max_price, on=['model', 'regionCode'], how='left')
df = df.merge(mr_min_price, on=['model', 'regionCode'], how='left')
df = df.merge(mr_std_price, on=['model', 'regionCode'], how='left')
df = df.merge(mr_med_price, on=['model', 'regionCode'], how='left')
```
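The five separate groupby-then-merge passes per key pair can be collapsed into one `agg` call per pair. This is a sketch of my own (the helper name `pair_price_stats` is illustrative), shown on a toy frame:

```python
import pandas as pd

def pair_price_stats(df, keys, prefix):
    """One agg per key pair instead of five groupby + merge passes."""
    stats = (df.groupby(keys)['price']
               .agg(['mean', 'max', 'min', 'std', 'median'])
               .add_prefix(f'{prefix}_')    # e.g. rb_mean
               .add_suffix('_price')        # e.g. rb_mean_price
               .reset_index())
    return df.merge(stats, on=keys, how='left')

# toy example
df = pd.DataFrame({
    'regionCode': [1, 1, 2],
    'brand':      [0, 0, 0],
    'model':      [5, 5, 5],
    'price':      [100, 200, 400],
})
for keys, prefix in [(['regionCode', 'brand'], 'rb'),
                     (['brand', 'model'], 'rm'),
                     (['model', 'regionCode'], 'mr')]:
    df = pair_price_stats(df, keys, prefix)
```

Besides being shorter, a single `agg` scans each group once instead of five times.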

2.3 For the anonymous features, correlation analysis against price shows that v_3 and v_8 are redundant, so keeping one of them is enough. Since we cannot know what the anonymous features mean in relation to price, I processed a few of the strongly related ones based on earlier feature-importance results. The processing: arithmetic operations between anonymous features.
```python
def get_feature(data):
    # pairwise arithmetic between the strongest anonymous features
    data['v_12+v_8'] = data['v_12'] + data['v_8']
    data['v_12*v_8'] = data['v_12'] * data['v_8']
    data['v_12/v_8'] = data['v_12'] / data['v_8']
    data['v_12-v_8'] = data['v_12'] - data['v_8']

    data['v_12+v_0'] = data['v_12'] + data['v_0']
    data['v_12*v_0'] = data['v_12'] * data['v_0']
    data['v_12/v_0'] = data['v_12'] / data['v_0']
    data['v_12-v_0'] = data['v_12'] - data['v_0']

    data['v_0+v_8'] = data['v_0'] + data['v_8']
    data['v_0*v_8'] = data['v_0'] * data['v_8']
    data['v_0/v_8'] = data['v_0'] / data['v_8']
    data['v_0-v_8'] = data['v_0'] - data['v_8']

    data['v_12+v_8+v_0'] = data['v_12'] + data['v_8'] + data['v_0']
    data['v_12*v_8*v_0'] = data['v_12'] * data['v_8'] * data['v_0']
    data['v_12/v_8/v_0'] = data['v_12'] / data['v_8'] / data['v_0']
    data['v_12-v_8-v_0'] = data['v_12'] - data['v_8'] - data['v_0']
    return data
```
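The v_3/v_8 redundancy claim can be checked with a generic correlation filter. Below is a sketch on synthetic data (the real dataset's anonymous columns would take its place; the 0.95 threshold is my own choice):

```python
import numpy as np
import pandas as pd

# synthetic stand-in: v_8 is built to be nearly collinear with v_3
rng = np.random.default_rng(0)
v3 = rng.normal(size=200)
df = pd.DataFrame({
    'v_3':  v3,
    'v_8':  -0.99 * v3 + 0.01 * rng.normal(size=200),
    'v_12': rng.normal(size=200),
})

# keep only the upper triangle of the absolute correlation matrix,
# then drop one column from every pair correlated above 0.95
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)
```

Using the upper triangle only ensures each redundant pair drops exactly one member rather than both.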

(If you've made it in here, please take a look at this last paragraph!) I hope the features above give you some inspiration. I would also like to ask for a working MAE evaluation framework (for log-transformed price). If anyone can share one, I would be very grateful.
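On the MAE question, one common pattern: when the model is trained on log1p(price), predictions must be mapped back with expm1 before computing MAE, since the competition scores on the original price scale. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def mae_original_scale(y_log_true, y_log_pred):
    """MAE on the original price scale for a model trained on log1p(price)."""
    return float(np.mean(np.abs(np.expm1(y_log_true) - np.expm1(y_log_pred))))

# example: true prices 100 and 1000, predictions 110 and 900
y_true = np.log1p(np.array([100.0, 1000.0]))
y_pred = np.log1p(np.array([110.0, 900.0]))
print(mae_original_scale(y_true, y_pred))  # ~ 55.0
```

Computing MAE directly on the log values would give a very different (and much smaller) number, so the inverse transform is easy to forget but essential.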
