Python | Feature Engineering: Handling Categorical Variables

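Categorical variables represent categories or labels. Unlike numeric variables, the values of a categorical variable cannot be ordered, which is why they are also called nominal (unordered) variables.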
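One-hot encoding

One-hot encoding is typically used for features whose categories have no ordinal relationship. It represents the categories with a group of bits, where each bit is one feature. A categorical variable with k possible categories is therefore encoded as a feature vector of length k. If the variable cannot belong to more than one category at a time, exactly one bit in the group is "on".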
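To make the bit-vector idea concrete, here is a minimal pure-Python sketch. The helper name one_hot_encode is hypothetical and not part of any library; it sorts the categories only to get a reproducible column order.

```python
def one_hot_encode(values):
    """Map each value to a list of k bits, where k is the number of distinct values."""
    categories = sorted(set(values))                  # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for v in values:
        bits = [0] * len(categories)
        bits[index[v]] = 1                            # exactly one bit is "on"
        encoded.append(bits)
    return categories, encoded


categories, encoded = one_hot_encode(['SF', 'NYC', 'Seattle', 'SF'])
print(categories)  # ['NYC', 'SF', 'Seattle']
print(encoded)     # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each row sums to 1, which is the "only one bit on" property described above.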
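Pros and cons of one-hot encoding:

Pros: one-hot encoding solves the problem that classifiers handle categorical attributes poorly, and to some extent it also expands the feature set. Its values are only 0 and 1, and each category occupies its own orthogonal dimension.
Cons: when the number of categories is large, the feature space becomes very large. In that case PCA can generally be used to reduce the dimensionality, and the combination of one-hot encoding and PCA is quite useful in practice. Storing the encoding as sparse vectors saves space, and feature selection can reduce the dimensionality further.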
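The one-hot + PCA combination can be sketched with NumPy alone. The toy labels and the choice of two components are assumptions for illustration; in practice a library implementation such as scikit-learn's PCA would be the usual choice.

```python
import numpy as np

# Hypothetical toy data: 6 samples drawn from k = 4 categories.
labels = np.array([0, 1, 2, 3, 0, 1])
k = 4

# One-hot encode: row i of the identity matrix is the code for category i.
one_hot = np.eye(k)[labels]                      # shape (6, 4)

# PCA via SVD of the mean-centered matrix.
centered = one_hot - one_hot.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

n_components = 2                                 # assumed target dimensionality
reduced = centered @ vt[:n_components].T         # shape (6, 2)

# One-hot rows always sum to 1, so after centering the columns are linearly
# dependent and at most k - 1 singular values are nonzero.
rank = int(np.sum(singular_values > 1e-10))
print(reduced.shape, rank)
```

The rank check shows the redundant degree of freedom that one-hot encoding carries.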

import pandas as pd
from sklearn import linear_model

df = pd.DataFrame({'city':['SF','SF','SF','NYC','NYC','NYC','Seattle','Seattle','Seattle'], 'Rent':[3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]})

df['Rent'].mean()

3333.3333333333335

# Convert the categorical variable to a one-hot encoding and fit a linear regression model
one_hot_df = pd.get_dummies(df, prefix=['city'])
one_hot_df

	Rent	city_NYC	city_SF	city_Seattle
0	3999	0	1	0
1	4000	0	1	0
2	4001	0	1	0
3	3499	1	0	0
4	3500	1	0	0
5	3501	1	0	0
6	2499	0	0	1
7	2500	0	0	1
8	2501	0	0	1

model = linear_model.LinearRegression()
model.fit(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']], one_hot_df['Rent'])
model.coef_  # coefficients of the linear regression model

array([ 166.66666667,  666.66666667, -833.33333333])

model.intercept_  # intercept of the linear regression model

3333.3333333333335

model.score(one_hot_df[['city_NYC', 'city_SF', 'city_Seattle']], one_hot_df['Rent'])  # goodness of fit (R^2) of the model

0.9999982857172245

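With one-hot encoding, the intercept represents the overall mean of the target variable Rent, and each coefficient represents how far the mean Rent of the corresponding city lies from that overall mean.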
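This reading of the coefficients can be checked without fitting a model, by computing the group means directly. The sketch below uses only the standard library and the same nine rows as the example DataFrame.

```python
from statistics import mean

cities = ['SF'] * 3 + ['NYC'] * 3 + ['Seattle'] * 3
rents = [3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501]

# Overall mean of the target: this is what the intercept reports.
overall_mean = mean(rents)

# Mean rent per city.
city_means = {}
for city in sorted(set(cities)):
    city_means[city] = mean(r for c, r in zip(cities, rents) if c == city)

# Deviation of each city mean from the overall mean: these are the coefficients.
deviations = {city: m - overall_mean for city, m in city_means.items()}

print(round(overall_mean, 2))                           # 3333.33
print({c: round(d, 2) for c, d in deviations.items()})  # {'NYC': 166.67, 'SF': 666.67, 'Seattle': -833.33}
```

The deviations match the fitted coefficients for city_NYC, city_SF, and city_Seattle.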
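Dummy coding

Dummy coding uses only k-1 features in its representation, which removes the extra degree of freedom. The category that gets no feature of its own is represented by the all-zeros vector and is called the reference category. Both dummy coding and one-hot encoding can be produced with pandas.get_dummies.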

# Train a linear regression model with dummy coding; the drop_first flag produces the dummy coding
dummy_df = pd.get_dummies(df, prefix=['city'], drop_first=True)
dummy_df

	Rent	city_SF	city_Seattle
0	3999	1	0
1	4000	1	0
2	4001	1	0
3	3499	0	0
4	3500	0	0
5	3501	0	0
6	2499	0	1
7	2500	0	1
8	2501	0	1

model.fit(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])
model.coef_

array([500., -1000.])

model.intercept_

3500.0

model.score(dummy_df[['city_SF', 'city_Seattle']], dummy_df['Rent'])

0.9999982857172245

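With dummy coding, the bias term (the intercept) represents the mean of the response variable y for the reference category, which in this example is city_NYC. The coefficient of the i-th feature equals the difference between the mean of the i-th category and the mean of the reference category.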
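The same numbers fall out of an ordinary least-squares solve on the dummy-coded design matrix. This is a minimal NumPy sketch rather than the scikit-learn fit used in the article; the variable names are chosen here for illustration.

```python
import numpy as np

rent = np.array([3999, 4000, 4001, 3499, 3500, 3501, 2499, 2500, 2501], dtype=float)
city_sf = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)
city_seattle = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=float)

# Design matrix: a column of ones for the intercept plus the k - 1 dummy columns.
# NYC is the reference category, so its rows are all zeros in the dummy columns.
X = np.column_stack([np.ones_like(rent), city_sf, city_seattle])

params, *_ = np.linalg.lstsq(X, rent, rcond=None)
intercept, coef_sf, coef_seattle = params

print(np.round(params, 6))
```

The intercept comes out as the NYC mean (3500), and the two coefficients as the SF and Seattle means minus the NYC mean (500 and -1000).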
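Effect coding

Effect coding is very similar to dummy coding. The difference is that the reference category is represented by a vector consisting entirely of -1s.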

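effect_df = dummy_df.copy()
# Cast the indicator columns to float so they can hold -1.0
effect_df[['city_SF', 'city_Seattle']] = effect_df[['city_SF', 'city_Seattle']].astype(float)
# Rows 3 to 5 are the reference category (NYC): set every indicator to -1
effect_df.loc[3:5, ['city_SF', 'city_Seattle']] = -1.0
effect_df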

	Rent	city_SF	city_Seattle
0	3999	1.0	0.0
1	4000	1.0	0.0
2	4001	1.0	0.0
3	3499	-1.0	-1.0
4	3500	-1.0	-1.0
5	3501	-1.0	-1.0
6	2499	0.0	1.0
7	2500	0.0	1.0
8	2501	0.0	1.0

model.fit(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

model.coef_

array([ 666.66666667, -833.33333333])

model.intercept_

3333.3333333333335

model.score(effect_df[['city_SF', 'city_Seattle']], effect_df['Rent'])

0.9999982857172245
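With effect coding the intercept is again the overall mean, and each coefficient is that category's deviation from it. The reference category has no coefficient of its own, but because the effects sum to zero it can be recovered as minus the sum of the others. The sketch below plugs in the fitted values printed above; the predict helper is hypothetical.

```python
# Fitted values copied from the effect-coding output above.
intercept = 3333.3333333333335
coef = {'city_SF': 666.66666667, 'city_Seattle': -833.33333333}

# The effects sum to zero, so the reference (NYC) effect is minus the sum of the others.
reference_effect = -sum(coef.values())


def predict(code):
    """Linear prediction for one effect-coded row."""
    return intercept + sum(coef[name] * value for name, value in code.items())


codes = {
    'SF': {'city_SF': 1, 'city_Seattle': 0},
    'Seattle': {'city_SF': 0, 'city_Seattle': 1},
    'NYC': {'city_SF': -1, 'city_Seattle': -1},   # reference category: all -1
}
predicted = {city: predict(code) for city, code in codes.items()}

print(round(reference_effect, 2))                      # 166.67
print({c: round(p, 2) for c, p in predicted.items()})  # {'SF': 4000.0, 'Seattle': 2500.0, 'NYC': 3500.0}
```

The recovered NYC effect equals the city_NYC coefficient from the one-hot fit, and the predictions reproduce the three city means.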

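Handling large categorical variables

Feature hashing

A hash function is a deterministic function that maps a potentially unbounded integer to a finite integer range [1, m].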
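As a rough illustration of the idea, the sketch below hashes string IDs into m buckets using zlib.crc32 from the standard library. The helper names are hypothetical, CRC32 is only a stand-in for whatever hash a real library uses, and the buckets are numbered 0 to m-1 rather than 1 to m to match Python indexing.

```python
import zlib


def hash_bucket(value, m):
    """Deterministically map a string to a bucket index in [0, m)."""
    return zlib.crc32(value.encode('utf-8')) % m


def hash_encode(value, m):
    """Return a length-m indicator vector with the hashed bucket set to 1."""
    vec = [0] * m
    vec[hash_bucket(value, m)] = 1
    return vec


m = 8  # assumed number of buckets, far smaller than the number of possible IDs
ids = ['9yKzy9PApeiPPOUJEtnvkg', 'ZRJwVLyzEJq1VAihDhYiow', '6oRAC4uyJCsJl1X0WZpVSA']
buckets = [hash_bucket(i, m) for i in ids]
vectors = [hash_encode(i, m) for i in ids]
print(buckets)
```

Because m is fixed in advance, the vector length does not grow with the number of categories. The price is that distinct IDs can collide in the same bucket.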

import pandas as pd
import json

# Read the first 10,000 lines of the review file, one JSON record per line
js = []
with open('yelp_academic_dataset_review.json') as f:
    for i in range(10000):
        js.append(json.loads(f.readline()))

review_df = pd.DataFrame(js)

# Define m as the number of unique business_id values
m = len(review_df.business_id.unique())

m

4174

from sklearn.feature_extraction import FeatureHasher

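h = FeatureHasher(n_features=m, input_type='string')
# With input_type='string' each sample must be an iterable of strings,
# so wrap every ID in a one-element list instead of passing the bare string
f = h.transform([[b] for b in review_df['business_id']])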

review_df['business_id'].unique().tolist()[0:5]

['9yKzy9PApeiPPOUJEtnvkg', 'ZRJwVLyzEJq1VAihDhYiow', '6oRAC4uyJCsJl1X0WZpVSA', '_1QQZuf4zZOyFCvXc0o6Vg', '6ozycU1RpktNG2-1BroVtw']

f.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

from sys import getsizeof

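print('Our pandas Series, in bytes: ', getsizeof(review_df['business_id']))
print('Our hashed numpy array, in bytes: ', getsizeof(f))

Our pandas Series, in bytes: 790152
Our hashed numpy array, in bytes: 56

Note that f is a SciPy sparse matrix rather than a NumPy array, and sys.getsizeof reports only the shallow size of the matrix object itself (the 56 bytes above), not the arrays that hold its data. The two numbers are therefore not directly comparable.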
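Since sys.getsizeof sees only the outer object of a sparse matrix, a fairer size estimate adds up the buffers that back it. The sketch below builds a hypothetical stand-in for the hashed matrix (10,000 rows, 4,174 columns, one nonzero per row) instead of loading the Yelp data, and sums the three arrays of the CSR format.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the hashed matrix: one nonzero entry per row.
n_rows, n_cols = 10_000, 4_174
rng = np.random.default_rng(0)
cols = rng.integers(0, n_cols, size=n_rows)
mat = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), cols)), shape=(n_rows, n_cols))

# A CSR matrix keeps its content in three NumPy arrays; sum their buffers.
sparse_bytes = mat.data.nbytes + mat.indices.nbytes + mat.indptr.nbytes

# Size the same matrix would need if stored densely (computed, not materialized).
dense_bytes = n_rows * n_cols * mat.dtype.itemsize

print('sparse storage, in bytes:', sparse_bytes)
print('dense storage, in bytes: ', dense_bytes)
```

The sparse representation is still far smaller than a dense one, just not as small as the 56 bytes reported for the wrapper object.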

Bin counting

import pandas as pd

df = pd.read_csv('train_subset.csv')

len(df['device_id'].unique())  # number of unique device_id values in the training set

1075

df.head()

	id	hour	C1	banner_pos	site_id	site_domain	site_category	app_id	app_domain	...	device_type	device_conn_type	C14	C15	C16	C17	C19	C20	C21
0	1000009418151094273	14102100	1005	0	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	...	1	2	15706	320	50	1722	35	-1	79
1	10000169349117863715	14102100	1005	0	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	...	1	0	15704	320	50	1722	35	100084	79
2	10000371904215119486	14102100	1005	0	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	...	1	0	15704	320	50	1722	35	100084	79
3	10000640724480838376	14102100	1005	0	1fbe01fe	f3845767	28905ebd	ecad2386	7801e8d9	...	1	0	15706	320	50	1722	35	100084	79
4	10000679056417042096	14102100	1005	1	fe8cc448	9166c161	0569f928	ecad2386	7801e8d9	...	1	0	18993	320	50	2161	35	-1	157

5 rows × 24 columns

def click_counting(x, bin_column):
    # Count clicked and non-clicked rows for each value of bin_column
    clicks = pd.Series(
        x[x['click'] > 0][bin_column].value_counts(), name='clicks')
    no_clicks = pd.Series(
        x[x['click'] < 1][bin_column].value_counts(), name='no_clicks')

    # Values that appear in only one of the two Series get a count of 0
    counts = pd.DataFrame([clicks, no_clicks]).T.fillna('0')
    counts['total'] = counts['clicks'].astype('int64') + counts['no_clicks'].astype('int64')
    return counts

def bin_counting(counts):
    # N+ is the fraction of clicks and N- the fraction of non-clicks for each value
    counts['N+'] = counts['clicks'].astype('int64').divide(counts['total'].astype('int64'))
    counts['N-'] = counts['no_clicks'].astype('int64').divide(counts['total'].astype('int64'))
    # Despite its name, this column holds the plain odds ratio N+ / N-; no logarithm is applied
    counts['log_N+'] = counts['N+'].divide(counts['N-'])

    # If we wanted to only return bin-counting properties, we would filter here
    bin_counts = counts.filter(items=['N+', 'N-', 'log_N+'])
    return counts, bin_counts

bin_column = 'device_id'
device_clicks = click_counting(df.filter(items=[bin_column, 'click']), bin_column)
device_all, device_bin_counts = bin_counting(device_clicks)

len(device_bin_counts)

1075

device_all.sort_values(by = 'total', ascending = False).head(4)

	clicks	no_clicks	total	N+	N-	log_N+
a99f214a	1561	7163	8724	0.178932	0.821068	0.217925
c357dbff	2	15	17	0.117647	0.882353	0.133333
a167aa83	0	9	9	0.000000	1.000000	0.000000
3c0208dc	0	9	9	0.000000	1.000000	0.000000
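The columns in this table can be reproduced on a small scale without pandas. The sketch below uses a hypothetical toy click log and standard-library counters; the key names 'N+', 'N-', and 'odds' follow the table, with 'odds' standing in for the column the article calls log_N+.

```python
import math
from collections import Counter

# Hypothetical toy click log: (device_id, click) pairs.
events = [('a', 1), ('a', 0), ('a', 0), ('a', 0),
          ('b', 1), ('b', 1), ('b', 0),
          ('c', 0)]

clicks = Counter(dev for dev, click in events if click > 0)
no_clicks = Counter(dev for dev, click in events if click < 1)

stats = {}
for dev in sorted({dev for dev, _ in events}):
    total = clicks[dev] + no_clicks[dev]
    p_click = clicks[dev] / total          # N+: fraction of events with a click
    p_no_click = no_clicks[dev] / total    # N-: fraction of events without a click
    # Odds ratio N+ / N-; infinite when a device was clicked every time.
    odds = p_click / p_no_click if p_no_click > 0 else math.inf
    stats[dev] = {'N+': p_click, 'N-': p_no_click, 'odds': odds}

for dev, row in stats.items():
    print(dev, row)
```

Each category is replaced by a handful of statistics rather than one column per category. Applying math.log to the odds would give the log-odds, which is undefined for devices with no clicks at all, such as 'c' here.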

from sys import getsizeof

print('Our pandas Series, in bytes: ', getsizeof(df.filter(items=['device_id', 'click'])))
print('Our bin-counting feature, in bytes: ', getsizeof(device_bin_counts))