AI + Big Data | Feature Engineering (Feature Preprocessing (Dimensionless Scaling))

Contents

  • I. A Rough Explanation
  • II. Normalization
  • III. Standardization ★

I. A Rough Explanation
Feature preprocessing API:
sklearn.preprocessing

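Why normalize or standardize?
Dimensionless scaling:
When features differ greatly in unit or magnitude, a feature with large values can dominate the final result, so the algorithm cannot learn from the other features.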
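To make the point above concrete, here is a minimal sketch using two samples from the data shown later in this article (height and chest measurement of rows 0 and 1). It assumes a distance-based algorithm that compares samples by Euclidean distance; that choice is an illustration, not something the article specifies.

```python
import math

# Two samples from the article's data: (height, chest measurement)
a = (180.0, 0.88877)
b = (190.0, 0.99665)

# Squared contribution of each feature to the Euclidean distance
height_part = (a[0] - b[0]) ** 2
chest_part = (a[1] - b[1]) ** 2

distance = math.dist(a, b)
height_share = height_part / (height_part + chest_part)

print(f"distance = {distance:.4f}")
print(f"share of squared distance from height = {height_share:.6f}")
```

Without scaling, almost the entire distance comes from the height column, simply because its values are numerically larger.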
II. Normalization
Transform the original data so that it is mapped into the interval [0, 1] (the default).
Formula:
[Figure: normalization formula]
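For reference, min-max normalization is usually written as below. This is the standard textbook form, not a transcription of the original figure; the symbols mi and mx (lower and upper bound of the target interval, 0 and 1 by default) are names chosen here for illustration, and min and max are taken per column.

```latex
X' = \frac{x - \min}{\max - \min}, \qquad X'' = X' \cdot (mx - mi) + mi
```

As a check against the data in this article: the height column has min 156 and max 196, so a height of 180 maps to (180 - 156) / (196 - 156) = 0.6.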

We can use MinMaxScaler(feature_range=(0, 1)) from the sklearn library to process the data.
Example:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def minmax_demo():
    """
    Min-max normalization
    :return:
    """
    # 1. Load the data
    data = pd.read_csv('test00.csv')
    # Keep only the first three columns
    data = data.iloc[:, :3]
    print("data:\n", data)
    # 2. Instantiate a transformer
    transfer = MinMaxScaler()
    # 3. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None


if __name__ == '__main__':
    minmax_demo()

All transformed values fall within the interval [0, 1].
data:
     height  weight  chest measurement
0      180      70            0.88877
1      190      80            0.99665
2      168      60            0.65878
3      159      65            0.65598
4      169      56            0.55658
5      173      60            0.46058
6      186      76            0.69978
7      178      60            0.64979
8      175      75            0.89895
9      176      60            0.88488
10     177      90            0.79595
11     168     100            0.48789
12     158     102            0.55646
13     168      60            0.69585
14     179      80            0.65785
15     183      70            0.69578
16     190      66            0.89586
17     196      88            0.96527
18     187      91            0.62488
19     182      90            0.58484
20     158      70            0.58947
21     159      55            0.58484
22     166      55            0.59896
23     178      54            0.48487
24     163      69            0.68745
25     156      55            0.52621
26     189      89            0.66959
27     156      56            0.59595
28     189      98            0.59716
29     169      66            0.65479
30     179      55            0.99598
31     177      68            0.55257
32     166      76            0.69784
33     169      86            0.68745
34     189      89            0.69988
35     188      68            0.78955
36     176      59            0.55999
37     177      60            0.68747
38     196      80            0.64888
data_new:
 [[0.6        0.33333333 0.79875762]
  [0.85       0.54166667 1.        ]
  [0.3        0.125      0.36972783]
  [0.075      0.22916667 0.36450464]
  [0.325      0.04166667 0.17908109]
  [0.425      0.125      0.        ]
  [0.75       0.45833333 0.44621038]
  [0.55       0.125      0.35295764]
  [0.475      0.4375     0.81774768]
  [0.5        0.125      0.79150111]
  [0.525      0.75       0.6256086 ]
  [0.3        0.95833333 0.05094484]
  [0.05       1.         0.17885724]
  [0.3        0.125      0.43887925]
  [0.575      0.54166667 0.36799299]
  [0.675      0.33333333 0.43874867]
  [0.85       0.25       0.81198351]
  [1.         0.70833333 0.94146287]
  [0.775      0.77083333 0.30648982]
  [0.65       0.75       0.23179809]
  [0.05       0.33333333 0.24043502]
  [0.075      0.02083333 0.23179809]
  [0.25       0.02083333 0.25813793]
  [0.55       0.         0.04531125]
  [0.175      0.3125     0.42320966]
  [0.         0.02083333 0.12242804]
  [0.825      0.72916667 0.38989311]
  [0.         0.04166667 0.25252299]
  [0.825      0.91666667 0.25478016]
  [0.325      0.25       0.36228478]
  [0.575      0.02083333 0.99875016]
  [0.525      0.29166667 0.17160072]
  [0.25       0.45833333 0.44259145]
  [0.325      0.66666667 0.42320966]
  [0.825      0.72916667 0.44639693]
  [0.8        0.29166667 0.61366986]
  [0.5        0.10416667 0.1854422 ]
  [0.525      0.125      0.42324696]
  [1.         0.54166667 0.3512601 ]]

Process finished with exit code 0
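MinMaxScaler also accepts a feature_range argument for mapping into an interval other than [0, 1]. The sketch below uses a small in-memory array instead of test00.csv; the three rows are illustrative values chosen to span the height and weight ranges of the article's data, and the target interval (2, 3) is an arbitrary choice.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative rows (height, weight) spanning the ranges in the article's data
X = np.array([[156.0, 54.0],
              [180.0, 70.0],
              [196.0, 102.0]])

# Default: each column is mapped to [0, 1]
default_scaled = MinMaxScaler().fit_transform(X)

# Custom interval: each column is mapped to [2, 3]
custom_scaled = MinMaxScaler(feature_range=(2, 3)).fit_transform(X)

print("default:\n", default_scaled)
print("feature_range=(2, 3):\n", custom_scaled)
```

Each column is scaled independently, so the middle row (180, 70) lands at the same relative position as row 0 of the output above.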

Drawback of normalization: if the maximum or minimum value is an outlier, the result is strongly affected.
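A small sketch of this drawback, using a hand-written min-max function rather than sklearn. The helper name min_max and the outlier value 1960.0 (imagine a data-entry typo) are hypothetical and only serve the illustration.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1] using its own min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights = [156.0, 168.0, 180.0, 196.0]
with_outlier = heights + [1960.0]  # hypothetical outlier

clean = min_max(heights)
skewed = min_max(with_outlier)

print("without outlier:", clean)
print("with outlier:   ", skewed)
```

With the outlier present, all of the ordinary heights are squeezed into a narrow band near 0, so the differences between them almost vanish.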
III. Standardization ★
Transform the original data so that each feature has a mean of 0 and a standard deviation of 1.
[Figure: standardization formula]
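For reference, standardization (the z-score) is usually written as below. This is the standard form rather than a transcription of the original figure; mean and σ are the per-column mean and standard deviation.

```latex
X' = \frac{x - \mathrm{mean}}{\sigma}
```

As a rough check against the data in this article: the height column has a mean of about 175.44 and a standard deviation of about 11.24, so a height of 180 maps to roughly (180 - 175.44) / 11.24 ≈ 0.41.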

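With standardization, outliers generally have less effect on the final result: given enough samples, a few outliers change the mean and standard deviation only slightly.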
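One way to see this is to compare how much each method's scaling denominator changes when an outlier is added. The sketch below uses the 39 heights from the article's data; the extra value 230 is a hypothetical outlier that is not in the original data, and comparing growth factors is just one possible way to frame the comparison.

```python
import statistics

# Heights from the article's data (39 samples)
heights = [180, 190, 168, 159, 169, 173, 186, 178, 175, 176, 177, 168, 158,
           168, 179, 183, 190, 196, 187, 182, 158, 159, 166, 178, 163, 156,
           189, 156, 189, 169, 179, 177, 166, 169, 189, 188, 176, 177, 196]
with_outlier = heights + [230]  # hypothetical outlier


def value_range(values):
    """Denominator used by min-max normalization."""
    return max(values) - min(values)


# How much each denominator grows once the outlier is included
range_growth = value_range(with_outlier) / value_range(heights)
std_growth = statistics.pstdev(with_outlier) / statistics.pstdev(heights)

print(f"range grows by a factor of {range_growth:.2f}")
print(f"standard deviation grows by a factor of {std_growth:.2f}")
```

The range depends entirely on the two extreme values, while the standard deviation averages over all samples, so a single outlier moves it less. With very few samples or a very extreme outlier, standardization is affected noticeably as well.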
Use the sklearn API: StandardScaler()
Example:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler


def stand_demo():
    """
    Standardization
    :return:
    """
    # 1. Load the data
    data = pd.read_csv('test00.csv')
    # Keep only the first three columns
    data = data.iloc[:, :3]
    print("data:\n", data)
    # 2. Instantiate a transformer
    # transfer = MinMaxScaler()
    transfer = StandardScaler()
    # 3. Call fit_transform()
    data_new = transfer.fit_transform(data)
    print("data_new:\n", data_new)
    return None


if __name__ == '__main__':
    stand_demo()

data:
     height  weight  chest measurement
0      180      70            0.88877
1      190      80            0.99665
2      168      60            0.65878
3      159      65            0.65598
4      169      56            0.55658
5      173      60            0.46058
6      186      76            0.69978
7      178      60            0.64979
8      175      75            0.89895
9      176      60            0.88488
10     177      90            0.79595
11     168     100            0.48789
12     158     102            0.55646
13     168      60            0.69585
14     179      80            0.65785
15     183      70            0.69578
16     190      66            0.89586
17     196      88            0.96527
18     187      91            0.62488
19     182      90            0.58484
20     158      70            0.58947
21     159      55            0.58484
22     166      55            0.59896
23     178      54            0.48487
24     163      69            0.68745
25     156      55            0.52621
26     189      89            0.66959
27     156      56            0.59595
28     189      98            0.59716
29     169      66            0.65479
30     179      55            0.99598
31     177      68            0.55257
32     166      76            0.69784
33     169      86            0.68745
34     189      89            0.69988
35     188      68            0.78955
36     176      59            0.55999
37     177      60            0.68747
38     196      80            0.64888
data_new:
 [[ 0.40612393 -0.13933189  1.4864856 ]
  [ 1.29594603  0.56637508  2.26419106]
  [-0.66166258 -0.84503885 -0.17150918]
  [-1.46250247 -0.49218537 -0.19169434]
  [-0.57268037 -1.12732164 -0.90826759]
  [-0.21675154 -0.84503885 -1.60033029]
  [ 0.94001719  0.28409229  0.12405926]
  [ 0.22815951 -0.84503885 -0.23631797]
  [-0.03878712  0.21352159  1.55987308]
  [ 0.05019509 -0.84503885  1.45844265]
  [ 0.1391773   1.27208204  0.81734748]
  [-0.66166258  1.977789   -1.40345287]
  [-1.55148468  2.11893039 -0.90913267]
  [-0.66166258 -0.84503885  0.09572795]
  [ 0.31714172  0.56637508 -0.17821354]
  [ 0.67307056 -0.13933189  0.09522332]
  [ 1.29594603 -0.42161467  1.53759732]
  [ 1.82983928  1.13094065  2.03797306]
  [ 1.0289994   1.34265273 -0.41589382]
  [ 0.58408835  1.27208204 -0.70454163]
  [-1.55148468 -0.13933189 -0.67116403]
  [-1.46250247 -1.19789233 -0.70454163]
  [-0.839627   -1.19789233 -0.60275075]
  [ 0.22815951 -1.26846303 -1.42522401]
  [-1.10657363 -0.20990258  0.03517246]
  [-1.7294491  -1.19789233 -1.12720451]
  [ 1.20696382  1.20151134 -0.09358004]
  [-1.7294491  -1.12732164 -0.6244498 ]
  [ 1.20696382  1.83664761 -0.61572692]
  [-0.57268037 -0.42161467 -0.20027304]
  [ 0.31714172 -1.19789233  2.25936103]
  [ 0.1391773  -0.28047328 -0.93717563]
  [-0.839627    0.28409229  0.11007383]
  [-0.57268037  0.98979925  0.03517246]
  [ 1.20696382  1.20151134  0.12478016]
  [ 1.11798161 -0.28047328  0.77120997]
  [ 0.05019509 -0.91560955 -0.88368495]
  [ 0.1391773  -0.84503885  0.03531664]
  [ 1.82983928  0.56637508 -0.24287815]]
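To confirm what StandardScaler computes, the sketch below compares it with a manual calculation on a small in-memory array. The three rows are illustrative values, not rows read from test00.csv. Note that StandardScaler divides by the population standard deviation (ddof=0), which is also NumPy's default for std().

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rows (height, weight); not read from test00.csv
X = np.array([[156.0, 54.0],
              [180.0, 70.0],
              [196.0, 102.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Same computation by hand: subtract the column mean,
# then divide by the population standard deviation (ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("column means after scaling:", X_std.mean(axis=0))
print("column stds after scaling: ", X_std.std(axis=0))
print("matches manual computation:", np.allclose(X_std, manual))
```

After fitting, the learned per-column statistics are available as scaler.mean_ and scaler.scale_, which is useful when the same transformation has to be applied to new data with transform().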
