Python Programming Study: An In-Depth Look at How X, y and X_display, y_display Differ in the shap.datasets.adult() Source

Contents
In-depth analysis of X, y and X_display, y_display in the shap.datasets.adult() source
Reading the source code
Understanding the source code
Comparison of data and raw_data
X.shape
X_display.shape

In-depth analysis of X, y and X_display, y_display in the shap.datasets.adult() source

X,y = shap.datasets.adult()
X_display,y_display = shap.datasets.adult(display=True)

Reading the source code

import numpy as np
import pandas as pd

# Excerpt from shap/datasets.py. cache() and github_data_url are helpers
# defined in that same module and are not reproduced here.
def adult(display=False):
    """ Return the Adult census data in a nice package. """
    dtypes = [
        ("Age", "float32"), ("Workclass", "category"), ("fnlwgt", "float32"),
        ("Education", "category"), ("Education-Num", "float32"), ("Marital Status", "category"),
        ("Occupation", "category"), ("Relationship", "category"), ("Race", "category"),
        ("Sex", "category"), ("Capital Gain", "float32"), ("Capital Loss", "float32"),
        ("Hours per week", "float32"), ("Country", "category"), ("Target", "category")
    ]
    raw_data = pd.read_csv(
        cache(github_data_url + "adult.data"),
        names=[d[0] for d in dtypes],
        na_values="?",
        dtype=dict(dtypes)
    )
    data = raw_data.drop(["Education"], axis=1)  # redundant with Education-Num
    filt_dtypes = list(filter(lambda x: not (x[0] in ["Target", "Education"]), dtypes))
    # Binarize the target; values in the CSV carry a leading space.
    data["Target"] = data["Target"] == " >50K"
    # Hand-written integer codes for the Relationship column.
    rcode = {
        "Not-in-family": 0,
        "Unmarried": 1,
        "Other-relative": 2,
        "Own-child": 3,
        "Husband": 4,
        "Wife": 5
    }
    # Encode every categorical feature as integers.
    for k, dtype in filt_dtypes:
        if dtype == "category":
            if k == "Relationship":
                data[k] = np.array([rcode[v.strip()] for v in data[k]])
            else:
                data[k] = data[k].cat.codes

    if display:
        # Human-readable features (original string labels), same labels as below.
        return raw_data.drop(["Education", "Target", "fnlwgt"], axis=1), data["Target"].values
    # Numerically encoded features.
    return data.drop(["Target", "fnlwgt"], axis=1), data["Target"].values
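The three encoding steps in the function can be tried in isolation. The following is a minimal sketch on a tiny in-memory DataFrame, assuming only pandas and NumPy; the three rows are invented for illustration and are not taken from adult.data.

```python
import numpy as np
import pandas as pd

# Invented toy rows standing in for adult.data (values keep the leading
# space that the real CSV has after each comma).
raw = pd.DataFrame({
    "Workclass": pd.Categorical([" Private", " State-gov", " Private"]),
    "Relationship": pd.Categorical([" Husband", " Not-in-family", " Wife"]),
    "Target": pd.Categorical([" >50K", " <=50K", " >50K"]),
})
data = raw.copy()

# Step 1: turn the target into a boolean column.
data["Target"] = data["Target"] == " >50K"

# Step 2: Relationship is encoded with a hand-written mapping.
rcode = {"Not-in-family": 0, "Unmarried": 1, "Other-relative": 2,
         "Own-child": 3, "Husband": 4, "Wife": 5}
data["Relationship"] = np.array([rcode[v.strip()] for v in data["Relationship"]])

# Step 3: any other categorical column becomes its integer category code.
# Categories are sorted, so " Private" -> 0 and " State-gov" -> 1 here.
data["Workclass"] = data["Workclass"].cat.codes

print(data)
```

Because .cat.codes numbers the sorted categories, the integers carry no meaning beyond identifying a label, and pandas assigns -1 to missing values.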

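Understanding the source code

Comparison of data and raw_data: conclusions
data: a new DataFrame derived from raw_data (the CSV as read in). The function drops the redundant Education column from it, converts Target to a boolean (True when the label is >50K), and encodes every categorical feature as integers: Relationship through the hand-written rcode mapping, all others through their category codes. On return, Target and fnlwgt are dropped as well, so 3 of the 15 columns are removed in total and X has 12 columns, all numeric.
raw_data: the original data as read from the CSV. It is never modified; with display=True the function returns it with the same 3 columns dropped (Education, Target, fnlwgt), so X_display has the same 12 columns as X but keeps the categorical columns as their original string labels.
y and y_display: both calls return data["Target"].values, so the two label arrays hold the same boolean values regardless of display.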
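These conclusions can be checked without downloading the dataset. The sketch below uses a hypothetical toy_adult() function that mirrors the control flow of adult() on two invented rows and a reduced set of columns; it is not part of shap.

```python
import numpy as np
import pandas as pd

def toy_adult(display=False):
    """Hypothetical miniature of adult(): same drop/encode/return logic."""
    raw_data = pd.DataFrame({
        "Age": np.array([39, 50], dtype="float32"),
        "Workclass": pd.Categorical([" State-gov", " Self-emp-not-inc"]),
        "fnlwgt": np.array([1000, 2000], dtype="float32"),  # invented weights
        "Education": pd.Categorical([" Bachelors", " Bachelors"]),
        "Education-Num": np.array([13, 13], dtype="float32"),
        "Target": pd.Categorical([" <=50K", " >50K"]),  # invented labels
    })
    data = raw_data.drop(["Education"], axis=1)
    data["Target"] = data["Target"] == " >50K"
    data["Workclass"] = data["Workclass"].cat.codes
    y = data["Target"].values
    if display:
        return raw_data.drop(["Education", "Target", "fnlwgt"], axis=1), y
    return data.drop(["Target", "fnlwgt"], axis=1), y

X, y = toy_adult()
X_display, y_display = toy_adult(display=True)

print(X.shape, X_display.shape)       # same shape, same column names
print(X.dtypes["Workclass"])          # integer codes
print(X_display.dtypes["Workclass"])  # category (string labels)
print(np.array_equal(y, y_display))   # labels do not depend on display
```

A common reason to load both versions is to fit and explain a model on X while using X_display to show readable feature values in plots.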
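X.shape
X.shape is (32561, 12).
Note: the listing below shows string labels and lowercase column names, so it appears to be a printout of the raw CSV rather than of the encoded X. Per the source above, the categorical columns of X (Workclass, Marital Status, Occupation, Relationship, Race, Sex, Country) hold integer codes, and the column names are those listed in dtypes.

       age         workclass  ...  hours-per-week  native-country
0       39         State-gov  ...              40   United-States
1       50  Self-emp-not-inc  ...              13   United-States
2       38           Private  ...              40   United-States
3       53           Private  ...              40   United-States
4       28           Private  ...              40            Cuba
...    ...               ...  ...             ...             ...
32556   27           Private  ...              38   United-States
32557   40           Private  ...              40   United-States
32558   58           Private  ...              40   United-States
32559   22           Private  ...              20   United-States
32560   52      Self-emp-inc  ...              40   United-States

[32561 rows x 12 columns]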

First 10 rows with all 12 columns:

   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
2  38   Private           9              Divorced               Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States
3  53   Private           7              Married-civ-spouse     Handlers-cleaners  Husband        Black  Male    0             0             40              United-States
4  28   Private           13             Married-civ-spouse     Prof-specialty     Wife           Black  Female  0             0             40              Cuba
5  37   Private           14             Married-civ-spouse     Exec-managerial    Wife           White  Female  0             0             40              United-States
6  49   Private           5              Married-spouse-absent  Other-service      Not-in-family  Black  Female  0             0             16              Jamaica
7  52   Self-emp-not-inc  9              Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             45              United-States
8  31   Private           14             Never-married          Prof-specialty     Not-in-family  White  Female  14084         0             50              United-States
9  42   Private           13             Married-civ-spouse     Exec-managerial    Husband        White  Male    5178          0             40              United-States


X_display.shape
X_display.shape is (32561, 12).

       age         workclass  ...  hours-per-week  native-country
0       39         State-gov  ...              40   United-States
1       50  Self-emp-not-inc  ...              13   United-States
2       38           Private  ...              40   United-States
3       53           Private  ...              40   United-States
4       28           Private  ...              40            Cuba
...    ...               ...  ...             ...             ...
32556   27           Private  ...              38   United-States
32557   40           Private  ...              40   United-States
32558   58           Private  ...              40   United-States
32559   22           Private  ...              20   United-States
32560   52      Self-emp-inc  ...              40   United-States

[32561 rows x 12 columns]

First 2 rows with all 12 columns:

   age  workclass         education-num  marital-status         occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country
0  39   State-gov         13             Never-married          Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States
1  50   Self-emp-not-inc  13             Married-civ-spouse     Exec-managerial    Husband        White  Male    0             0             13              United-States
