数据挖掘十大经典算法学习之C4.5决策树分类算法及信息熵相关数据挖掘学习笔记

Definition 决策树学习时应用最广的归纳推理算法之一。[1]它是一种逼近离散值函数的方法，对噪声数据有很好的健壮性且能够学习析取表达式。CLS， ID3，C4.5，CART均是决策树学习算法。
[1]归纳学习成立存在一个基本假设：任一假设如果在足够大的训练样例集中很好地逼近目标函数，它也能在未见实例中很好地逼近目标函数。“泛化能力”
决策树学习的归纳偏置是优先选择较小的树。
决策树通过把实例从根节点排列到某个叶子节点来分类实例，叶子节点即为实例所属的分类。树上的每一个结点指定了对实例的某个属性的测试，并且该结点的每一个后继分支对应于该属性的一个可能值。分类实例的方法是从这个树的根结点开始，测试这个结点指定的属性，然后按照给定实例的该属性值对应的树枝向下移动。然后这个过程在以新结点为根的子树上重复。
表-1为根据天气情况判断是否适合出去玩的数据集。
表-1

	Outlook	Temperature	Humidity	Windy	Play
1	sunny	hot	high	FALSE	no
2	sunny	hot	high	TRUE	no
3	overcast	hot	high	FALSE	yes
4	rainy	mild	high	FALSE	yes
5	rainy	cool	normal	FALSE	yes
6	rainy	cool	normal	TRUE	no
7	overcast	cool	normal	TRUE	yes
8	sunny	mild	high	FALSE	no
9	sunny	cool	normal	FALSE	yes
10	rainy	mild	normal	FALSE	yes
11	sunny	mild	normal	TRUE	yes
12	overcast	mild	high	TRUE	yes
13	overcast	hot	normal	FALSE	yes
14	rainy	mild	high	TRUE	no

图-1画出了一颗学习到的决策树。这个树根据天气情况分类“是否适合出去玩”。

图-1
Problem 为了使决策树最优，哪一个属性将在树的根节点被测试？分类能力最好的属性被选作树的根结点的测试。采用不同测试属性及其先后顺序将会生成不同的决策树。
定义一个统计属性，称为“信息增益”（information gain），用来衡量给定的属性区分训练样例的能力。度量信息增益的标准为“熵”（entropy）。信息量就是不确定性的多少，熵越大，信息的不确定性越大。
自信息量：log(1/P)
H(x)=?∑x∈XP(x)log2P(x)//P(x)表示x发生的概率。
信息增益：Gain(S,A)≡Entropy(S)?∑v∈Values(A) |Sv |/|S| Entropy(Sv)
Values(A)是属性A所有可能值得集合，Sv是S中属性A的值为v的子集。该等式的第一项就是原集合S的熵，第二项是用A分类后S的熵的期望值。第二项描述的期望熵就是每个子集的熵的加权和，权值为属于Sv的样例占原始样例S的比例|Sv |/|S|。
以表-1中的数据集为例。该数据集S有14个样例，9个正例（play=yes），5个负例（play=no）。
Entropy(S)=-(9/14)log2(9/14)-(5/14)log2(5/14)=0.940
整个数据集S中：Outlook=sunny,6; Outlook=rainy,4; Outlook=overcast,4.

	Outlook	Temperature	Humidity	Windy	Play
1	sunny	hot	high	FALSE	no
2	sunny	hot	【数据挖掘十大经典算法学习之C4.5决策树分类算法及信息熵相关】high	TRUE	no
3	overcast	hot	high	FALSE	yes
4	rainy	mild	high	FALSE	yes
5	rainy	cool	normal	FALSE	yes
6	rainy	cool	normal	TRUE	no
7	overcast	cool	normal	TRUE	yes
8	sunny	mild	high	FALSE	no
9	sunny	cool	normal	FALSE	yes
10	rainy	mild	normal	FALSE	yes
11	sunny	mild	normal	TRUE	yes
12	overcast	mild	high	TRUE	yes
13	overcast	hot	normal	FALSE	yes
14	rainy	mild	high	TRUE	no

Outlook=sunny, play=yes,2; Outlook=sunny, play=no,3; 共5个实例。
Entropy(Ssunny)= -(2/5)log2(2/5)-(3/5)log2(3/5)=0.971

	Outlook	Temperature	Humidity	Windy	Play
1	sunny	hot	high	FALSE	no
2	sunny	hot	high	TRUE	no
3	overcast	hot	high	FALSE	yes
4	rainy	mild	high	FALSE	yes
5	rainy	cool	normal	FALSE	yes
6	rainy	cool	normal	TRUE	no
7	overcast	cool	normal	TRUE	yes
8	sunny	mild	high	FALSE	no
9	sunny	cool	normal	FALSE	yes
10	rainy	mild	normal	FALSE	yes
11	sunny	mild	normal	TRUE	yes
12	overcast	mild	high	TRUE	yes
13	overcast	hot	normal	FALSE	yes
14	rainy	mild	high	TRUE	no

Outlook=rainy, play=yes,3; Outlook=rainy, play=no,2; 共5个实例。
Entropy(Srainy)= -(3/5)log2(3/5)-(2/5)log2(2/5)=0.971

	Outlook	Temperature	Humidity	Windy	Play
1	sunny	hot	high	FALSE	no
2	sunny	hot	high	TRUE	no
3	overcast	hot	high	FALSE	yes
4	rainy	mild	high	FALSE	yes
5	rainy	cool	normal	FALSE	yes
6	rainy	cool	normal	TRUE	no
7	overcast	cool	normal	TRUE	yes
8	sunny	mild	high	FALSE	no
9	sunny	cool	normal	FALSE	yes
10	rainy	mild	normal	FALSE	yes
11	sunny	mild	normal	TRUE	yes
12	overcast	mild	high	TRUE	yes
13	overcast	hot	normal	FALSE	yes
14	rainy	mild	high	TRUE	no

Outlook=overcast, play=yes,4; Outlook= overcast, play=no,0; 共4个实例。
Entropy(Sovercast)= -(4/4)log2(4/4)-(0/4)log2(0/4)=0//熵的定义中规定，如果所有成员属于同一类，那么熵为0。
Gain(S,Outlook)＝Entropy(S)?∑v∈Values(Outlook) |Sv |/|S| Entropy(Sv)
=Entropy(S)?(5/14)* Entropy(Ssunny) ?(5/14)* Entropy(Srainy) ?4/14 * Entropy(Sovercast)
=0.940-(5/14)* 0.971-(5/14)* 0.971-0=0.246
同样，可以计算Gain(S,Wind)=0.048，Gain(S,Humidity)=0.151，Gain(S,Temperature)=0.029。
属性Outlook的信息增益最大，也就是说，它能够减少的信息不确定性最多。因此，选择Outlook作为根节点的决策属性。

Also.to be continued. 需新增信息增益率相关知识。
Acknowledgements&References：
部分材料采自于实验室董X同学的讲座，谢谢董X友情提供资料。理论知识部分摘自于Tom M.Mithchell的《机器学习》。