A Detailed Guide to Common Parameter Initialization Methods in PyTorch

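1. Uniform initialization: torch.nn.init.uniform_(tensor, a=0, b=1)
Fills the tensor with values sampled from the uniform distribution $U(a, b)$.
Parameters:

- tensor - the tensor to fill
- a - lower bound of the uniform distribution
- b - upper bound of the uniform distribution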

Example:

    import torch
    import torch.nn as nn

    w = torch.empty(3, 5)
    nn.init.uniform_(w)  # in place; with the defaults a=0, b=1 every entry is drawn from U(0, 1)

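The uniform distribution in detail:
If $x$ follows a uniform distribution, $x \sim U(a, b)$, its probability density function (which describes how likely each value of the random variable is) is
$f(x)=\left\{\begin{array}{ll}\frac{1}{b-a}, & a<x<b \\ 0, & \text{otherwise}\end{array}\right.$
Its expectation and variance are then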
$\begin{array}{c}E(x)=\int_{-\infty}^{\infty} x f(x) d x=\frac{1}{2}(a+b) \\D(x)=E\left(x^{2}\right)-[E(x)]^{2}=\frac{(b-a)^{2}}{12}\end{array}$
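The closed-form mean and variance can be sanity-checked by sampling. The following is a minimal sketch using only the Python standard library; the bounds `a = -2`, `b = 4`, the seed, and the helper name `uniform_moments` are illustrative choices, not part of the original text.

```python
import random

def uniform_moments(a, b):
    """Closed-form mean and variance of U(a, b)."""
    return (a + b) / 2, (b - a) ** 2 / 12

random.seed(0)            # fixed seed so the check is repeatable
a, b = -2.0, 4.0          # illustrative bounds (an assumption)
n = 100_000
samples = [random.uniform(a, b) for _ in range(n)]

# Empirical mean and (population) variance of the samples
emp_mean = sum(samples) / n
emp_var = sum((x - emp_mean) ** 2 for x in samples) / n

mean, var = uniform_moments(a, b)
print(mean, var)          # 1.0 3.0
print(emp_mean, emp_var)  # close to the values above
```

The empirical values should agree with $(a+b)/2$ and $(b-a)^2/12$ up to sampling noise, which shrinks as `n` grows.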
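2. Normal (Gaussian) initialization: torch.nn.init.normal_(tensor, mean=0.0, std=1.0)
Fills the tensor with values drawn from the normal distribution $N(\text{mean}, \text{std}^{2})$ with the given mean and standard deviation.
Parameters:

- tensor - the tensor to fill
- mean - mean of the normal distribution
- std - standard deviation of the normal distribution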

Example:

    import torch

    w = torch.empty(3, 5)
    torch.nn.init.normal_(w, mean=0, std=1)
    # Sample output (values differ from run to run):
    # tensor([[-1.3903,  0.4045,  0.3048,  0.7537, -0.5189],
    #         [-0.7672,  0.1891, -0.2226,  0.2913,  0.1295],
    #         [ 1.4719, -0.3049,  0.3144, -1.0047, -0.5424]])

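The normal distribution in detail:
If the random variable $x$ follows a normal distribution, $x \sim N\left(\mu, \sigma^{2}\right)$, its probability density function is
$f(x)=\frac{1}{\sigma \sqrt{2 \pi}} \exp \left(-\frac{(x-\mu)^{2}}{2 \sigma^{2}}\right)$
Some notable probabilities under the normal density:

- 68.268949% of the area lies within one standard deviation of the mean ($\mu \pm \sigma$)
- 95.449974% of the area lies within two standard deviations of the mean ($\mu \pm 2 \sigma$)
- 99.730020% of the area lies within three standard deviations of the mean ($\mu \pm 3 \sigma$)
- 99.993666% of the area lies within four standard deviations of the mean ($\mu \pm 4 \sigma$)

The normal distribution with $\mu=0$, $\sigma=1$ is the standard normal distribution.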
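The percentages listed above follow from the identity $P(|x-\mu| \le k\sigma)=\operatorname{erf}(k/\sqrt{2})$. A short standard-library sketch that reproduces them (the helper name `coverage` is made up for this illustration):

```python
import math

def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(f"within {k} sigma: {coverage(k) * 100:.6f}%")
```

Because the result depends only on `k`, the same percentages hold for any choice of `mean` and `std` passed to `normal_`.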
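3. Xavier initialization
3.1 Xavier uniform initialization: torch.nn.init.xavier_uniform_(tensor, gain=1.0)

Also known as Glorot initialization. Following the method described by Glorot, X. & Bengio, Y. (2010) in "Understanding the difficulty of training deep feedforward neural networks", it fills the input tensor with values sampled from the uniform distribution $U(-a, a)$, where $a$ is given by
$a=\text{gain} \times \sqrt{\frac{6}{\text{fan\_in}+\text{fan\_out}}}$
Here fan_in and fan_out are the numbers of input and output units of the layer.
Example: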

    import torch
    import torch.nn as nn

    w = torch.empty(3, 5)
    nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))
    # Sample output (values differ from run to run):
    # tensor([[ 0.7695, -0.7687, -0.2561, -0.5307,  0.5195],
    #         [-0.6187,  0.4913,  0.3037, -0.6374,  0.9725],
    #         [-0.2658, -0.4051, -1.1006, -1.1264, -0.1310]])
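To see how the bound works out for this example, here is a small standard-library sketch. It assumes the usual convention for a 2-D weight of shape (3, 5), namely fan_out = 3 and fan_in = 5, and uses $\sqrt{2}$ as the ReLU gain (the value `calculate_gain('relu')` returns). The helper name `xavier_uniform_bound` is hypothetical.

```python
import math

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound a of U(-a, a) used by Xavier uniform initialization."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

relu_gain = math.sqrt(2.0)  # assumed to match calculate_gain('relu')
a = xavier_uniform_bound(fan_in=5, fan_out=3, gain=relu_gain)
print(round(a, 4))  # 1.2247

# Values copied from the sample output in the example above
sample = [0.7695, -0.7687, -0.2561, -0.5307, 0.5195,
          -0.6187, 0.4913, 0.3037, -0.6374, 0.9725,
          -0.2658, -0.4051, -1.1006, -1.1264, -0.1310]
print(all(abs(v) <= a for v in sample))  # True
```

Every value in the sample output stays inside $[-a, a]$, as expected for a draw from $U(-a, a)$.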

3.2 Xavier normal initialization: torch.nn.init.xavier_normal_(tensor, gain=1.0)
Also known as Glorot initialization. Following the method described by Glorot, X. & Bengio, Y. (2010) in "Understanding the difficulty of training deep feedforward neural networks", it fills the input tensor with values sampled from the normal distribution $N\left(0, \text{std}^{2}\right)$, where std is given by
$\text{std}=\text{gain} \times \sqrt{\frac{2}{\text{fan\_in}+\text{fan\_out}}}$

Parameters:

- tensor - the tensor to initialize
- gain - an optional scaling factor

Example:

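    import torch

    w = torch.arange(10, dtype=torch.float32).view(2, -1)  # shape (2, 5); the values are overwritten
    torch.nn.init.xavier_normal_(w)
    # Sample output (values differ from run to run):
    # tensor([[-0.3139, -0.3557,  0.1285, -0.9556,  0.3255],
    #         [-0.6212,  0.3405, -0.4150, -1.3227, -0.0069]])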
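The formulas in this section depend on fan_in and fan_out, so it may help to see how they are typically derived from a weight shape. The sketch below assumes the layout `(out, in, *kernel_dims)` that PyTorch uses for linear and convolutional weights. The helper names `fans` and `xavier_normal_std`, and the convolution shape `(16, 3, 3, 3)`, are illustrative assumptions.

```python
import math

def fans(shape):
    """Return (fan_in, fan_out) for a weight of shape (out, in, *kernel_dims)."""
    if len(shape) < 2:
        raise ValueError("fan_in and fan_out need at least 2 dimensions")
    receptive_field = math.prod(shape[2:])  # 1 for a plain 2-D weight
    return shape[1] * receptive_field, shape[0] * receptive_field

def xavier_normal_std(shape, gain=1.0):
    """Standard deviation used by Xavier normal initialization."""
    fan_in, fan_out = fans(shape)
    return gain * math.sqrt(2.0 / (fan_in + fan_out))

print(fans((2, 5)))                         # (5, 2)
print(fans((16, 3, 3, 3)))                  # (27, 144), hypothetical 3x3 conv kernel
print(round(xavier_normal_std((2, 5)), 4))  # 0.5345
```

For the (2, 5) tensor in the example above, std is about 0.53, which is consistent with the magnitude of the sampled values.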

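4. Kaiming initialization
4.1 Kaiming uniform initialization: torch.nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
Also known as He initialization. Following the method described by He, K. et al. (2015) in "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification", it fills the input tensor with values sampled from the uniform distribution $U(-\text{bound}, \text{bound})$, where bound is given by
$\text{bound}=\text{gain} \times \sqrt{\frac{3}{\text{fan\_mode}}}$
Here fan_mode is fan_in or fan_out, as selected by the mode parameter.
Parameters:

- tensor - the tensor to initialize;
- a - the negative slope of the rectifier used after this layer, used to compute $\text{gain}=\sqrt{\frac{2}{1+a^{2}}}$ (only takes effect when nonlinearity is 'leaky_relu');
- mode - either 'fan_in' (default) or 'fan_out'. 'fan_in' preserves the magnitude of the variance of the weights in the forward pass; 'fan_out' preserves it in the backward pass;
- nonlinearity - the non-linear function (its nn.functional name); PyTorch recommends using it only with 'relu' or 'leaky_relu' (default);

Example: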

    import torch

    w = torch.empty(3, 5)
    torch.nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')
    # Sample output (values differ from run to run):
    # tensor([[-0.4362, -0.8177, -0.7034,  0.7306, -0.6457],
    #         [-0.5749, -0.6480, -0.8016, -0.1434,  0.0785],
    #         [ 1.0369, -0.0676,  0.7430, -0.2484, -0.0895]])
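As a rough check of the bound for this example, the sketch below evaluates the gain and bound formulas with the standard library only. It assumes that for a (3, 5) weight with mode='fan_in' the fan is 5, and that the ReLU gain equals the leaky-ReLU gain with slope 0. The helper names are hypothetical.

```python
import math

def leaky_relu_gain(a=0.0):
    """gain = sqrt(2 / (1 + a^2)); a = 0 gives the ReLU gain sqrt(2)."""
    return math.sqrt(2.0 / (1.0 + a ** 2))

def kaiming_uniform_bound(fan, gain):
    """Bound of U(-bound, bound) used by Kaiming uniform initialization."""
    return gain * math.sqrt(3.0 / fan)

bound = kaiming_uniform_bound(fan=5, gain=leaky_relu_gain(0.0))
print(round(bound, 4))  # 1.0954

# A nonzero slope slightly reduces the gain (0.01 is an illustrative value)
print(leaky_relu_gain(0.01) < leaky_relu_gain(0.0))  # True

# Values copied from the sample output in the example above
sample = [-0.4362, -0.8177, -0.7034, 0.7306, -0.6457,
          -0.5749, -0.6480, -0.8016, -0.1434, 0.0785,
          1.0369, -0.0676, 0.7430, -0.2484, -0.0895]
print(all(abs(v) <= bound for v in sample))  # True
```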

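4.2 Kaiming normal initialization: torch.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
Also known as He initialization. Following the method described by He, K. et al. (2015) in "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification", it fills the input tensor with values sampled from the normal distribution $N\left(0, \text{std}^{2}\right)$, where std is given by
$\text{std}=\frac{\text{gain}}{\sqrt{\text{fan\_mode}}}$
Parameters:

- tensor - the tensor to initialize;
- a - the negative slope of the rectifier used after this layer, used to compute $\text{gain}=\sqrt{\frac{2}{1+a^{2}}}$ (only takes effect when nonlinearity is 'leaky_relu');
- mode - either 'fan_in' (default) or 'fan_out'. 'fan_in' preserves the magnitude of the variance of the weights in the forward pass; 'fan_out' preserves it in the backward pass;
- nonlinearity - the non-linear function (its nn.functional name); PyTorch recommends using it only with 'relu' or 'leaky_relu' (default);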
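To illustrate how mode changes the result, the following standard-library sketch evaluates $\text{std}=\text{gain}/\sqrt{\text{fan\_mode}}$, the formula PyTorch documents for `kaiming_normal_`, for a 2-D weight. The shape (3, 5) and the helper name `kaiming_normal_std` are illustrative assumptions, and the helper handles 2-D shapes only.

```python
import math

def kaiming_normal_std(shape, a=0.0, mode="fan_in"):
    """std = gain / sqrt(fan) for a 2-D weight of shape (out, in)."""
    if mode not in ("fan_in", "fan_out"):
        raise ValueError(f"unsupported mode: {mode}")
    fan_out, fan_in = shape  # 2-D weights only in this sketch
    fan = fan_in if mode == "fan_in" else fan_out
    gain = math.sqrt(2.0 / (1.0 + a ** 2))
    return gain / math.sqrt(fan)

print(round(kaiming_normal_std((3, 5), mode="fan_in"), 4))   # 0.6325
print(round(kaiming_normal_std((3, 5), mode="fan_out"), 4))  # 0.8165
```

With more inputs than outputs, 'fan_in' gives the smaller standard deviation, since the variance is divided by the larger fan.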

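5. Orthogonal initialization: torch.nn.init.orthogonal_(tensor, gain=1)
Fills the input tensor with a (semi-)orthogonal matrix, as described in Saxe, A. et al. (2013), "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks". The input tensor must have at least 2 dimensions; for tensors with more than 2 dimensions, the trailing dimensions are flattened.
Orthogonal initialization decorrelates the filters, which can make convolution kernels more compact and can make it easier for the model to learn useful parameters.
Parameters:

- tensor - the tensor to initialize
- gain - an optional scaling factor

Example: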

    import torch

    w = torch.empty(3, 5)
    torch.nn.init.orthogonal_(w)
    # Sample output (values differ from run to run):
    # tensor([[ 0.7395, -0.1503,  0.4474,  0.4321, -0.2090],
    #         [-0.2625,  0.0112,  0.6515, -0.4770, -0.5282],
    #         [ 0.4554,  0.6548,  0.0970, -0.4851,  0.3453]])
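For a 3 x 5 tensor, "semi-orthogonal" means the rows are orthonormal, so $W W^{T}$ should equal the identity. The standard-library sketch below checks this for the sample output shown above. The values are rounded to four decimals, so the match is only approximate; the helper name `dot` is made up here.

```python
# Rows copied from the sample output of orthogonal_ above
w = [[0.7395, -0.1503, 0.4474, 0.4321, -0.2090],
     [-0.2625, 0.0112, 0.6515, -0.4770, -0.5282],
     [0.4554, 0.6548, 0.0970, -0.4851, 0.3453]]

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))

# Gram matrix W W^T: 1 on the diagonal and 0 elsewhere for orthonormal rows
gram = [[dot(r, s) for s in w] for r in w]
max_dev = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
print(max_dev < 1e-3)  # True
```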

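6. Sparse initialization: torch.nn.init.sparse_(tensor, sparsity, std=0.01)
Fills the 2-D input tensor as a sparse matrix, where the non-zero elements are drawn from the normal distribution $N\left(0, \text{std}^{2}\right)$, that is $N\left(0, 0.01^{2}\right)$ with the default std. See Martens, J. (2010), "Deep learning via Hessian-free optimization".
Parameters:

- tensor - the tensor to fill
- sparsity - the fraction of elements in each column to be set to zero
- std - standard deviation of the normal distribution used to generate the non-zero values

Example: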

    import torch

    w = torch.empty(3, 5)
    torch.nn.init.sparse_(w, sparsity=0.1)
    # Sample output (values differ from run to run):
    # tensor([[-0.0026,  0.0000,  0.0100,  0.0046,  0.0048],
    #         [ 0.0106, -0.0046,  0.0000,  0.0000,  0.0000],
    #         [ 0.0000, -0.0005,  0.0150, -0.0097, -0.0100]])
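The sample output has exactly one zero in each column even though sparsity is only 0.1. This is consistent with the zero count per column being rounded up, i.e. ceil(sparsity * rows); that rounding rule is my understanding of PyTorch's behavior and should be treated as an assumption. A standard-library sketch of the check, with a hypothetical helper name:

```python
import math

def zeros_per_column(rows, sparsity):
    """Number of entries zeroed in each column, assuming the count is rounded up."""
    return int(math.ceil(sparsity * rows))

# Rows copied from the sample output of sparse_ above
w = [[-0.0026, 0.0000, 0.0100, 0.0046, 0.0048],
     [0.0106, -0.0046, 0.0000, 0.0000, 0.0000],
     [0.0000, -0.0005, 0.0150, -0.0097, -0.0100]]

expected = zeros_per_column(rows=3, sparsity=0.1)
counts = [sum(1 for row in w if row[c] == 0.0) for c in range(5)]
print(expected, counts)  # 1 [1, 1, 1, 1, 1]
```

Under that assumption, with so few rows even a small sparsity zeroes a third of each column, which is worth keeping in mind for small layers.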

7. Constant initialization: torch.nn.init.constant_(tensor, val)
Fills the tensor with the constant value val.
Example:

    import torch
    import torch.nn as nn

    w = torch.empty(3, 5)
    nn.init.constant_(w, 1.2)
    # tensor([[1.2000, 1.2000, 1.2000, 1.2000, 1.2000],
    #         [1.2000, 1.2000, 1.2000, 1.2000, 1.2000],
    #         [1.2000, 1.2000, 1.2000, 1.2000, 1.2000]])

8. Identity initialization: torch.nn.init.eye_(tensor)
Initializes a 2-D tensor as the identity matrix.
Example: