An Empirical Analysis of the deform_conv2d Operation in torchvision

Author: Lart
Editor: 3D视觉开发者社区 (3D Vision Developer Community)

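Overview: This article analyzes the parameters of deformable convolution through experiments and provides some experimental code, with the aim of helping readers understand how the operation computes its result without having to read the underlying implementation.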
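Recent versions of torchvision have added support for deformable convolution, covering both the v1 and v2 variants.
Deformable convolution cleverly ties the coordinates of the sampling points to the sampled values, so the sampling offsets can be learned and updated together with the rest of the model. This position-adaptive property has led to wide use across the three major computer vision tasks of object detection, segmentation, and classification, for example in the recent Deformable DETR and CycleMLP (https://www.yuque.com/lart/papers/om3xb6). This article focuses on the latter.
CycleMLP implements a spatial shift operation through deformable convolution (for more details, see "Pytorch中Spatial-Shift-Operation的5种实现策略", i.e. "Five Implementation Strategies for the Spatial-Shift-Operation in PyTorch"), which draws attention to the potential of the sampling-offset mechanism itself. The torchvision documentation describes this operation too briefly to make the exact meaning and usage of each parameter clear, which is why this article was written.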
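Parameter overview
input (Tensor[batch_size, in_channels, in_height, in_width]): input tensor. The input data.
offset (Tensor[batch_size, 2 * offset_groups * kernel_height * kernel_width, out_height, out_width]): offsets to be applied for each position in the convolution kernel. These shift the positions on the input feature map at which each kernel weight is applied, i.e. they adjust the sampling points. Offsets are tied to the input channels, so offset_groups is at most in_channels and at least 1.
weight (Tensor[out_channels, in_channels // groups, kernel_height, kernel_width]): convolution weights, split into groups of size (in_channels // groups). The actual kernel parameters. Keep in mind that deformable convolution is still a convolution; only the sampling points differ. In addition, v2 applies a spatial modulation to every convolution step (it can be thought of as spatial attention).
bias (Tensor[out_channels]): optional bias of shape (out_channels,). Default: None. The bias of the convolution.
stride (int or Tuple[int, int]): distance between convolution centers. Default: 1. The stride of the sliding window.
padding (int or Tuple[int, int]): height/width of padding of zeroes around each image. Default: 0. The amount of zero padding added around the input. Note that this padding is symmetric. To pad only one side, preprocess the input feature map directly with F.pad.
dilation (int or Tuple[int, int]): the spacing between kernel elements. Default: 1. The dilation rate of the convolution.
mask (Tensor[batch_size, offset_groups * kernel_height * kernel_width, out_height, out_width]): masks to be applied for each position in the convolution kernel. Default: None. A mask applied to the elements inside the window that actually take part in the computation; it can be understood as a local spatial attention. The offset_groups implied by mask must match the offset_groups implied by offset, otherwise an error is raised. It is therefore reasonable to infer that mask and offset correspond strictly to each other.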
Parameter experiments
Basic example
Let's look at a code example first:

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class DeformableConv2d(nn.Module):
    def __init__(
        self,
        in_dim,
        out_dim,
        kernel_size,
        stride=1,
        padding=0,
        dilation=1,
        groups=1,
        bias=True,
        *,
        offset_groups=1,
        with_mask=False
    ):
        super().__init__()
        assert in_dim % groups == 0
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        # NOTE: torch.empty only allocates memory; the values are arbitrary until initialized
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim // groups, kernel_size, kernel_size))
        if bias:
            self.bias = nn.Parameter(torch.empty(out_dim))
        else:
            self.bias = None

        self.with_mask = with_mask
        if with_mask:
            # batch_size, (2+1) * offset_groups * kernel_height * kernel_width, out_height, out_width
            self.param_generator = nn.Conv2d(in_dim, 3 * offset_groups * kernel_size * kernel_size, 3, 1, 1)
        else:
            self.param_generator = nn.Conv2d(in_dim, 2 * offset_groups * kernel_size * kernel_size, 3, 1, 1)

    def forward(self, x):
        if self.with_mask:
            oh, ow, mask = self.param_generator(x).chunk(3, dim=1)
            offset = torch.cat([oh, ow], dim=1)
            mask = mask.sigmoid()
        else:
            offset = self.param_generator(x)
            mask = None
        x = deform_conv2d(
            x,
            offset=offset,
            weight=self.weight,
            bias=self.bias,
            stride=self.stride,
            padding=self.padding,
            dilation=self.dilation,
            mask=mask,
        )
        return x


if __name__ == "__main__":
    deformable_conv2d = DeformableConv2d(in_dim=3, out_dim=4, kernel_size=1, offset_groups=3, with_mask=False)
    print(deformable_conv2d(torch.randn(1, 3, 5, 7)).shape)

    deformable_conv2d = DeformableConv2d(in_dim=3, out_dim=6, kernel_size=1, groups=3, offset_groups=3, with_mask=True)
    print(deformable_conv2d(torch.randn(1, 3, 5, 7)).shape)

"""
torch.Size([1, 4, 5, 7])
torch.Size([1, 6, 5, 7])
"""

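Here we build a flexible deformable convolution module on top of the function provided by torchvision. It supports both the v1 and v2 settings. No parameters are initialized manually: param_generator keeps PyTorch's default initialization, while weight and bias are allocated with torch.empty and therefore hold arbitrary values until they are initialized, which is sufficient for the shape check above. The parameters can of course also be initialized manually, so that the module initially behaves like a plain, standard convolution.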
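The meaning of offset_groups
The example in this part involves some deliberate design choices, so for clarity the code is analyzed piece by piece.
First, define the input tensor.
For simplicity, the example uses an input of size 1x3x3x3. To make the effect of the offset sampling easy to analyze, the values are not randomly initialized; each element instead holds its own flat index.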

import torch
from torchvision.ops import deform_conv2d

h = w = 3
# batch_size, num_channels, in_height, in_width
# (with a 1x1 kernel and stride 1, the output height and width are the same)
x = torch.arange(h * w * 3, dtype=torch.float32).reshape(1, 3, h, w)

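Next, the offset values are constructed by hand. Their shape is (batch_size, 2 * offset_groups * kh * kw, out_height, out_width).
From the CycleMLP code we know that the offset in deform_conv2d is, for each step of the sliding window, the displacement of each sampling point relative to its original position. It can therefore be positive or negative: a positive value moves the sampling point in the positive direction of the axis, and a negative value moves it in the opposite direction.
To analyze the effect of offset_groups, we set it to 3, so offset contains three different groups of offsets. They are defined as [0,-1], [0,1], and [-1,0], which move the sampling point by one pixel to the left (negative W direction), to the right (positive W direction), and upward (negative H direction), respectively. The three groups map onto the three input channels: the kernel weights applied to the first input channel automatically use the offsets [0,-1], and the other channels likewise use the offsets of the group they belong to.
Every output position is produced by its own computation, and deformable convolution assigns each of them an independent set of offsets. This is what the last two dimensions of offset, out_height and out_width, represent.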
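To keep the computation simple, the same offsets are used everywhere. That is, while the kernel slides over the input to produce one output channel, the offsets used within a single input channel are identical at every position (and input channels in the same offset group also share them). The code below implements this spatial sharing with repeat.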

# to show the effect of offset more intuitively, only the case of kh=kw=1 is considered here
offset = torch.tensor(
    [
        # create our predefined offset with offset_groups = 3
        0, -1,  # sample the left pixel of the centroid pixel
        0, 1,  # sample the right pixel of the centroid pixel
        -1, 0,  # sample the top pixel of the centroid pixel
    ],  # here, we divide the input channels into offset_groups groups with different offsets.
    dtype=torch.float32,
).reshape(1, 2 * 3 * 1 * 1, 1, 1)
# here we use the same offset for each local neighborhood in the single channel
# so we repeat the offset to the whole space: batch_size, 2 * offset_groups * kh * kw, out_height, out_width
offset = offset.repeat(1, 1, h, w)

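To observe the effect of offset directly, the deformable convolution is given weights of a special form, so that the whole operation becomes equivalent to a spatial shift. The example also shows that offset_groups is unrelated to the number of output channels (i.e. the number of kernels): if the two were related, one would normally expect the number of output channels to be divisible by offset_groups. The number of kernels is therefore set to 5, giving a weight of shape (5, 3, 1, 1).
All 5 kernels consist only of 0s and 1s, so each one keeps only the original data of the selected input channels. The five kernels have the following effects: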

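[1, 0, 0] keeps only the 1st input channel
[0, 1, 0] keeps only the 2nd input channel
[1, 1, 0] adds up the values of the 1st and 2nd input channels
[0, 0, 1] keeps only the 3rd input channel
[0, 1, 0] keeps only the 2nd input channel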

weight = torch.tensor(
    [
        [1, 0, 0],  # only extract the first channel of the input tensor
        [0, 1, 0],  # only extract the second channel of the input tensor
        [1, 1, 0],  # add the first and the second channels of the input tensor
        [0, 0, 1],  # only extract the third channel of the input tensor
        [0, 1, 0],  # only extract the second channel of the input tensor
    ],
    dtype=torch.float32,
).reshape(5, 3, 1, 1)

Applying these parameters to the deformable convolution gives the following result:

deconv_shift = deform_conv2d(x, offset=offset, weight=weight)
print(deconv_shift)

"""
tensor([[
    [[ 0.,  0.,  1.],   # offset=(0, -1) the first channel of the input tensor
     [ 0.,  3.,  4.],   # output hw indices (1, 2) => (1, 2-1) => input indices (1, 1)
     [ 0.,  6.,  7.]],  # output hw indices (2, 1) => (2, 1-1) => input indices (2, 0)

    [[10., 11.,  0.],   # offset=(0, 1) the second channel of the input tensor
     [13., 14.,  0.],   # output hw indices (1, 1) => (1, 1+1) => input indices (1, 2)
     [16., 17.,  0.]],  # output hw indices (2, 0) => (2, 0+1) => input indices (2, 1)

    [[10., 11.,  1.],   # offset=[(0, -1), (0, 1)], accumulate the first and second channels after being sampled with an offset.
     [13., 17.,  4.],
     [16., 23.,  7.]],

    [[ 0.,  0.,  0.],   # offset=(-1, 0) the third channel of the input tensor
     [18., 19., 20.],   # output hw indices (1, 1) => (1-1, 1) => input indices (0, 1)
     [21., 22., 23.]],  # output hw indices (2, 2) => (2-1, 2) => input indices (1, 2)

    [[10., 11.,  0.],   # offset=(0, 1) the second channel of the input tensor
     [13., 14.,  0.],   # output hw indices (1, 1) => (1, 1+1) => input indices (1, 2)
     [16., 17.,  0.]]   # output hw indices (2, 0) => (2, 0+1) => input indices (2, 1)
]])
"""

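Because of the special weights used here, each of the five output channels has an obvious correspondence with the three input channels. The results show the following key points:
1. offset is indeed a displacement relative to the position of the sampling point, and its sign corresponds to the positive or negative direction of the respective axis. For example, in the 1st output channel, offset=(0, -1) is equivalent to shifting the 1st input channel one unit to the right as a whole; in other words, all sampling coordinates move one unit in the negative W direction. In the 2nd output channel, offset=(0, 1) is equivalent to shifting the 2nd input channel one unit to the left as a whole; the sampling coordinates move one unit in the positive W direction.
2. If a sampling point falls outside the boundary after the offset is applied, 0 is used as the sampled value.
3. offset_groups is tied to the number of input channels and is unrelated to the number of output channels. This is shown by contradiction: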

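If it were tied to the number of output channels, the operation could not run when the number of output channels (5 here) is not divisible by the number of offset groups (3 here).
The 3rd output channel shows this as well. It is the sum of the first two input channels after each has been shifted. If offset_groups were tied to the kernels, a single kernel would apply one and the same offset; instead, the two input channels are sampled with different offsets, which indirectly confirms that the offsets belong to the input channels.
The 5th output channel depends only on the 2nd input channel because of its kernel, and its offset matches that of the 2nd output channel, which also depends only on the 2nd input channel. This again shows that the offset is bound to the input channel.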

The complete code is as follows:

import torch
from torchvision.ops import deform_conv2d

h = w = 3
# batch_size, num_channels, out_height, out_width
x = torch.arange(h * w * 3, dtype=torch.float32).reshape(1, 3, h, w)

# to show the effect of offset more intuitively, only the case of kh=kw=1 is considered here
offset = torch.tensor(
    [
        # create our predefined offset with offset_groups = 3
        0, -1,  # sample the left pixel of the centroid pixel
        0, 1,  # sample the right pixel of the centroid pixel
        -1, 0,  # sample the top pixel of the centroid pixel
    ],  # here, we divide the input channels into offset_groups groups with different offsets.
    dtype=torch.float32,
).reshape(1, 2 * 3 * 1 * 1, 1, 1)
# here we use the same offset for each local neighborhood in the single channel
# so we repeat the offset to the whole space: batch_size, 2 * offset_groups * kh * kw, out_height, out_width
offset = offset.repeat(1, 1, h, w)

weight = torch.tensor(
    [
        [1, 0, 0],  # only extract the first channel of the input tensor
        [0, 1, 0],  # only extract the second channel of the input tensor
        [1, 1, 0],  # add the first and the second channels of the input tensor
        [0, 0, 1],  # only extract the third channel of the input tensor
        [0, 1, 0],  # only extract the second channel of the input tensor
    ],
    dtype=torch.float32,
).reshape(5, 3, 1, 1)

deconv_shift = deform_conv2d(x, offset=offset, weight=weight)
print(deconv_shift)

"""
tensor([[
    [[ 0.,  0.,  1.],   # offset=(0, -1) the first channel of the input tensor
     [ 0.,  3.,  4.],   # output hw indices (1, 2) => (1, 2-1) => input indices (1, 1)
     [ 0.,  6.,  7.]],  # output hw indices (2, 1) => (2, 1-1) => input indices (2, 0)

    [[10., 11.,  0.],   # offset=(0, 1) the second channel of the input tensor
     [13., 14.,  0.],   # output hw indices (1, 1) => (1, 1+1) => input indices (1, 2)
     [16., 17.,  0.]],  # output hw indices (2, 0) => (2, 0+1) => input indices (2, 1)

    [[10., 11.,  1.],   # offset=[(0, -1), (0, 1)], accumulate the first and second channels after being sampled with an offset.
     [13., 17.,  4.],
     [16., 23.,  7.]],

    [[ 0.,  0.,  0.],   # offset=(-1, 0) the third channel of the input tensor
     [18., 19., 20.],   # output hw indices (1, 1) => (1-1, 1) => input indices (0, 1)
     [21., 22., 23.]],  # output hw indices (2, 2) => (2-1, 2) => input indices (1, 2)

    [[10., 11.,  0.],   # offset=(0, 1) the second channel of the input tensor
     [13., 14.,  0.],   # output hw indices (1, 1) => (1, 1+1) => input indices (1, 2)
     [16., 17.,  0.]]   # output hw indices (2, 0) => (2, 0+1) => input indices (2, 1)
]])
"""