Dual Attention Network for Scene Segmentation.

DANet introduces Dual Attention, which applies spatial attention and channel attention simultaneously. The Position Attention module captures the contextual dependency between any two positions in the feature map, while the Channel Attention module captures contextual dependencies along the channel dimension; both are implemented with self-attention. A PyTorch implementation of the decoder head and the full network follows (the two attention modules are defined in the sections below).

import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models._utils import IntermediateLayerGetter


class DAHead(nn.Module):
    def __init__(self, in_channels, num_classes):
        super(DAHead, self).__init__()
        # Two parallel 3x3 conv branches reduce channels before each attention module.
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, in_channels//4, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels//4),
            nn.ReLU(),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels, in_channels//4, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels//4),
            nn.ReLU(),
        )
        
        # Fuse the summed outputs of the two attention branches.
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_channels//4, in_channels//4, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels//4),
            nn.ReLU(),
        )

        # Project the fused features to per-class logits.
        self.conv4 = nn.Sequential(
            nn.Conv2d(in_channels//4, in_channels//8, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels//8),
            nn.ReLU(),
            nn.Conv2d(in_channels//8, num_classes, kernel_size=3, padding=1, bias=False),
        )
 
        self.PositionAttention = PositionAttention(in_channels//4)
        self.ChannelAttention = ChannelAttention()
        
    def forward(self, x):
        x_PA = self.conv1(x)
        x_CA = self.conv2(x)
        PositionAttentionMap = self.PositionAttention(x_PA)
        ChannelAttentionMap = self.ChannelAttention(x_CA)
        output = self.conv3(PositionAttentionMap + ChannelAttentionMap)
        # The dilated backbone has an output stride of 8, so upsample by 8 to
        # restore the input resolution before the final classifier convolutions.
        output = nn.functional.interpolate(output, scale_factor=8, mode="bilinear", align_corners=True)
        output = self.conv4(output)
        return output


class DAnet(nn.Module):
    def __init__(self, num_classes):
        super(DAnet, self).__init__()
        # Dilated ResNet-50 backbone (output stride 8); expose layer4 as 'stage4'.
        self.ResNet50 = IntermediateLayerGetter(
            resnet50(pretrained=False, replace_stride_with_dilation=[False, True, True]),
            return_layers={'layer4': 'stage4'}
        )
        self.decoder = DAHead(in_channels=2048, num_classes=num_classes)
        
    def forward(self, x):
        # self.ResNet50 returns a dict mapping the requested layer names to features.
        feats = self.ResNet50(x)
        x = self.decoder(feats["stage4"])
        return x
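
A quick shape check of the full model, as a minimal sketch (the batch size and the 224×224 input are arbitrary choices; with an output stride of 8 and ×8 upsampling, the logits match the input resolution):

model = DAnet(num_classes=19)
dummy = torch.randn(2, 3, 224, 224)
out = model(dummy)
print(out.shape)  # torch.Size([2, 19, 224, 224])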

⚪ Position attention module

To implement spatial attention, the feature map A ($C×H×W$) is first passed through convolution layers to generate B ($C×H×W$) and C ($C×H×W$), which are reshaped to $C×N$, where $N=H×W$ is the number of pixels. B is then transposed and multiplied with C, and the result is fed through a softmax to produce the spatial attention map S ($N×N$). The matrix multiplication relates every pixel to every other pixel, computing the similarity $s_{ji}$ between any two positions; the more similar two positions are, the larger this value.

\[s_{j i}=\frac{\exp \left(B_i \cdot C_j\right)}{\sum_{i=1}^N \exp \left(B_i \cdot C_j\right)}\]

Similarly, A is fed into another convolution layer to generate a new feature map D ($C×H×W$), which is reshaped to $C×N$ and multiplied with the transpose of the attention map S, yielding a $C×N$ matrix that is then reshaped back to the original $C×H×W$ size. This result is scaled by a coefficient $α$ and added to the original feature map A, which completes the spatial self-attention mechanism. Note that $α$ is a learnable parameter initialized to $0$.

\[E_j=\alpha \sum_{i=1}^N\left(s_{j i} D_i\right)+A_j\]
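
Equivalently, the two equations can be written compactly in einsum notation. This is only a sketch with arbitrary sizes (batch 2, 64 channels, $N=1024$ positions), not the module implementation below:

B = torch.randn(2, 64, 1024)
C = torch.randn(2, 64, 1024)
D = torch.randn(2, 64, 1024)
energy = torch.einsum('bcn,bcm->bnm', B, C)   # pairwise similarities B_n · C_m
S = energy.softmax(dim=-1)                    # normalize over i (the last axis)
E = torch.einsum('bcm,bnm->bcn', D, S)        # E_j = sum_i s_{ji} D_i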

class PositionAttention(nn.Module):
    def __init__(self, in_channels):
        super(PositionAttention, self).__init__()
        # 1x1 convolutions produce the query (B), key (C), and value (D) projections.
        self.convB = nn.Conv2d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
        self.convC = nn.Conv2d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
        self.convD = nn.Conv2d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
        # Learnable scalar weight (alpha in the paper), initialized to 0.
        self.gamma = torch.nn.Parameter(torch.zeros(1))
        self.softmax = nn.Softmax(dim=2)
        
    def forward(self, x):
        b, c, h, w = x.size()
        B = self.convB(x)
        C = self.convC(x)
        D = self.convD(x)
        # S: (b, N, N) attention map over all pairs of pixel positions, N = h*w.
        S = self.softmax(torch.matmul(B.view(b, c, h*w).transpose(1, 2), C.view(b, c, h*w)))
        # Weight the value features D by S and reshape back to (b, c, h, w).
        E = torch.matmul(D.view(b, c, h*w), S.transpose(1, 2)).view(b, c, h, w)
        # gamma is learnable; at initialization (gamma = 0) the module is an identity.
        E = self.gamma * E + x
        return E
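
A minimal sanity check (a sketch; the tensor sizes are arbitrary). Because gamma starts at 0, the module acts as an identity mapping right after initialization:

pa = PositionAttention(in_channels=64)
x = torch.randn(2, 64, 32, 32)
y = pa(x)
print(y.shape)               # torch.Size([2, 64, 32, 32])
print(torch.allclose(y, x))  # True, since gamma == 0 at init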

⚪ Channel attention module

The Channel Attention module is implemented much like Position Attention; the main difference is that no convolution layers are used to embed the features before computing the channel attention. The authors' explanation is that this preserves the relationships between the original channels.

The feature map A ($C×H×W$) is reshaped into a $C×N$ matrix; through a transpose, a matrix multiplication, and a softmax, the channel attention map X ($C×C$) is obtained.

\[x_{j i}=\frac{\exp \left(A_i \cdot A_j\right)}{\sum_{i=1}^C \exp \left(A_i \cdot A_j\right)}\]

This attention map X is then matrix-multiplied with A reshaped to $C×N$; the resulting $C×N$ output is reshaped back to $C×H×W$ and combined with the original feature map A in a weighted sum. $β$ is a learnable parameter initialized to $0$.

\[E_j=\beta \sum_{i=1}^C\left(x_{j i} A_i\right)+A_j\]

class ChannelAttention(nn.Module):
    def __init__(self):
        super(ChannelAttention, self).__init__()
        # Learnable scalar weight (beta in the paper), initialized to 0.
        self.beta = torch.nn.Parameter(torch.zeros(1))
        self.softmax = nn.Softmax(dim=2)

    def forward(self, x):
        b, c, h, w = x.size()
        # X: (b, c, c) channel attention map computed directly on x, with no
        # convolutional embedding, to preserve the original channel relations.
        X = self.softmax(torch.matmul(x.view(b, c, h*w), x.view(b, c, h*w).transpose(1, 2)))
        # Re-weight the channels of x with X and reshape back to (b, c, h, w).
        X = torch.matmul(X.transpose(1, 2), x.view(b, c, h*w)).view(b, c, h, w)
        # beta is learnable; at initialization (beta = 0) the module is an identity.
        X = self.beta * X + x
        return X