Dual Attention Network for Scene Segmentation.

DANet introduces Dual Attention, which applies spatial attention and channel attention simultaneously. The Position Attention module captures the contextual dependency between any two positions in the feature map, while the Channel Attention module captures contextual dependencies along the channel dimension; both are implemented with self-attention. A PyTorch implementation of the decoder head and the full network follows (the two attention modules are defined in the sections below).

import torch
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.models._utils import IntermediateLayerGetter


class DAHead(nn.Module):
    def __init__(self, in_channels, num_classes):
        super(DAHead, self).__init__()
        # Two parallel 3x3 conv branches reduce channels before each attention module.
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, in_channels//4, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels//4),
            nn.ReLU(),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels, in_channels//4, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels//4),
            nn.ReLU(),
        )
        
        # Fuse the summed outputs of the two attention branches.
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_channels//4, in_channels//4, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels//4),
            nn.ReLU(),
        )

        # Project the fused features to per-class logits.
        self.conv4 = nn.Sequential(
            nn.Conv2d(in_channels//4, in_channels//8, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels//8),
            nn.ReLU(),
            nn.Conv2d(in_channels//8, num_classes, kernel_size=3, padding=1, bias=False),
        )
 
        self.PositionAttention = PositionAttention(in_channels//4)
        self.ChannelAttention = ChannelAttention()
        
    def forward(self, x):
        x_PA = self.conv1(x)
        x_CA = self.conv2(x)
        PositionAttentionMap = self.PositionAttention(x_PA)
        ChannelAttentionMap = self.ChannelAttention(x_CA)
        output = self.conv3(PositionAttentionMap + ChannelAttentionMap)
        # The dilated backbone has an output stride of 8, so upsample by 8 to
        # restore the input resolution before the final classifier convolutions.
        output = nn.functional.interpolate(output, scale_factor=8, mode="bilinear", align_corners=True)
        output = self.conv4(output)
        return output


class DAnet(nn.Module):
    def __init__(self, num_classes):
        super(DAnet, self).__init__()
        # Dilated ResNet-50 backbone (output stride 8); expose layer4 as 'stage4'.
        self.ResNet50 = IntermediateLayerGetter(
            resnet50(pretrained=False, replace_stride_with_dilation=[False, True, True]),
            return_layers={'layer4': 'stage4'}
        )
        self.decoder = DAHead(in_channels=2048, num_classes=num_classes)
        
    def forward(self, x):
        # self.ResNet50 returns a dict mapping the requested layer names to features.
        feats = self.ResNet50(x)
        x = self.decoder(feats["stage4"])
        return x
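
A quick shape check of the full model, as a minimal sketch (the batch size and the 224×224 input are arbitrary choices; with an output stride of 8 and ×8 upsampling, the logits match the input resolution):

model = DAnet(num_classes=19)
dummy = torch.randn(2, 3, 224, 224)
out = model(dummy)
print(out.shape)  # torch.Size([2, 19, 224, 224])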

⚪ Position attention module

To implement spatial attention, the feature map A ($C×H×W$) is first passed through convolution layers to generate B ($C×H×W$) and C ($C×H×W$), which are reshaped to $C×N$, where $N=H×W$ is the number of pixels. B is then transposed and multiplied with C, and the result is fed through a softmax to produce the spatial attention map S ($N×N$). The matrix multiplication relates every pixel to every other pixel, computing the similarity $s_{ji}$ between any two positions; the more similar two positions are, the larger this value.

\[s_{j i}=\frac{\exp \left(B_i \cdot C_j\right)}{\sum_{i=1}^N \exp \left(B_i \cdot C_j\right)}\]

Similarly, A is fed into another convolution layer to generate a new feature map D ($C×H×W$), which is reshaped to $C×N$ and multiplied with the transpose of the attention map S, yielding a $C×N$ matrix that is then reshaped back to the original $C×H×W$ size. This result is scaled by a coefficient $α$ and added to the original feature map A, which completes the spatial self-attention mechanism. Note that $α$ is a learnable parameter initialized to $0$.

\[E_j=\alpha \sum_{i=1}^N\left(s_{j i} D_i\right)+A_j\]
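
Equivalently, the two equations can be written compactly in einsum notation. This is only a sketch with arbitrary sizes (batch 2, 64 channels, $N=1024$ positions), not the module implementation below:

B = torch.randn(2, 64, 1024)
C = torch.randn(2, 64, 1024)
D = torch.randn(2, 64, 1024)
energy = torch.einsum('bcn,bcm->bnm', B, C)   # pairwise similarities B_n · C_m
S = energy.softmax(dim=-1)                    # normalize over i (the last axis)
E = torch.einsum('bcm,bnm->bcn', D, S)        # E_j = sum_i s_{ji} D_i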

class PositionAttention(nn.Module):
    def __init__(self, in_channels):
        super(PositionAttention, self).__init__()
        # 1x1 convolutions produce the query (B), key (C), and value (D) projections.
        self.convB = nn.Conv2d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
        self.convC = nn.Conv2d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
        self.convD = nn.Conv2d(in_channels, in_channels, kernel_size=1, padding=0, bias=False)
        # Learnable scalar weight (alpha in the paper), initialized to 0.
        self.gamma = torch.nn.Parameter(torch.zeros(1))
        self.softmax = nn.Softmax(dim=2)
        
    def forward(self, x):
        b, c, h, w = x.size()
        B = self.convB(x)
        C = self.convC(x)
        D = self.convD(x)
        # S: (b, N, N) attention map over all pairs of pixel positions, N = h*w.
        S = self.softmax(torch.matmul(B.view(b, c, h*w).transpose(1, 2), C.view(b, c, h*w)))
        # Weight the value features D by S and reshape back to (b, c, h, w).
        E = torch.matmul(D.view(b, c, h*w), S.transpose(1, 2)).view(b, c, h, w)
        # gamma is learnable; at initialization (gamma = 0) the module is an identity.
        E = self.gamma * E + x
        return E
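
A minimal sanity check (a sketch; the tensor sizes are arbitrary). Because gamma starts at 0, the module acts as an identity mapping right after initialization:

pa = PositionAttention(in_channels=64)
x = torch.randn(2, 64, 32, 32)
y = pa(x)
print(y.shape)               # torch.Size([2, 64, 32, 32])
print(torch.allclose(y, x))  # True, since gamma == 0 at init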

⚪ Channel attention module

The Channel Attention module is implemented much like Position Attention; the main difference is that no convolution layers are used to embed the features before computing the channel attention. The authors' explanation is that this preserves the relationships between the original channels.

The feature map A ($C×H×W$) is reshaped into a $C×N$ matrix; through a transpose, a matrix multiplication, and a softmax, the channel attention map X ($C×C$) is obtained.

\[x_{j i}=\frac{\exp \left(A_i \cdot A_j\right)}{\sum_{i=1}^C \exp \left(A_i \cdot A_j\right)}\]

This attention map X is then matrix-multiplied with A reshaped to $C×N$; the resulting $C×N$ output is reshaped back to $C×H×W$ and combined with the original feature map A in a weighted sum. $β$ is a learnable parameter initialized to $0$.

\[E_j=\beta \sum_{i=1}^C\left(x_{j i} A_i\right)+A_j\]

class ChannelAttention(nn.Module):
    def __init__(self):
        super(ChannelAttention, self).__init__()
        # Learnable scalar weight (beta in the paper), initialized to 0.
        self.beta = torch.nn.Parameter(torch.zeros(1))
        self.softmax = nn.Softmax(dim=2)

    def forward(self, x):
        b, c, h, w = x.size()
        # X: (b, c, c) channel attention map computed directly on x, with no
        # convolutional embedding, to preserve the original channel relations.
        X = self.softmax(torch.matmul(x.view(b, c, h*w), x.view(b, c, h*w).transpose(1, 2)))
        # Re-weight the channels of x with X and reshape back to (b, c, h, w).
        X = torch.matmul(X.transpose(1, 2), x.view(b, c, h*w)).view(b, c, h, w)
        # beta is learnable; at initialization (beta = 0) the module is an identity.
        X = self.beta * X + x
        return X