Multi-Context Attention for Human Pose Estimation

This paper designs a multi-context attention mechanism for human pose estimation. Stacked hourglass networks first generate attention maps from features at different resolutions, where features at different resolutions carry different semantics. A CRF (Conditional Random Field) then models the correlations among neighboring regions of the attention maps. The method further combines a holistic attention model with body-part attention models: the holistic model targets the global consistency of the full human body, while the part models capture detailed descriptions of individual body parts, so the network can handle content at different granularities, from local salient regions to the global semantic space. In addition, novel Hourglass Residual Units (HRUs) are designed to enlarge the receptive field of the network and learn features at different scales.
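
The CRF-based refinement can be pictured as a few steps of message passing between neighboring positions of an attention score map before the map is used to re-weight the features. Below is a minimal, hypothetical sketch of that idea in PyTorch; the class name SpatialCRFAttention, the shared 3x3 message kernel, the number of steps, and the sigmoid normalization are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn as nn

class SpatialCRFAttention(nn.Module):
	# Illustrative sketch, not the paper's implementation: a unary score map is
	# produced from the features, then refined by a few mean-field-style steps,
	# each realized as a shared 3x3 convolution over the current soft attention.
	def __init__(self, inChannels, steps = 3):
		super(SpatialCRFAttention, self).__init__()
		self.unary = nn.Conv2d(inChannels, 1, 3, 1, 1)
		self.message = nn.Conv2d(1, 1, 3, 1, 1, bias = False)
		self.steps = steps

	def forward(self, features):
		u = self.unary(features)
		s = u
		for _ in range(self.steps):
			# neighboring positions exchange messages through the shared kernel
			s = u + self.message(torch.sigmoid(s))
		att = torch.sigmoid(s)        # refined attention map in [0, 1]
		return features * att         # spatially re-weight the features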

1. Nested Hourglass Network

An 8-stack hourglass network serves as the base network, with Hourglass Residual Units (HRUs) replacing the original residual units:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Pre-activation unit: BN -> ReLU -> Conv
class BnReluConv(nn.Module):
	def __init__(self, inChannels, outChannels, kernelSize = 1, stride = 1, padding = 0):
		super(BnReluConv, self).__init__()
		self.bn = nn.BatchNorm2d(inChannels)
		self.conv = nn.Conv2d(inChannels, outChannels, kernelSize, stride, padding)
		self.relu = nn.ReLU()

	def forward(self, x):
		x = self.bn(x)
		x = self.relu(x)
		x = self.conv(x)
		return x

# BN -> ReLU -> 2x2 max pool -> Conv: the downsampling step of the HRU pooled branch
class BnReluPoolConv(nn.Module):
	def __init__(self, inChannels, outChannels, kernelSize = 1, stride = 1, padding = 0):
		super(BnReluPoolConv, self).__init__()
		self.bn = nn.BatchNorm2d(inChannels)
		self.conv = nn.Conv2d(inChannels, outChannels, kernelSize, stride, padding)
		self.relu = nn.ReLU()

	def forward(self, x):
		x = self.bn(x)
		x = self.relu(x)
		x = F.max_pool2d(x, kernel_size=2, stride=2)
		x = self.conv(x)
		return x

# Bottleneck conv branch of the HRU (1x1 -> 3x3 -> 1x1), kept at full resolution
class ConvBlock(nn.Module):
	def __init__(self, inChannels, outChannels):
		super(ConvBlock, self).__init__()
		self.brc1 = BnReluConv(inChannels, outChannels//2, 1, 1, 0)
		self.brc2 = BnReluConv(outChannels//2, outChannels//2, 3, 1, 1)
		self.brc3 = BnReluConv(outChannels//2, outChannels, 1, 1, 0)

	def forward(self, x):
		x = self.brc1(x)
		x = self.brc2(x)
		x = self.brc3(x)
		return x

# Pooled branch of the HRU: downsample, convolve, then upsample back
class PoolConvBlock(nn.Module):
	def __init__(self, inChannels, outChannels):
		super(PoolConvBlock, self).__init__()
		self.brpc = BnReluPoolConv(inChannels, outChannels, 3, 1, 1)
		self.brc = BnReluConv(outChannels, outChannels, 3, 1, 1)

	def forward(self, x):
		x = self.brpc(x)
		x = self.brc(x)
		x = F.interpolate(x, scale_factor=2)  # restore the resolution lost by the max pool
		return x

# Identity branch; a 1x1 conv is used only when the channel counts differ
class SkipLayer(nn.Module):
	def __init__(self, inChannels, outChannels):
		super(SkipLayer, self).__init__()
		if (inChannels == outChannels):
			self.conv = None
		else:
			self.conv = nn.Conv2d(inChannels, outChannels, 1)

	def forward(self, x):
		if self.conv is not None:
			x = self.conv(x)
		return x

# Hourglass Residual Unit: sum of the conv branch, the pooled branch, and the skip branch
class HourGlassResidual(nn.Module):
	def __init__(self, inChannels, outChannels):
		super(HourGlassResidual, self).__init__()
		self.cb = ConvBlock(inChannels, outChannels)
		self.pcb = PoolConvBlock(inChannels, outChannels)
		self.skip = SkipLayer(inChannels, outChannels)

	def forward(self, x):
		out = self.cb(x)
		out = out + self.pcb(x)
		out = out + self.skip(x)
		return out
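
A quick sanity check of the HRU defined above (the input shape is an arbitrary example): the unit preserves spatial resolution while changing the channel count, so it can replace a standard residual unit directly. Note that the pooled branch assumes an even spatial size so the 2x down/up round trip restores the original resolution.

hru = HourGlassResidual(128, 256)
x = torch.randn(1, 128, 64, 64)
print(hru(x).shape)  # torch.Size([1, 256, 64, 64])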

2. Hierarchical Attention Mechanism

Different stacks carry different semantics: lower stacks correspond to local features, while higher stacks correspond to global features. The attention maps generated by different stacks therefore encode different semantics.

The lower stacks (stack1 - stack4) use Multi-Resolution Attention to encode the whole human body.
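
A minimal sketch of what multi-resolution attention could look like, assuming a list of feature maps taken from different resolutions inside an hourglass (finest first). The class name MultiResolutionAttention, the summation of upsampled score maps, and the sigmoid normalization are assumptions for illustration, not the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionAttention(nn.Module):
	# Illustrative sketch: one score map per feature resolution, upsampled to the
	# finest resolution and summed before normalization; the finest features are
	# then re-weighted by the resulting whole-body attention map.
	def __init__(self, channelsPerScale):
		super(MultiResolutionAttention, self).__init__()
		self.scores = nn.ModuleList([nn.Conv2d(c, 1, 3, 1, 1) for c in channelsPerScale])

	def forward(self, featuresPerScale):
		# featuresPerScale: list of feature maps, finest resolution first
		target = featuresPerScale[0].shape[2:]
		summed = sum(F.interpolate(conv(f), size = target, mode = 'bilinear', align_corners = False)
					 for conv, f in zip(self.scores, featuresPerScale))
		att = torch.sigmoid(summed)          # holistic attention over the whole body
		return featuresPerScale[0] * att     # refine the finest-resolution features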

The higher stacks (stack5 - stack8) use a hierarchical coarse-to-fine attention mechanism to zoom in on local joint regions.
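
A hypothetical sketch of the coarse-to-fine idea: a coarse whole-body attention map first gates the features, then one finer attention map per joint gates the per-joint heatmaps. The class name CoarseToFinePartAttention, the single-conv score heads, and the sigmoid gating are illustrative assumptions; the paper's actual conditioning between stacks is more involved.

import torch
import torch.nn as nn

class CoarseToFinePartAttention(nn.Module):
	# Illustrative sketch: holistic gating followed by per-joint (part) gating.
	def __init__(self, inChannels, numJoints):
		super(CoarseToFinePartAttention, self).__init__()
		self.coarse = nn.Conv2d(inChannels, 1, 3, 1, 1)
		self.fine = nn.Conv2d(inChannels, numJoints, 3, 1, 1)
		self.heatmaps = nn.Conv2d(inChannels, numJoints, 1)

	def forward(self, features):
		coarseAtt = torch.sigmoid(self.coarse(features))   # coarse, whole-body attention
		refined = features * coarseAtt
		fineAtt = torch.sigmoid(self.fine(refined))         # one finer map per joint
		return self.heatmaps(refined) * fineAtt              # zoomed-in joint heatmaps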