Learning Feature Pyramids for Human Pose Estimation

Human pose estimation is a challenging computer vision task, and one of its main difficulties is the variation in human body scale caused by factors such as camera viewpoint and distance. Pyramid models currently handle scale variation well. This paper designs a Pyramid Residual Module (PRM) to deal with scale variation: given input features, the PRM learns convolutional filters on multiple scales of those features, which are obtained with different downsampling ratios in a multi-branch network.

The overall framework adopts stacked hourglass networks, a highly modular architecture. An hourglass is designed to capture information at every scale: it first downsamples the features in a bottom-up pass, then performs top-down processing by upsampling the feature maps while merging high-resolution features from the lower layers. This bottom-up, top-down processing is repeated several times to build a "stacked hourglass" network, with intermediate supervision applied at the end of each stack.
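To make the bottom-up, top-down recursion concrete, here is a minimal sketch of a single hourglass. It is a simplification under stated assumptions, not the paper's implementation: plain 3x3 convolutions stand in for the residual units, and the depth and channel counts are illustrative.

import torch
import torch.nn as nn

class MiniHourglass(nn.Module):
	# One recursion level: keep a high-resolution skip branch, downsample
	# (bottom-up), recurse, then upsample (top-down) and merge the skip.
	def __init__(self, depth, channels):
		super(MiniHourglass, self).__init__()
		self.skip = nn.Conv2d(channels, channels, 3, 1, 1)
		self.pool = nn.MaxPool2d(2)
		self.low = (MiniHourglass(depth - 1, channels) if depth > 1
					else nn.Conv2d(channels, channels, 3, 1, 1))
		self.up = nn.Upsample(scale_factor=2)

	def forward(self, x):
		return self.skip(x) + self.up(self.low(self.pool(x)))

hg = MiniHourglass(depth=4, channels=64)  # input sides must be divisible by 2**4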

However, the residual units in the hourglass network can only capture visual patterns or semantics at a single scale. The pyramid residual module proposed by the authors captures visual patterns and semantics at multiple scales. Its structure is implemented in the code below; in the paper's schematic, the dashed line denotes the identity mapping.

Conventional pooling layers are widely used, but they reduce resolution too quickly and pool over regions too coarsely. This paper instead uses fractional max-pooling, which randomly partitions the input into uneven sub-regions, one per output element, and applies max-pooling within each sub-region. The downsampling ratio of the $c$-th pyramid level is $s_c=2^{-M\frac{c}{C}}$, where $c=0,\ldots,C$ and $M\geq 1$, so $s_c$ ranges over $[2^{-M},1]$. When $c=0$, the ratio is $1$ and the feature map keeps its original size. In the experiments the authors set $M=1$ and $C=4$ (i.e., a five-level pyramid), so the smallest level is half the original input resolution.
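As a quick numeric check of the ratio formula under the paper's setting ($M=1$, $C=4$):

M, C = 1, 4
scales = [2 ** (-M * c / C) for c in range(C + 1)]
print(scales)
# [1.0, 0.8409..., 0.7071..., 0.5946..., 0.5]: level 0 keeps the input size,
# level C halves each side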

import torch
import torch.nn as nn

class BnReluConv(nn.Module):
	# Pre-activation unit: BatchNorm -> ReLU -> Conv.
	def __init__(self, inChannels, outChannels, kernelSize=1, stride=1, padding=0):
		super(BnReluConv, self).__init__()
		self.bn = nn.BatchNorm2d(inChannels)
		self.relu = nn.ReLU()
		self.conv = nn.Conv2d(inChannels, outChannels, kernelSize, stride, padding)

	def forward(self, x):
		x = self.bn(x)
		x = self.relu(x)
		x = self.conv(x)
		return x

class Pyramid(nn.Module):
	# Multi-scale branches: each branch shrinks the input by fractional
	# max-pooling at ratio 2**(-(c + 1) / C) (the s_c formula with M = 1),
	# applies a 3x3 conv, and upsamples back to the input resolution.
	def __init__(self, D, cardinality, inputRes):
		super(Pyramid, self).__init__()
		self.cardinality = cardinality
		scale = 2 ** (-1 / self.cardinality)
		_scales = []
		for card in range(self.cardinality):
			temp = nn.Sequential(
					nn.FractionalMaxPool2d(2, output_ratio=scale ** (card + 1)),
					nn.Conv2d(D, D, 3, 1, 1),
					nn.Upsample(size=inputRes)  # nearest by default; mode='bilinear' also works
				)
			_scales.append(temp)
		self.scales = nn.ModuleList(_scales)

	def forward(self, x):
		# sum the responses of all pyramid levels
		out = torch.zeros_like(x)
		for card in range(self.cardinality):
			out += self.scales[card](x)
		return out
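A quick shape check of Pyramid (the sizes here are illustrative, not from the paper): since every branch is upsampled back to inputRes, the summed output has the same shape as the input.

pyra = Pyramid(D=16, cardinality=4, inputRes=64)
x = torch.randn(1, 16, 64, 64)
print(pyra(x).shape)  # torch.Size([1, 16, 64, 64])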

class BnReluPyra(nn.Module):
	# Same pre-activation pattern as BnReluConv, with the pyramid in place
	# of the convolution.
	def __init__(self, D, cardinality, inputRes):
		super(BnReluPyra, self).__init__()
		self.bn = nn.BatchNorm2d(D)
		self.relu = nn.ReLU()
		self.pyra = Pyramid(D, cardinality, inputRes)

	def forward(self, x):
		x = self.bn(x)
		x = self.relu(x)
		x = self.pyra(x)
		return x

class PyraConvBlock(nn.Module):
	# PRM body: branch1 is the usual bottleneck path at the original scale,
	# branch2 routes a slimmer feature map through the multi-scale pyramid;
	# their sum is projected to outChannels. `type` is unused here and kept
	# only to preserve the original interface.
	def __init__(self, inChannels, outChannels, inputRes, baseWidth, cardinality, type=1):
		super(PyraConvBlock, self).__init__()
		self.branch1 = nn.Sequential(
				BnReluConv(inChannels, outChannels // 2, 1, 1, 0),
				BnReluConv(outChannels // 2, outChannels // 2, 3, 1, 1)
			)
		self.branch2 = nn.Sequential(
				BnReluConv(inChannels, outChannels // baseWidth, 1, 1, 0),
				BnReluPyra(outChannels // baseWidth, cardinality, inputRes),
				BnReluConv(outChannels // baseWidth, outChannels // 2, 1, 1, 0)
			)
		self.afteradd = BnReluConv(outChannels // 2, outChannels, 1, 1, 0)

	def forward(self, x):
		x = self.branch2(x) + self.branch1(x)
		x = self.afteradd(x)
		return x

class SkipLayer(nn.Module):
	# Identity when the channel counts already match, otherwise a 1x1 conv.
	def __init__(self, inChannels, outChannels):
		super(SkipLayer, self).__init__()
		if inChannels == outChannels:
			self.conv = None
		else:
			self.conv = nn.Conv2d(inChannels, outChannels, 1)

	def forward(self, x):
		if self.conv is not None:
			x = self.conv(x)
		return x

class ResidualPyramid(nn.Module):
	# Full PRM: the pyramid conv block plus the skip connection (the
	# identity mapping drawn as a dashed line in the paper's figure).
	def __init__(self, inChannels, outChannels, inputRes, baseWidth, cardinality, type=1):
		super(ResidualPyramid, self).__init__()
		self.cb = PyraConvBlock(inChannels, outChannels, inputRes, baseWidth, cardinality, type)
		self.skip = SkipLayer(inChannels, outChannels)

	def forward(self, x):
		out = self.cb(x)
		out = out + self.skip(x)
		return out
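Finally, an end-to-end smoke test of the full block; the channel counts, resolution, baseWidth, and cardinality below are illustrative assumptions rather than the paper's training configuration.

block = ResidualPyramid(inChannels=64, outChannels=128, inputRes=64, baseWidth=4, cardinality=4)
x = torch.randn(2, 64, 64, 64)
print(block(x).shape)  # torch.Size([2, 128, 64, 64])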