PointCNN: Convolution on X-Transformed Points.

The convolution operator in convolutional neural networks exploits spatially-local correlation in images, but point clouds are irregular and unordered, so directly convolving a kernel with the point features leads to loss of shape information and to outputs that vary with the point order. Assume an unordered set of $C$-dimensional input features \(\mathbb{F}=\left\{ {f_a},{f_b},{f_c},{f_d} \right\}\) and a convolution kernel of size $4 \times C$, \(\mathbf{K}=\left[k_{\alpha}, k_{\beta}, k_{\gamma}, k_{\delta}\right]^T\).

The order of these points is arbitrary. Following the orderings in the figure above, the input feature set \(\mathbb{F}\) can be written as \(\left[f_{a}, f_{b}, f_{c}, f_{d}\right]^{T}\) in cases $(ii)$ and $(iii)$, and as \(\left[f_{c}, f_{a}, f_{b}, f_{d}\right]^{T}\) in case $(iv)$. If convolution is applied directly, the output features of the three cases are given below. Then \(f_{i i} \equiv f_{i i i}\) holds in all cases, while \(f_{i i i} \neq f_{i v}\) holds in most cases. This shows that direct convolution loses shape information \((f_{i i} \equiv f_{i i i})\) and is variant to the point order \((f_{i i i} \neq f_{i v})\).

\[\begin{aligned} f_{i i} & =\operatorname{Conv}\left(\mathbf{K}, \left[f_a, f_b, f_c, f_d\right]^T\right) \\ f_{i i i} & =\operatorname{Conv}\left(\mathbf{K}, \left[f_a, f_b, f_c, f_d\right]^T\right) \\ f_{i v} & =\operatorname{Conv}\left(\mathbf{K}, \left[f_c, f_a, f_b, f_d\right]^T\right) \end{aligned}\]
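This order dependence can be checked directly in PyTorch. The following minimal sketch (the tensor names and sizes are illustrative, not from the paper's code) applies the same kernel to the same four features in the orders of cases $(iii)$ and $(iv)$ and generally obtains different outputs:

import torch

C = 8
K = torch.randn(4, C)                      # kernel, one row per input position
f_a, f_b, f_c, f_d = torch.randn(4, C)     # four per-point features

feats_iii = torch.stack([f_a, f_b, f_c, f_d])   # order (a, b, c, d)
feats_iv  = torch.stack([f_c, f_a, f_b, f_d])   # order (c, a, b, d)

f_iii = (K * feats_iii).sum()   # "convolution" as a kernel-weighted sum
f_iv  = (K * feats_iv).sum()
print(torch.allclose(f_iii, f_iv))  # False in general: the output depends on the order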

This paper proposes PointCNN, a simple and general framework for feature learning on point clouds. PointCNN learns an \(\mathcal{X}\)-transformation from the input points, which maps the point order into a latent, potentially canonical, order. Concretely, the \(\mathcal{X}\)-transformation is learned by a multilayer perceptron from the coordinates of the $K$ input points \(\left(p_{1}, p_{2}, \ldots, p_{K}\right)\), which is equivalent to learning a $K\times K$ permutation matrix, and the transformed features are then convolved. This step is called \(\mathcal{X}\)-convolution (\(\mathcal{X}\)-Conv) and is the basic building block of PointCNN. \(\mathcal{X}\)-Conv takes the shape of the points into account while being invariant to their order.

Note that, since \(\mathcal{X}_{i i}\) and \(\mathcal{X}_{i i i}\) are learned from different shapes, they can differ and weight the input features accordingly, thereby achieving \(f_{i i} \neq f_{i i i}\). As for \(\mathcal{X}_{i i i}\) and \(\mathcal{X}_{i v}\), if they are learned such that \(\mathcal{X}_{i i i}=\mathcal{X}_{i v} \times \Pi\), where $\Pi$ is the permutation matrix that permutes $(c, a, b, d)$ into $(a, b, c, d)$, then \(f_{i i i} \equiv f_{i v}\) can be achieved as well.

\[\begin{aligned} f_{i i} & =\operatorname{Conv}\left(\mathbf{K}, \mathcal{X}_{i i} \times\left[f_a, f_b, f_c, f_d\right]^T\right) \\ f_{i i i} & =\operatorname{Conv}\left(\mathbf{K}, \mathcal{X}_{i i i} \times\left[f_a, f_b, f_c, f_d\right]^T\right) \\ f_{i v} & =\operatorname{Conv}\left(\mathbf{K}, \mathcal{X}_{i v} \times\left[f_c, f_a, f_b, f_d\right]^T\right) \end{aligned}\]
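This condition can also be checked numerically. In the following toy sketch (our own example), $\Pi$ is built so that it reorders $(c, a, b, d)$ into $(a, b, c, d)$; with the left-multiplication convention used here the condition reads \(\mathcal{X}_{iv}=\mathcal{X}_{iii} \times \Pi\), which is the relation in the text written with the transposed permutation matrix:

import torch

C = 8
K = torch.randn(4, C)
f_a, f_b, f_c, f_d = torch.randn(4, C)

F_iii = torch.stack([f_a, f_b, f_c, f_d])   # order (a, b, c, d)
F_iv  = torch.stack([f_c, f_a, f_b, f_d])   # order (c, a, b, d)

# Pi reorders (c, a, b, d) back into (a, b, c, d): Pi @ F_iv == F_iii
Pi = torch.zeros(4, 4)
Pi[0, 1] = Pi[1, 2] = Pi[2, 0] = Pi[3, 3] = 1.0

X_iii = torch.randn(4, 4)   # stands in for a learned X-transform
X_iv  = X_iii @ Pi          # the ideal case: the transform absorbs the permutation

f_out_iii = (K * (X_iii @ F_iii)).sum()
f_out_iv  = (K * (X_iv  @ F_iv)).sum()
print(torch.allclose(Pi @ F_iv, F_iii))     # True
print(torch.allclose(f_out_iii, f_out_iv))  # True: the order no longer matters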

1. \(\mathcal{X}\)-Conv

\(\mathcal{X}\)-Conv is defined as an operation on a local region of the point cloud. Since the output features should be associated with the representative points \(\left\{p_{2, i}\right\}\), \(\mathcal{X}\)-Conv takes their neighboring points in \(\left\{p_{1, i}\right\}\), together with the associated features, as input for the convolution.

For simplicity of description, let $p$ denote a representative point in \(\left\{p_{2, i}\right\}\), $f$ the feature of $p$, and \(\mathbb{N}\) the set of neighbors of $p$ in \(\left\{p_{1, i}\right\}\). For this particular $p$, the input of \(\mathcal{X}\)-Conv is \(\mathbb{S}=\left\{\left(p_{i}, f_{i}\right): p_{i} \in \mathbb{N}\right\}\), which is an unordered set. Without loss of generality, \(\mathbb{S}\) can be cast into a $K \times Dim$ matrix \(\mathbf{P}=\left(p_1, p_2, \ldots, p_{K}\right)^{T}\) and a $K \times C_1$ matrix \(\mathbf{F}=\left(f_1, f_2, \ldots, f_{K}\right)^{T}\); \(\mathbf{K}\) denotes the trainable convolution kernel. With these inputs, the output feature of $p$ is computed as:

\[\mathbf{F}_{p}=\mathcal{X}-\operatorname{Conv}(\mathbf{K}, p, \mathbf{P}, \mathbf{F})=\operatorname{Conv}\left(\mathbf{K}, \operatorname{MLP}(\mathbf{P}-p) \times\left[M L P_{\delta}(\mathbf{P}-p), \mathbf{F}\right]\right)\]

where $MLP_{\delta}(\cdot)$ is a multilayer perceptron applied individually to each point. Since all operations in \(\mathcal{X}\)-Conv are differentiable, \(\mathcal{X}\)-Conv itself is differentiable and can be plugged into any neural network trained with backpropagation.
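For a single representative point and its $K$ neighbors, the formula can be sketched as below. The class name XConvSketch, the layer widths, and the use of plain nn.Linear layers are assumptions for illustration only; the actual PointCNN implementation is batched and shares these computations across all neighborhoods.

import torch
import torch.nn as nn

class XConvSketch(nn.Module):
    def __init__(self, K, dim, C1, C_delta, C2):
        super().__init__()
        self.K = K
        # MLP(P - p): maps the K x dim local coordinates to a K x K matrix X
        self.mlp_x = nn.Sequential(nn.Linear(K * dim, K * K), nn.ReLU(),
                                   nn.Linear(K * K, K * K))
        # MLP_delta: lifts each individual coordinate to C_delta dimensions
        self.mlp_delta = nn.Sequential(nn.Linear(dim, C_delta), nn.ReLU(),
                                       nn.Linear(C_delta, C_delta))
        # Conv(K, .): here a dense layer over the K x (C_delta + C1) matrix
        self.conv = nn.Linear(K * (C_delta + C1), C2)

    def forward(self, p, P, F):
        # p: [dim], P: [K, dim] neighbor coordinates, F: [K, C1] neighbor features
        P_local = P - p                                    # move to local coordinates
        X = self.mlp_x(P_local.reshape(-1)).reshape(self.K, self.K)
        F_delta = self.mlp_delta(P_local)                  # [K, C_delta]
        F_star = torch.cat([F_delta, F], dim=-1)           # [K, C_delta + C1]
        F_x = X @ F_star                                   # weight and permute
        return self.conv(F_x.reshape(-1))                  # [C2] feature of p

K, dim, C1, C_delta, C2 = 8, 3, 16, 8, 32
layer = XConvSketch(K, dim, C1, C_delta, C2)
out = layer(torch.randn(dim), torch.randn(K, dim), torch.randn(K, C1))
print(out.shape)  # torch.Size([32])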

Note that \(\mathcal{X}\)-Conv first normalizes the neighboring points into the local coordinate system of $p$ to obtain local features (lines 1–3 of Algorithm 1). The output feature should be determined jointly by the neighboring points and their associated features, but the local coordinates differ from the features in both dimensionality and representation. To address this, the coordinates are first lifted into a higher-dimensional, more abstract representation and then concatenated with the associated features for further processing.

The coordinates and features are then weighted and permuted by the \(\mathcal{X}\)-transformation, which is learned jointly from all the neighboring points. The resulting \(\mathcal{X}\) depends on the order of the points and permutes \(\mathbf{F}_{*}\) according to the input point order. For input point clouds without any additional features, i.e. when \(\mathbf{F}\) is empty, the first \(\mathcal{X}\)-Conv layer uses only \(\mathbf{F}_{\delta}\). PointCNN can therefore handle point clouds with or without additional features in a robust and uniform way.

A PyTorch implementation of the \(\mathcal{X}\)-Conv layer is given below, where the representative points are selected by farthest point sampling and the neighborhoods by k-nearest neighbors.

import torch
import torch.nn as nn
import torch.nn.functional as F

def index_points(points, idx):
    """

    Input:
        points: input points data, [B, N, C]
        idx: sample index data, [B, S]
    Return:
        new_points:, indexed points data, [B, S, C]
    """
    device = points.device
    B = points.shape[0]
    view_shape = list(idx.shape)
    view_shape[1:] = [1] * (len(view_shape) - 1)
    repeat_shape = list(idx.shape)
    repeat_shape[0] = 1
    batch_indices = torch.arange(B, dtype=torch.long).to(device).view(view_shape).repeat(repeat_shape)
    new_points = points[batch_indices, idx, :]
    return new_points

def square_distance(src, dst):
    """
    Calculate squared Euclidean distance between each pair of points.
    src^T * dst = xn * xm + yn * ym + zn * zm;
    sum(src^2, dim=-1) = xn*xn + yn*yn + zn*zn;
    sum(dst^2, dim=-1) = xm*xm + ym*ym + zm*zm;
    dist = (xn - xm)^2 + (yn - ym)^2 + (zn - zm)^2
         = sum(src**2,dim=-1)+sum(dst**2,dim=-1)-2*src^T*dst

    Input:
        src: source points, [B, N, C]
        dst: target points, [B, M, C]
    Output:
        dist: per-point square distance, [B, N, M]
    """
    B, N, _ = src.shape
    _, M, _ = dst.shape
    dist = -2 * torch.matmul(src, dst.permute(0, 2, 1))
    dist += torch.sum(src ** 2, -1).view(B, N, 1)
    dist += torch.sum(dst ** 2, -1).view(B, 1, M)
    return dist
    
def farthest_point_sample(xyz, npoint):
    """
    Input:
        xyz: pointcloud data, [B, N, C]
        npoint: number of samples
    Return:
        centroids: sampled pointcloud index, [B, npoint]
    """
    device = xyz.device
    B, N, C = xyz.shape
    centroids = torch.zeros(B, npoint, dtype=torch.long).to(device)
    distance = torch.ones(B, N).to(device) * 1e10
    farthest = torch.zeros(B, dtype=torch.long).to(device)
    batch_indices = torch.arange(B, dtype=torch.long).to(device)
    for i in range(npoint):
        centroids[:, i] = farthest
        centroid = xyz[batch_indices, farthest, :].view(B, 1, 3)
        dist = torch.sum((xyz - centroid) ** 2, -1)
        mask = dist < distance
        distance[mask] = dist[mask]
        farthest = torch.max(distance, -1)[1]
    return centroids

def knn_point(nsample, xyz, new_xyz):
    """
    Input:
        nsample: max sample number in local region
        xyz: all points, [B, N, C]
        new_xyz: query points, [B, S, C]
    Return:
        group_idx: grouped points index, [B, S, nsample]
    """
    sqrdists = square_distance(new_xyz, xyz)
    _, group_idx = torch.topk(sqrdists, nsample, dim=-1, largest=False, sorted=False)
    return group_idx

def sample_and_group(npoint, nsample, xyz, points):
    """
    Input:
        npoint:  number of points selected by farthest point sampling
        nsample: number of neighbors sampled around each selected point
        xyz: input points position data, [B, N, C]
        points: input points data, [B, N, D]
    Return:
        new_xyz: sampled points position data, [B, npoint, C]
        new_points: sampled points data, [B, npoint, nsample, C+D]
        grouped_xyz_norm: neighbor coordinates relative to new_xyz, [B, npoint, nsample, C]
        idx: neighbor indices, [B, npoint, nsample]
    """
    B, N, C = xyz.shape
    S = npoint
    fps_idx = farthest_point_sample(xyz, npoint)  # [B, npoint]
    new_xyz = index_points(xyz, fps_idx)
    idx = knn_point(nsample, xyz, new_xyz)
    grouped_xyz = index_points(xyz, idx)  # [B, npoint, nsample, C]
    grouped_xyz_norm = grouped_xyz - new_xyz.view(B, S, 1, C)
    if points is not None:
        grouped_points = index_points(points, idx)
        new_points = torch.cat([grouped_xyz_norm, grouped_points], dim=-1)  # [B, npoint, nsample, C+D]
    else:
        new_points = grouped_xyz_norm
    return new_xyz, new_points, grouped_xyz_norm, idx

def sample_and_group_all(xyz, points):
    """
    Input:
        xyz: input points position data, [B, N, C]
        points: input points data, [B, N, D]
    Return:
        new_xyz: sampled points position data, [B, 1, C]
        new_points: sampled points data, [B, 1, N, C+D]
    """
    device = xyz.device
    B, N, C = xyz.shape
    # new_xyz = torch.zeros(B, 1, C).to(device)
    new_xyz = xyz.mean(dim=1, keepdim=True)
    grouped_xyz = xyz.view(B, 1, N, C) - new_xyz.view(B, 1, 1, C)
    if points is not None:
        new_points = torch.cat([grouped_xyz, points.view(B, 1, N, -1)], dim=-1)
    else:
        new_points = grouped_xyz
    return new_xyz, new_points, grouped_xyz

class WeightNet(nn.Module):
    def __init__(self, in_channel, out_channel, hidden_unit=[8, 8]):
        super(WeightNet, self).__init__()
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        if hidden_unit is None or len(hidden_unit) == 0:
            self.mlp_convs.append(nn.Conv2d(in_channel, out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))
        else:
            self.mlp_convs.append(nn.Conv2d(in_channel, hidden_unit[0], 1))
            self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[0]))
            for i in range(1, len(hidden_unit)):
                self.mlp_convs.append(nn.Conv2d(hidden_unit[i - 1], hidden_unit[i], 1))
                self.mlp_bns.append(nn.BatchNorm2d(hidden_unit[i]))
            self.mlp_convs.append(nn.Conv2d(hidden_unit[-1], out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))

    def forward(self, localized_xyz):
        # xyz : BxCxKxN
        weights = localized_xyz
        for i, conv in enumerate(self.mlp_convs):
            bn = self.mlp_bns[i]
            weights = F.relu(bn(conv(weights)))
        return weights

class PointConvDensitySetAbstraction(nn.Module):
    def __init__(self, npoint, nsample, in_channel, mlp, group_all):
        super(PointConvDensitySetAbstraction, self).__init__()
        self.npoint = npoint
        self.nsample = nsample
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        last_channel = in_channel
        for out_channel in mlp:
            self.mlp_convs.append(nn.Conv2d(last_channel, out_channel, 1))
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))
            last_channel = out_channel

        self.weightnet = WeightNet(3, 16)
        self.linear = nn.Linear(16 * mlp[-1], mlp[-1])
        self.bn_linear = nn.BatchNorm1d(mlp[-1])
        self.group_all = group_all

    def forward(self, xyz, points):
        """
        Input:
            xyz: input points position data, [B, C, N]
            points: input points data, [B, D, N]
        Return:
            new_xyz: sampled points position data, [B, C, S]
            new_points_concat: sample points feature data, [B, D', S]
        """
        B = xyz.shape[0]
        N = xyz.shape[2]
        xyz = xyz.permute(0, 2, 1)
        if points is not None:
            points = points.permute(0, 2, 1)

        if self.group_all:
            new_xyz, new_points, grouped_xyz_norm = sample_and_group_all(xyz, points)
        else:
            new_xyz, new_points, grouped_xyz_norm, _ = sample_and_group(self.npoint, self.nsample, xyz, points)
        # new_xyz: sampled points position data, [B, npoint, C]
        # new_points: sampled points data, [B, npoint, nsample, C+D]
        new_points = new_points.permute(0, 3, 2, 1)  # [B, C+D, nsample, npoint]
        for i, conv in enumerate(self.mlp_convs):
            bn = self.mlp_bns[i]
            new_points = F.relu(bn(conv(new_points)))

        grouped_xyz = grouped_xyz_norm.permute(0, 3, 2, 1)
        weights = self.weightnet(grouped_xyz)
        new_points = torch.matmul(new_points.permute(0, 3, 1, 2),
                                  weights.permute(0, 3, 2, 1)).view(B, self.npoint, -1)
        new_points = self.linear(new_points)
        new_points = self.bn_linear(new_points.permute(0, 2, 1))

        new_points = F.relu(new_points)
        new_xyz = new_xyz.permute(0, 2, 1)
        return new_xyz, new_points

2. PointCNN

The input of PointCNN is \(\mathbb{F}_{1}=\left\{\left(p_{1, i}, f_{1, i}\right): i=1,2, \ldots, N_{1}\right\}\), a set of points \(\left\{p_{1, i}: p_{1, i} \in\mathbb{R}^{\text {Dim }}\right\}\) together with a feature for each point \(\left\{f_{1, i}: f_{1, i} \in \mathbb{R}^{C_{1}}\right\}\). Following the hierarchical construction of convolutional networks, applying \(\mathcal{X}\)-Conv on \(\mathbb{F}_{1}\) yields a higher-level representation \(\mathbb{F}_{2}=\left\{\left(p_{2, i}, f_{2, i}\right): f_{2, i} \in \mathbb{R}^{C_{2}}, i=1,2, \ldots, N_{2}\right\}\), where \(\left\{p_{2, i}\right\}\) is a set of representative points of \(\left\{p_{1, i}\right\}\); \(\mathbb{F}_{2}\) has a lower spatial resolution and more channels than \(\mathbb{F}_{1}\), i.e. $N_2<N_1, C_2>C_1$. When this process is applied recursively, the features of the input points are "projected" or "aggregated" into fewer and fewer points, while each remaining point carries increasingly rich feature information.

Figure (a) shows a PointCNN architecture with two \(\mathcal{X}\)-Conv layers that gradually turns the input points (with or without features) into a small number of representative points with rich features. After the second \(\mathcal{X}\)-Conv layer only one representative point remains, which aggregates information from all the points of the previous layers. In PointCNN, the receptive field of each representative point can be defined as a ratio $K/N$, where $K$ is the number of neighboring points and $N$ is the number of points in the previous layer. The final point thus "sees" all the points of the previous layers, so its receptive field ratio is $1.0$: it has a global view of the entire shape, and its feature is highly informative for semantic understanding of the shape. Fully connected layers followed by a loss are attached after the last \(\mathcal{X}\)-Conv layer to train the network.

class PointConvDensityClsSsg(nn.Module):
    def __init__(self, cls = 40):
        super(PointConvDensityClsSsg, self).__init__()
        feature_dim = 3
        self.sa1 = PointConvDensitySetAbstraction(npoint=512, nsample=32, in_channel=feature_dim, mlp=[64, 64, 128], group_all=False)
        self.sa2 = PointConvDensitySetAbstraction(npoint=128, nsample=64, in_channel=128 + 3, mlp=[128, 128, 256], group_all=False)
        self.sa3 = PointConvDensitySetAbstraction(npoint=1, nsample=None, in_channel=256 + 3, mlp=[256, 512, 1024], group_all=True)
        self.fc1 = nn.Linear(1024, 512)
        self.bn1 = nn.BatchNorm1d(512)
        self.drop1 = nn.Dropout(0.7)
        self.fc2 = nn.Linear(512, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.drop2 = nn.Dropout(0.7)
        self.fc3 = nn.Linear(256, cls)

    def forward(self, xyz, feat=None):
        B, _, _ = xyz.shape
        l1_xyz, l1_points = self.sa1(xyz, feat)
        l2_xyz, l2_points = self.sa2(l1_xyz, l1_points)
        l3_xyz, l3_points = self.sa3(l2_xyz, l2_points)
        x = l3_points.view(B, 1024)
        x = self.drop1(F.relu(self.bn1(self.fc1(x))))
        x = self.drop2(F.relu(self.bn2(self.fc2(x))))
        x = self.fc3(x)
        x = F.log_softmax(x, -1)
        return x
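A quick sanity check of the classification network defined above (batch size and point count are illustrative):

model = PointConvDensityClsSsg(cls=40)
xyz = torch.randn(8, 3, 1024)   # [B, C, N] input point coordinates
logits = model(xyz)             # feat defaults to None
print(logits.shape)             # torch.Size([8, 40]) log-probabilities over classes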

Note that the number of points drops rapidly across the \(\mathcal{X}\)-Conv layers, which prevents such a simple network from being trained thoroughly. To address this, a denser PointCNN variant is proposed that keeps more representative points in the \(\mathcal{X}\)-Conv layers, as shown in Figure (b). To keep the depth of the network unchanged while maintaining the growth rate of the receptive field, PointCNN no longer takes a fixed set of $K$ neighboring points as input; instead, it randomly samples $K$ input points from the $K \times D$ nearest neighbors, where $D$ is the dilation rate. In this way, the receptive field ratio grows from $K/N$ to $(K\times D)/N$ without increasing the actual number of neighboring points or the kernel size. All $4$ representative points in the last \(\mathcal{X}\)-Conv layer of Figure (b) "see" the entire shape, so all of them are suitable for making predictions. At test time, the multiple predictions can be averaged before the softmax to make the result more stable.
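A minimal sketch of this dilated neighbor selection (our own, reusing the square_distance helper above): query the $K \times D$ nearest neighbors, then randomly keep $K$ of them.

def knn_point_dilated(nsample, dilation, xyz, new_xyz):
    # xyz: [B, N, C] all points, new_xyz: [B, S, C] query points
    sqrdists = square_distance(new_xyz, xyz)   # [B, S, N]
    _, idx = torch.topk(sqrdists, nsample * dilation, dim=-1, largest=False)
    # randomly keep nsample of the nsample*dilation candidates per query point
    keep = torch.rand(idx.shape, device=idx.device).argsort(dim=-1)[..., :nsample]
    return torch.gather(idx, -1, keep)         # [B, S, nsample]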

For segmentation tasks, the output must have the same resolution as the input points, which can be realized with a Conv-DeConv architecture, where the DeConv part propagates global information into high-resolution predictions; see Figure (c). Both the Conv and DeConv layers of the PointCNN segmentation network are \(\mathcal{X}\)-Conv operations; the only difference is that the latter outputs more points with fewer channels.

Dropout is applied before the last fully connected layers to reduce over-fitting, and subvolume supervision is used to reduce it further. Subvolume supervision means that the receptive field ratio of the last \(\mathcal{X}\)-Conv layer is set to a value less than $1$, so that each representative point only observes partial information. During training the network is pushed to learn harder from this partial information, and it then performs better at test time. In this case the global coordinates of the representative points matter, so they are lifted into feature space \(\mathbb{R}^{C_g}\) by $MLP_{g}(\cdot)$ and concatenated into the \(\mathcal{X}\)-Conv features for further processing by subsequent layers.
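A minimal sketch of lifting the global coordinates with an $MLP_{g}(\cdot)$ and concatenating them to the \(\mathcal{X}\)-Conv output of a representative point (our own; $C_g = 16$ is an assumed value):

mlp_g = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 16))  # assumed C_g = 16

def append_global_coords(f_p, p_global):
    # f_p: [B, C2] X-Conv output features, p_global: [B, 3] global coordinates
    return torch.cat([f_p, mlp_g(p_global)], dim=-1)  # [B, C2 + C_g]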

To improve generalization, the input points are randomly sampled and shuffled, so that the neighborhood point sets and their orders differ from batch to batch. To train a model that takes $N$ points as input, \(\mathcal{N}(N,(N/8)^2)\) points are used for training, where \(\mathcal{N}\) denotes a Gaussian distribution; this is crucial for training PointCNN successfully.
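A minimal sketch of this training-time sampling (our own): the number of points is drawn from \(\mathcal{N}(N,(N/8)^2)\), and a random subset of the cloud is then taken in random order.

def sample_training_points(points, N):
    # points: [P, 3] one training point cloud, with P at least the sampled count
    n = int(torch.normal(float(N), N / 8.0, size=(1,)).round().item())
    n = max(1, min(n, points.shape[0]))
    perm = torch.randperm(points.shape[0])[:n]   # random subset in random order
    return points[perm]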