中科院电子所十室自研UWB MIMO雷达.

超宽带(Ultra-wideband, UWB)雷达常用于穿墙成像。其中单输入单输出穿墙雷达系统只能提供一维距离信息,无法进行二维成像。合成孔径雷达(Synthetic aperture radar, SAR)使用单个雷达沿固定基线扫描合成孔径以获得多角度信息,但其移动扫描耗时、不方便。MIMO雷达使用多输入多输出的阵列拓扑,具有实时成像分辨率和高杂波抑制能力。

MIMO雷达的成像质量主要由硬件系统和成像算法决定。成像的方位分辨率和高度分辨率受到雷达孔径的限制。雷达孔径越大,成像分辨率越高,但需要更多的信道。本文设计了一种易于实现的多信道MIMO雷达,在提高雷达孔径的同时进行实时成像,其中雷达的信道是通过微波开关进行分时复用。

常用的雷达成像算法包括反向投影算法(back projection,BP)距离迁移算法(range migration,RM)压缩感知(compressed sensing,CA)

本文导出了一个修正的基尔霍夫三维成像公式,适用于UWB MIMO雷达的三维穿墙成像计算。目标运动对成像的影响是不可忽略的,由于本文设计的雷达通过通道切换分时复用,切换过程中降低了系统的扫描率,因此需要更高的信噪比,获取单帧数据的时间也会更长。本文引入了一个参考信道来估计运动目标的位置变化,用于进行运动补偿。

为了获得较大的矩形孔径,并在方位、高度和对角方向(方位角的45°)实现较低的峰值旁瓣比和高分辨率,四个发射天线放置在阵列的角落,接收天线分布在两个同心圆上,因为同心圆的投影在各个方向上是一致的,这可以确保每个维度的一维投影冗余最小化,并降低旁瓣值。参考通道位于阵列中心,天线为平面阿基米德螺旋天线,如图所示。

笔者的一些理论证明

本周,笔者试图在理论上给出证明:对雷达回波信号进行BP成像后用三维卷积神经网络处理和直接对雷达回波信号用Transformer进行处理在满足特定条件时是等价的。下面直接给出证明思路和结论:

MIMO雷达系统使用窄脉冲作为发射信号。发射天线发射电磁波信号,穿透墙体后照射人体目标,并反射回波信号。假设MIMO阵列有$M$个发射天线和$N$个接收天线,则共有$MN$个通道,回波信号经过处理后可以被表示为:

\[S_{\text{origin}}(m,n,t) = \sum_{p}^{P} \sigma_p f(t- \tau_{mn,p})\]

The MIMO radar system adopts a narrow pulse signal as the transmitted signal. The transmitting antenna transmits the electromagnetic wave signal, penetrates the wall occlusion, irradiates the human target, and then reflects the radar echo. Suppose the MIMO array has $M$ transmitting antennas and $N$ receiving antennas, there are a total of $MN$ channels. After preprocessing, the expression of the echo signal is defined as:

其中$\sigma_p$为空间中成像点$p$的反射率,$t$是脉冲信号$f$的时间,$\tau_{mn,p}$是从空间点$p$到第$m$个发射天线和第$n$个接收天线的往返传播时延。

where $\sigma_p$ denotes the reflect ratio of point $p$, $t$ is the time of the pulse signal $f$, and $\tau_{mn,p}$ is the round-trip propagation delay from the pixel p to the $m$th transmitting antenna and the $n$th receiving antenna, where $m = 1, 2, . . . , M, n = 1, 2, . . . , N$.

在实际应用场景中,不仅人体目标会反射回波信号,场景中的其他家具和墙壁也会射回波信号。 由于人体目标不能完全静止,即使人体目标静止,其呼吸和心跳也会影响雷达信号。 因此,在雷达信号处理中通常需要预处理来滤除静止杂波,提高了杂波背景下的目标检测能力和雷达系统的抗干扰能力。若用$f_{\text{denoise}}$表示信号的全局滤波模板,则经过预处理的信号可以表示为:

\[S(m,n,t) = f_{\text{denoise}} \ast S_{\text{origin}}(m,n,t)\]

However, in the actual application scene, the human target will reflect echo, and other furniture and walls in the detection scene will also reflect echo. Since the human target cannot be completely stationary, even if the human target is stationary, its respiration and heartbeat will modulate the radar signal. Therefore, the preprocessing is usually used in radar signal processing to filter out the stationary clutter, which greatly improves the detection ability of the target in the background of clutter and improves the anti-interference ability of radar system. $f_{\text{denoise}}$ is the global filtering mask for the signal, then the preprocessed echo signal is defined as:

反向投影算法是最常用的穿墙成像方法,它是一种典型的、广泛应用的时域图像形成方法,适用于几乎任何天线阵列布局。反向投影算法的基本思想是计算成像区域内的一个点与接收天线和发射天线之间的延时,然后相干叠加所有回波的贡献,以获得图像中该点的相应像素值。继而通过逐点相干叠加整个成像区域来获得成像区域的图像。在成像空间中点$p$的三维反向投影成像公式可以表示为:

\[I(p) = \sum_{mn=1}^{MN} S(m,n,\tau_{mn,p})\]

The back-projection (BP) algorithm is the most commonly used method for through-wall imaging, which is a typical and widely used time-domain image formation, applicable for almost any antenna array layout. The basic idea of the BP algorithm is to calculate the time delay between a point in the imaging area and the receiving and transmitting antennas, and then coherently stack the contributions of all echoes to obtain the corresponding pixel value of the point in the image. In this way, the image of the imaging area can be obtained by coherently stacking the whole imaging area point by point. As for the point, p, in the imaging space, its 3D BP imaging formation of MIMO radar can be represented as:

其中$MN$是总通道数,$S$是之前定义的雷达回波信号,$\tau_{mn,p}$是从空间点$p$到第$m$个发射天线和第$n$个接收天线的往返传播时延。

where $MN$ is the total channels, $S$ is the radar echo as defined in (1), and $\tau_{x,y,z,mn}$ is the round-trip propagation delay from the spatial location $(x,y,z)$ to the $m$th transmitting antenna and the $n$th receiving antenna, where $m = 1, 2, . . . , M, n = 1, 2, . . . , N$.

人体目标通常是一个扩展的目标,包含身体各部分的若干散射点。上述公式计算所有通道的散射点,将能量聚焦在四肢位置,并得到人体目标在空间中的形状分布,即包含距离-方位-高度信息的人体目标的三维成像结果。然而,由于雷达孔径的限制,三维成像的空间分辨率很低,人眼很难观察到人体目标的形状。因此通常使用卷积神经网络来进一步重建人类目标的姿态。

As for the human target, it is an extended target that contains many scattering points of various parts of the body. We use (2) to calculate these scattering points of all channels to focus the energy at the position of the limbs, and then obtain the morphological distribution of the human target in space, which is the 3D imaging result of the human target containing distance-azimuth-height information. However, due to the limitation of the radar aperture, the resolution of 3D imaging is low, and it is hard to observe the shape of the human target. Therefore, we consider using a convolutional neural network to further reconstruct the pose of the human target.

卷积神经网络作为一种层次化的网络结构,可以通过堆叠卷积层对输入数据进行分层分离和抽象,实现特征抽取和类别辨识的集成。典型的卷积神经网络由卷积层、池化层和非线性激活函数组成。各组成部分相互配合,通过前向传播进行分层数据分析,通过反向传播进行误差传递和网络参数优化。在前向传播中,向网络输入一个样本张量,张量的每个元素经过隐藏层的逐步加权求和和非线性激活,从输出层输出一个预测张量。反向传播则与之相反,它采用梯度下降法从神经网络模型的最后一层到第一层逐层更新模型参数,以最小化模型对训练数据的损失函数。

As a hierarchical network structure, CNN can separate and abstract the input data layer-by-layer through stacked convolutional layers to realize the integration of feature extraction and classification recognition. The typical convolutional neural network consists of the following three parts: convolutional layer, pooling layer, and nonlinear activation function. The components of the convolutional neural network stack with each other, and hierarchical data analysis is carried out through forwarding propagation, and error conduction and network parameter optimization are carried out through backward propagation. The forward propagation is the process of first inputting a sample vector to the network, then, each element of the sample vector goes through the step-by-step weighted summation and nonlinear activation of each hidden layer, and finally outputs a prediction vector from the output layer. The backward propagation is the opposite of the forwarding propagation, which uses the gradient descent method to obtain model parameters layer-by-layer from the last layer to the first layer of the neural network model to minimize the loss function of the neural network model on the training data.

三维卷积神经网络被广泛应用于光学图像的人体姿态重构任务中。首先从输入三维点云中提取包含人体强度位置信息的特征图,再将其重构为人体目标的位置概率映射。通过选择合适的步长参数与填充参数,通过卷积层能够生成与输入张量尺寸相同、但具有更多通道携带更丰富语义信息的特征图。对于某个通道,其点$p$位置的特征可以被计算为:

\[Y_{p} = \sum_{\delta \in \Delta_K}^{} I(p+\delta) \cdot W_{\delta}+b_1 \\ = \sum_{\delta \in \Delta_K}^{} \sum_{mn=1}^{MN} S(m,n,\tau_{mn,p+\delta}) \cdot W_{\delta}+b_1 \quad (1)\]

The CNN network is widely used in the optical image field for the pose reconstruction task. Firstly, the feature extraction module extracts the intensity and location information of the human body in the input image through multilayer convolution operations, and then obtains the depth feature matrix. Secondly, the pose reconstruction module transforms the feature matrix into probability confidence maps through multilayer deconvolution operations. Finally, the joint location information of the human target is reconstructed from the probability confidence maps using the softmax function. Furthermore, the pooling layer and nonlinear activation function layer are after each convolutional and deconvolutional layer.

其中$\Delta_K$表示卷积核,$\delta$是卷积核中的位置偏移(相对于卷积核的中心位置)。$W_{\delta}$和$b_1$是卷积神经网络的参数。上式表示点$p$位置的特征由点$p$位置及其附近位置(称之为感受野)共同决定。随着卷积层的加深,感受野也不断增大。

另一方面,雷达回波信号$S(m,n,t)$可以被表示为二维张量$S(mn,t)$,简记为$S$。将该张量的每一列看作一个token,即将回波信号看作长度为$mn$的序列。应用Transformer,即对回波信号$S$应用自注意力操作后可得(其原理从略,已在笔者以往周报中多次介绍):

\[\text{SelfAttention}(S) = \text{softmax}(SW_qW_k^TS^T) \cdot S \cdot W_v\]

其中$W_q$、$W_k$和$W_v$是自注意力运算引入的权重参数。通常使用多个自注意力的head,即multi-head的自注意力机制,假设共使用了$N_h$个head,则最终的计算结果为:

\[\text{MultiHeadSelfAttention}(S) = \mathop{\text{concat}}_{h \in N_h} [\text{SelfAttention}_h(S)] \cdot W_h + b_2 \\ = \mathop{\text{concat}}_{h \in N_h} [\text{softmax}(SW_q^h(W_k^h)^TS^T) \cdot S \cdot W_v^h] \cdot W_h + b_2\]

其中$W_h$和$b_2$是multi-head运算引入的参数。若将$W_h$划分成$N_h$个子矩阵$W_h=[W_h^1,…,W_h^h,…,W_h^{N_h}]$,则concat与矩阵乘法可以被替换为矩阵乘法之和,即:

\[\text{MultiHeadSelfAttention}(S) = \sum_{h \in N_h}^{} \text{softmax}(SW_q^h(W_k^h)^TS^T) \cdot S \cdot W_v^h \cdot W_h^h + b_2\]

为简化表达式,进一步记$W_1^h = W_q^h(W_k^h)^T$,$W_2^h = W_v^h W_h^h$,则对雷达回波信号$S$应用multi-head自注意力机制后,其输出表达式可以表示为:

\[\text{MultiHeadSelfAttention}(S) = \sum_{h \in N_h}^{} \text{softmax}(SW_1^hS^T) \cdot S \cdot W_2^h + b_2\]

softmax函数的运算性质可知,其输出结果是一个归一化的概率分布向量,将其记作\(\text{softmax}( \cdot )=[ \text{mask}_1,...,\text{mask}_{mn},...,\text{mask}_{MN} ]\),则将上式化简为:

\[\text{MultiHeadSelfAttention}(S) = \sum_{h \in N_h}^{} \sum_{mn=1}^{MN} \text{mask}_{mn} \cdot S(m,n,t) \cdot W_2^h + b_2 \quad (2)\]

对比公式$(1)$与公式$(2)$,不难得出对雷达回波信号进行BP成像后用三维卷积神经网络处理和直接对雷达回波信号用Transformer进行处理在满足特定条件时是等价的,其条件如下:

  1. 卷积神经网络的卷积核尺寸为$\sqrt{N_h} \times \sqrt{N_h}$,自注意力运算的head数为$N_h$;
  2. 卷积神经网络的步长参数和填充参数应满足其输出特征图的尺寸与输入特征相同;
  3. softmax函数计算出的mask与时延$\tau_{mn,p+\delta}$是匹配的。

尽管在实践中,上述条件很难得到满足;但足以说明两种方法具有可替换性。上述公式中的参数都是通过网络学习得到的,在实践中通过成像和卷积的操作已被证明能够重构准确的三维人体姿态,则直接将Transformer应用到回波信号上进行重构也是可行的。后者由于可以并行运算,通常具有更快的推理速度和更低的网络FLOPS值,且能够捕捉信号的全局信息,因此理论上是更优的。