通过对比预测编码进行表示学习.

对比预测编码(Contrastive Predictive Coding, CPC)是一种应用于高维数据(如音频数据)的无监督学习方法,把生成模型问题转化为分类问题,并构造InfoNCE损失通过交叉熵衡量模型从一个不相关的负样本集合中区分未来表示的能力。

CPC使用一个编码器$g_{enc}(\cdot)$把输入数据$x_t$压缩为潜在表示$z_t=g_{enc}(x_t)$,然后使用一个自回归解码器$g_{ar}(\cdot)$从过去的潜在表示中学习高级上下文特征$c_t=g_{ar}(z_{\leq t})$,并预测未来的信息\(\hat{z}_{t+k}=W_kc_t\)。

使用上下文特征$c_t$预测未来的信息时,最大化输入$x$和上下文向量$c$之间的互信息:

\[I(x;c) = \sum_{x,c} p(x,c) \log \frac{p(x,c)}{p(x)p(c)} = \sum_{x,c} p(x,c) \log \frac{p(x|c)}{p(x)}\]

CPC通过一个密度函数$f$来建模$x_{t+k}$和$c_t$之间的互信息:

\[f_k(x_{t+k},c_t) = \exp(z_{t+k}^TW_kc_t) \propto \frac{p(x_{t+k}|c_{t})}{p(x_{t+k})}\]

给定$N$个随机样本$X=(x_1,…,x_N)$,包含$1$个正样本$x_i$~$p(x_{t+k}|c_t)$和$N-1$个负样本$x_{l \neq i}$~$p(x_{t+k})$,则检测到正样本的概率为:

\[\begin{aligned} p(d=i| X,c_t) &= \frac{p(d=i,X| c_t)}{\sum_{j=1}^N p(d=j,X| c_t)} \\ &= \frac{p(x_i|c_t) \prod_{l \neq i} p(x_l) }{\sum_{j=1}^N p(x_j|c_t) \prod_{l \neq j} p(x_l)} \\&= \frac{p(x_i|c_t) /p(x_i) }{\sum_{j=1}^N p(x_j|c_t) /p(x_j)} = \frac{f(x_i,c_t) }{\sum_{j=1}^N f(x_j,c_t) } \end{aligned}\]

进而通过交叉熵构造区分正样本的InfoNCE损失:

\[\mathcal{L}_N = - \Bbb{E}_X[\log\frac{f_k(x_{t+k},c_t)}{\sum_{i=1}^N f_k(x_{i},c_t)}] = - \Bbb{E}_X[\log\frac{\exp(z_{t+k}^TW_kc_t)}{\sum_{i=1}^N \exp(z_{i}^TW_kc_t)}]\]

下面说明InfoNCE损失是互信息的一个下界InfoNCE损失可以写作:

\[\begin{aligned} \mathcal{L}_N &= - \Bbb{E}_X[\log\frac{f_k(x_{t+k},c_t)}{\sum_{i=1}^N f_k(x_{i},c_t)}] \\ &= - \Bbb{E}_X[\log\frac{\frac{p(x_{t+k}|c_{t})}{p(x_{t+k})}}{\frac{p(x_{t+k}|c_{t})}{p(x_{t+k})} + \sum_{x_j \in X_{neg}} \frac{p(x_{j}|c_{t})}{p(x_{j})}}] \\ &= \Bbb{E}_X[\log(1+\frac{p(x_{t+k})}{p(x_{t+k}|c_{t})} \sum_{x_j \in X_{neg}} \frac{p(x_{j}|c_{t})}{p(x_{j})}) ] \\ &\approx \Bbb{E}_X[\log(1+\frac{p(x_{t+k})}{p(x_{t+k}|c_{t})} (N-1) \Bbb{E}_{x_j} [\frac{p(x_{j}|c_{t})}{p(x_{j})}]) ] \\ &= \Bbb{E}_X[\log(1+\frac{p(x_{t+k})}{p(x_{t+k}|c_{t})} (N-1)) ] \\ &= \Bbb{E}_X[\log(\frac{p(x_{t+k})}{p(x_{t+k}|c_{t})}N+ \frac{p(x_{t+k}|c_{t})-p(x_{t+k})}{p(x_{t+k}|c_{t})}) ] \\ &\geq \Bbb{E}_X[\log(\frac{p(x_{t+k})}{p(x_{t+k}|c_{t})}N) ] \\ &= -\Bbb{E}_X[\log(\frac{p(x_{t+k}|c_{t})}{p(x_{t+k})}) ] + \log(N) \\ &= -I(x_{t+k},c_t) + \log(N) \end{aligned}\]

当把CPC应用于图像数据时,把图像划分成相互重叠的图像块,每个图像块通过ResNet编码器编码为特征向量$z_{i,j}$;然后通过自回归模型构造上下文特征$c_{i,j}$,并向下进行预测\(\hat{z}_{i+k,j}=W_kc_{i,j}\)。

CPC构造的InfoNCE损失旨在正确地区分预测目标,而负样本$z_l$来自其他图像块或其他图像:

\[\mathcal{L}_N = - \sum_{i,j,k} \log p(z_{i+k,j}|\hat{z}_{i+k,j},\{z_l\}) = - \sum_{i,j,k} \log\frac{\exp(\hat{z}_{i+k,j}^Tz_{i+k,j})}{\exp(\hat{z}_{i+k,j}^Tz_{i+k,j}) + \sum_l \exp(\hat{z}_{i+k,j}^Tz_{l})}\]