HFVAE: Structured Disentangled Representations via a Hierarchically Factorized VAE.

1. Decomposing the VAE Objective

The VAE optimizes the evidence lower bound (ELBO) on the log-likelihood:

\[\log p(x) = \log \Bbb{E}_{q(z|x)}[\frac{p(x,z)}{q(z|x)}] \geq \Bbb{E}_{q(z|x)}[\log \frac{p(x,z)}{q(z|x)}]\]
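To make the inequality concrete, the following sketch checks it numerically in a toy linear-Gaussian model where $\log p(x)$ is available in closed form. The model, the observed value, and the variational parameters are illustrative assumptions, not part of the paper.

```python
# Numerical check of log p(x) >= ELBO in a tractable toy model (an assumption for
# illustration):  p(z) = N(0, 1),  p(x|z) = N(z, sigma_x^2)  =>  p(x) = N(0, 1 + sigma_x^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma_x = 0.5
x0 = 1.3                                   # a single observed data point

# Exact log-evidence of the toy model.
log_px = norm.logpdf(x0, loc=0.0, scale=np.sqrt(1.0 + sigma_x**2))

def elbo(mu_q, sigma_q, n_samples=200_000):
    """Monte Carlo estimate of E_{q(z|x)}[log p(x,z) - log q(z|x)]."""
    z = rng.normal(mu_q, sigma_q, size=n_samples)
    log_joint = norm.logpdf(x0, loc=z, scale=sigma_x) + norm.logpdf(z, 0.0, 1.0)
    log_q = norm.logpdf(z, mu_q, sigma_q)
    return np.mean(log_joint - log_q)

# An arbitrary (mis-specified) variational posterior gives a strict gap.
print("log p(x)      :", log_px)
print("ELBO (rough q):", elbo(mu_q=0.0, sigma_q=1.0))

# With the exact posterior of this linear-Gaussian model, the bound becomes tight.
post_var = 1.0 / (1.0 + 1.0 / sigma_x**2)
post_mean = post_var * x0 / sigma_x**2
print("ELBO (true q) :", elbo(post_mean, np.sqrt(post_var)))
```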

The ELBO can equivalently be written as:

\[\Bbb{E}_{q(z|x)}[\log \frac{p(x,z)}{q(z|x)}] = \Bbb{E}_{q(z|x)}[\log \frac{p(x,z)}{q(x,z)}+\log q(x)]\]

In practice, the VAE objective is defined as the expectation of the per-datapoint ELBO under the empirical distribution $q(x)$ over a finite set of data points:

\[\Bbb{E}_{q(x)} [\Bbb{E}_{q(z|x)}[\log \frac{p(x,z)}{q(x,z)}+\log q(x)]] = \Bbb{E}_{q(x,z)}[\log \frac{p(x,z)}{q(x,z)}]+\Bbb{E}_{q(x)}[\log q(x)]\]
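As a reference point for the later decompositions, here is a minimal sketch of this minibatch objective for a standard VAE with a Gaussian encoder and a Bernoulli decoder; the architecture sizes and the decoder choice are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch of the per-datapoint ELBO, averaged over a minibatch drawn
# from the empirical distribution q(x).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=10, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        logits = self.dec(z)
        # Single-sample estimate of E_q[log p(x|z)] for a Bernoulli decoder.
        log_px_z = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        # KL[q(z|x) || p(z)] with p(z) = N(0, I), in closed form.
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(-1)
        return (log_px_z - kl).mean()        # minibatch average of per-datapoint ELBOs

x = torch.rand(32, 784)                      # placeholder minibatch from q(x)
loss = -VAE()(x)                             # maximize the ELBO = minimize its negative
```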

The term $\Bbb{E}_{q(x)}[\log q(x)]$ contains no trainable parameters, so the VAE objective amounts to minimizing the KL divergence between the joint distributions $q(x,z)$ and $p(x,z)$. The authors decompose this quantity further:

\[\begin{aligned} \Bbb{E}_{q(x,z)}[\log \frac{p(x,z)}{q(x,z)}] &= \Bbb{E}_{q(x,z)}[\log \frac{p(x,z)}{q(x,z)} \frac{p(x)}{p(x)} \frac{p(z)}{p(z)} \frac{q(x)}{q(x)} \frac{q(z)}{q(z)}] \\ &= \Bbb{E}_{q(x,z)}[\log \frac{p(x,z)}{p(x)p(z)}+\log \frac{q(z)q(x)}{q(x,z)} + \log \frac{p(x)}{q(x)}+\log \frac{p(z)}{q(z)}] \\ &= \Bbb{E}_{q(x,z)}[\log \frac{p(x,z)}{p(x)p(z)}] + \Bbb{E}_{q(x,z)}[\log \frac{q(z)q(x)}{q(x,z)}]+ \Bbb{E}_{q(x,z)}[\log \frac{p(x)}{q(x)}]+ \Bbb{E}_{q(x,z)}[\log \frac{p(z)}{q(z)}] \\ &= \Bbb{E}_{q(x,z)}[\log \frac{p(x|z)}{p(x)}] + \Bbb{E}_{q(x,z)}[\log \frac{q(z)}{q(z|x)}]+ \Bbb{E}_{q(x)}[\log \frac{p(x)}{q(x)}]+ \Bbb{E}_{q(z)}[\log \frac{p(z)}{q(z)}] \\ &= \Bbb{E}_{q(x,z)}[\log \frac{p(x|z)}{p(x)}] - \Bbb{E}_{q(x,z)}[\log \frac{q(z|x)}{q(z)}]- \Bbb{E}_{q(x)}[\log \frac{q(x)}{p(x)}]- \Bbb{E}_{q(z)}[\log \frac{q(z)}{p(z)}] \end{aligned}\]

The VAE ELBO therefore decomposes into the following four terms:

Terms ① and ② enforce consistency between the conditional distributions. Term ①, $\Bbb{E}_{q(x,z)}[\log p(x|z)/p(x)]$, measures the uniqueness of the reconstruction: maximizing it makes the latent code $z$ that generates each sample $x$ identifiable. Term ②, $\Bbb{E}_{q(x,z)}[\log q(z|x)/q(z)]$, measures the uniqueness of the encoding; it equals the mutual information $I(z,x)$, so penalizing it regularizes the model by weakening the identifiability of the latent code $z$.

Terms ③ and ④ enforce consistency between the marginal distributions. Term ③, $KL[q(x)||p(x)]$, matches the data distribution and is equivalent to maximizing the log-likelihood $\Bbb{E}_{q(x)}[\log p(x)]$; term ④, $KL[q(z)||p(z)]$, matches the aggregate posterior to the prior.
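The four-term identity above is purely algebraic, so it can be checked numerically. The sketch below verifies it for arbitrary discrete joint distributions $q(x,z)$ and $p(x,z)$; the distributions themselves are random placeholders used only for the check.

```python
# Numerical check that E_{q(x,z)}[log p(x,z)/q(x,z)] = term1 - term2 - term3 - term4
# for arbitrary discrete joints (random placeholders, not data from the paper).
import numpy as np

rng = np.random.default_rng(1)
def random_joint(nx, nz):
    j = rng.random((nx, nz))
    return j / j.sum()

q = random_joint(3, 4)          # q(x,z): rows index x, columns index z
p = random_joint(3, 4)          # p(x,z)
qx, qz = q.sum(1), q.sum(0)     # marginals q(x), q(z)
px, pz = p.sum(1), p.sum(0)     # marginals p(x), p(z)

lhs = np.sum(q * np.log(p / q))                              # E_q[log p(x,z)/q(x,z)]
term1 = np.sum(q * np.log(p / (px[:, None] * pz[None, :])))  # ① E_q[log p(x|z)/p(x)]
term2 = np.sum(q * np.log(q / (qx[:, None] * qz[None, :])))  # ② E_q[log q(z|x)/q(z)] = I_q(x;z)
term3 = np.sum(qx * np.log(qx / px))                         # ③ KL[q(x)||p(x)]
term4 = np.sum(qz * np.log(qz / pz))                         # ④ KL[q(z)||p(z)]

assert np.isclose(lhs, term1 - term2 - term3 - term4)
print(lhs, term1 - term2 - term3 - term4)
```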

Term ① is hard to handle in practice because $p(x)$ is not directly available. Combining terms ① and ③ avoids this difficulty:

\[\Bbb{E}_{q(x,z)}[\log \frac{p(x|z)}{p(x)}] + \Bbb{E}_{q(x,z)}[\log \frac{p(x)}{q(x)}] = \Bbb{E}_{q(x,z)}[\log \frac{p(x|z)}{q(x)}]\]

To study the contribution of each term, the figure below shows what happens when each term is removed from the objective. Removing term ③ or ④ can yield a model in which $p(x)$ deviates from $q(x)$ or $q(z)$ deviates from $p(z)$. Removing term ① means each sample $x$ no longer needs to correspond to a unique latent code $z$. Removing term ② leaves the mutual information $I(z,x)$ unconstrained, so each sample $x$ is mapped to its own distinct region of the latent space.

2. Hierarchically Factorized VAE

The authors aim to strengthen the statistical independence among the features learned by the VAE. Term ④, $KL[q(z)||p(z)]$, is the key to feature disentanglement: if the prior $p(z)$ is specified with mutually independent dimensions, the learned aggregate posterior $q(z)$ will also tend toward featurewise independence. Term ④ is decomposed further:

\[\begin{aligned} - \Bbb{E}_{q(z)}[\log \frac{q(z)}{p(z)}] &= - \Bbb{E}_{q(z)}[\log \frac{q(z)}{p(z)} \frac{\prod_{d}p(z_d)}{\prod_{d}p(z_d)}\frac{\prod_{d}q(z_d)}{\prod_{d}q(z_d)}] \\ &= - \Bbb{E}_{q(z)}[\log \frac{q(z)}{\prod_{d}q(z_d)} + \log \frac{\prod_{d}q(z_d)}{\prod_{d}p(z_d)} + \log \frac{\prod_{d}p(z_d)}{p(z)}]\\ &= \Bbb{E}_{q(z)}[\log \frac{p(z)}{\prod_{d}p(z_d)}]- \Bbb{E}_{q(z)}[\log \frac{q(z)}{\prod_{d}q(z_d)}]- \Bbb{E}_{q(z)}[\log \frac{\prod_{d}q(z_d)}{\prod_{d}p(z_d)}] \end{aligned}\]

The first two terms are total correlation (TC) terms and are jointly denoted term A; the third term, which equals $-\sum_d KL[q(z_d)||p(z_d)]$, measures each marginal of the latent code. If $z_d$ itself denotes a group of variables, it can in turn be decomposed into a within-group total correlation term (i) and a marginal term (ii), which opens the door to inducing a hierarchy of disentangled features.
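The same kind of numerical check works for this decomposition of term ④. The sketch below verifies it for a discrete latent $z=(z_1,z_2)$, treating each dimension as its own group; the distributions are random placeholders.

```python
# Numerical check of  -KL[q(z)||p(z)] = E_q[log p(z)/prod p(z_d)]
#                                       - E_q[log q(z)/prod q(z_d)] - sum_d KL[q(z_d)||p(z_d)]
# for a two-dimensional discrete latent (random placeholder distributions).
import numpy as np

rng = np.random.default_rng(2)
def random_joint(shape):
    j = rng.random(shape)
    return j / j.sum()

qz = random_joint((3, 3))       # q(z1, z2)
pz = random_joint((3, 3))       # p(z1, z2)
q1, q2 = qz.sum(1), qz.sum(0)   # marginals q(z_d)
p1, p2 = pz.sum(1), pz.sum(0)   # marginals p(z_d)

lhs = -np.sum(qz * np.log(qz / pz))                                 # -KL[q(z)||p(z)]
prod_q = q1[:, None] * q2[None, :]
prod_p = p1[:, None] * p2[None, :]
tc_p = np.sum(qz * np.log(pz / prod_p))       # E_q(z)[log p(z)/prod_d p(z_d)]
tc_q = np.sum(qz * np.log(qz / prod_q))       # E_q(z)[log q(z)/prod_d q(z_d)]  (total correlation)
marg = np.sum(q1 * np.log(q1 / p1)) + np.sum(q2 * np.log(q2 / p2))  # sum_d KL[q(z_d)||p(z_d)]

assert np.isclose(lhs, tc_p - tc_q - marg)
print(lhs, tc_p - tc_q - marg)
```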

In principle, this decomposition can be continued at any level. The HFVAE objective is constructed as $①+③+ii+\alpha②+\beta A+\gamma i$:

\[\begin{aligned} & \Bbb{E}_{q(x,z)}[\log \frac{p(x|z)}{p(x)}] -KL[q(x)||p(x)] - \sum_{d,e}KL[q(z_{d,e})||p(z_{d,e})] \\ &- \alpha \Bbb{E}_{q(x,z)}[\log \frac{q(z|x)}{q(z)}] +\beta \Bbb{E}_{q(z)}[\log \frac{p(z)}{\prod_{d}p(z_d)}-\log \frac{q(z)}{\prod_{d}q(z_d)}] \\& + \gamma \sum_{d} \Bbb{E}_{q(z_d)}[\log \frac{p(z_d)}{\prod_{e}p(z_{d,e})}-\log \frac{q(z_d)}{\prod_{e}q(z_{d,e})}] \end{aligned}\]

In this objective, the first two terms together act as the reconstruction loss; the third term measures the KL divergence of each individual element of the latent code; the fourth term controls the mutual information $I(z,x)$ through $\alpha$; the fifth term regularizes the total correlation between groups of variables through $\beta$; and the sixth term regularizes the within-group total correlation through $\gamma$.
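A minimal sketch of how such an objective might be assembled from minibatch quantities is given below. It assumes a Gaussian encoder $q(z|x)$, a standard normal prior, a Bernoulli decoder, and naive minibatch-mixture (log-mean-exp) estimates of the aggregate densities $q(z)$, $q(z_d)$, $q(z_{d,e})$, in the spirit of the estimators popularized by β-TCVAE; the function names, group structure, and default weights are illustrative and not the authors' implementation.

```python
# Sketch of an HFVAE-style surrogate loss under the assumptions stated above.
import math
import torch
import torch.nn.functional as F

def gaussian_log_density(z, mu, logvar):
    """Elementwise log N(z; mu, diag(exp(logvar)))."""
    return -0.5 * (math.log(2 * math.pi) + logvar + (z - mu) ** 2 / logvar.exp())

def hfvae_loss(x, x_logits, z, mu, logvar, groups, alpha=1.0, beta=5.0, gamma=1.0):
    """
    x, x_logits   : data and Bernoulli decoder logits, shape (N, x_dim)
    z, mu, logvar : encoder sample and parameters of q(z|x), shape (N, z_dim)
    groups        : list of index lists partitioning the z dimensions into groups z_d
    """
    N = z.size(0)
    # Terms ① + ③ combined: E_q[log p(x|z)] (up to the constant involving q(x)).
    log_px_z = -F.binary_cross_entropy_with_logits(x_logits, x, reduction="none").sum(-1)

    # log q(z_n | x_n) and, for every pair (n, m), the elementwise log q(z_{n,e} | x_m).
    log_qz_x = gaussian_log_density(z, mu, logvar).sum(-1)                                  # (N,)
    log_q_nm = gaussian_log_density(z.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))   # (N, N, z_dim)

    # Naive minibatch-mixture estimates of the aggregate posterior and its marginals.
    log_qz = torch.logsumexp(log_q_nm.sum(-1), dim=1) - math.log(N)             # log q(z)
    log_qz_e = torch.logsumexp(log_q_nm, dim=1) - math.log(N)                   # log q(z_{d,e})
    log_qz_d = torch.stack([torch.logsumexp(log_q_nm[:, :, g].sum(-1), dim=1) - math.log(N)
                            for g in groups], dim=-1)                           # log q(z_d)

    # Prior densities; p(z) = N(0, I) factorizes over elements and groups.
    log_pz_e = gaussian_log_density(z, torch.zeros_like(z), torch.zeros_like(z))
    log_pz = log_pz_e.sum(-1)
    log_pz_d = torch.stack([log_pz_e[:, g].sum(-1) for g in groups], dim=-1)

    mi = (log_qz_x - log_qz).mean()                                 # ②  I(z, x)
    elem_kl = (log_qz_e - log_pz_e).sum(-1).mean()                  # ii sum_{d,e} KL[q(z_{d,e})||p(z_{d,e})]
    tc_between = ((log_qz - log_qz_d.sum(-1))
                  - (log_pz - log_pz_d.sum(-1))).mean()             # A  between-group total correlation
    tc_within = sum(((log_qz_d[:, i] - log_qz_e[:, g].sum(-1))
                     - (log_pz_d[:, i] - log_pz_e[:, g].sum(-1))).mean()
                    for i, g in enumerate(groups))                  # i  within-group total correlation

    objective = log_px_z.mean() - elem_kl - alpha * mi - beta * tc_between - gamma * tc_within
    return -objective                                               # minimize the negative objective

# Example usage with placeholder tensors and two groups of latent dimensions.
N, x_dim, z_dim = 32, 784, 10
x = torch.rand(N, x_dim)
mu, logvar = torch.randn(N, z_dim), torch.zeros(N, z_dim)
z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
x_logits = torch.randn(N, x_dim)
loss = hfvae_loss(x, x_logits, z, mu, logvar, groups=[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]])
```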