The implicit gradient regularization method.
The authors point out that training with a finite learning rate implicitly adds a gradient-penalty term to the optimization objective, and that this penalty helps generalization. For this reason the learning rate should usually not be made too small.
The general gradient-descent update is:
\[\theta_{t+\gamma} = \theta_t - \gamma g(\theta_t)\]Suppose the discrete iterates lie on a smooth trajectory \(\theta(t)\), so that \(\theta_{t+\gamma}\) admits the Taylor expansion:
\[\theta_{t+\gamma} = \theta_{t} + \gamma \dot{\theta}_{t} + \frac{1}{2}\gamma^2 \ddot{\theta}_{t}+ \frac{1}{6}\gamma^3 \theta_{t}^{(3)} + \cdots\]We now ask which continuous flow \(\dot{\theta}_t = -\tilde{g}(\theta_t)\) this trajectory follows: as \(\gamma \to 0\) it reduces to the ordinary gradient flow with \(\tilde{g}(\theta_t) = g(\theta_t)\), while for a finite learning rate the two differ. Below we work out the relationship between \(g(\theta_t)\) and \(\tilde{g}(\theta_t)\). Writing \(\nabla\) for the differential operator \(\mathrm{d}/\mathrm{d}t\), we have:
\[\begin{aligned} \theta_{t+\gamma} &= \theta_{t} + \gamma \nabla \theta_{t} + \frac{1}{2}\gamma^2 \nabla^2 \theta_{t}+ \frac{1}{6}\gamma^3 \nabla^3 \theta_{t} + \cdots \\ & = \left( 1+ \gamma \nabla +\frac{1}{2}\gamma^2 \nabla^2 + \frac{1}{6}\gamma^3 \nabla^3 + \cdots\right)\theta_{t} \\ & = e^{\gamma \nabla}\theta_{t} \end{aligned}\]so the gradient-descent update can be written as:
\[\left( e^{\gamma \nabla}-1\right)\theta_t = - \gamma g(\theta_t)\]Treating the operator as a formal power series (the expansion of \(x/(e^x-1)\), i.e. the generating function of the Bernoulli numbers), we get:
\[\begin{aligned} \nabla \theta_t &= - \gamma \left( \frac{\nabla}{e^{\gamma \nabla}-1}\right) g(\theta_t) = - \frac{\gamma\nabla}{e^{\gamma \nabla}-1} g(\theta_t) \\ &= - \left( 1-\frac{1}{2}\gamma \nabla + \frac{1}{12} \gamma^2 \nabla^2-\frac{1}{720}\gamma^4 \nabla^4 + \cdots \right) g(\theta_t) \\ &\approx - \left( 1-\frac{1}{2}\gamma \nabla \right) g(\theta_t) = -g(\theta_t) + \frac{1}{2}\gamma \nabla g(\theta_t) \\ &= -g(\theta_t) + \frac{1}{2}\gamma \nabla_{\theta_t} g(\theta_t) \nabla \theta_t\\ &= -g(\theta_t) + \frac{1}{2}\gamma \nabla_{\theta_t} g(\theta_t) \left[-g(\theta_t) + \frac{1}{2}\gamma \nabla_{\theta_t} g(\theta_t) \nabla \theta_t\right] \\ &\approx -g(\theta_t) - \frac{1}{2}\gamma \nabla_{\theta_t} g(\theta_t)g(\theta_t) = -g(\theta_t) - \frac{1}{4}\gamma \nabla_{\theta_t} ||g(\theta_t)||^2 \end{aligned}\]where the last step uses that \(\nabla_{\theta_t} g(\theta_t)\) is the Hessian of \(L(\theta_t)\) and hence symmetric, so \(\nabla_{\theta_t} g(\theta_t)\,g(\theta_t) = \frac{1}{2}\nabla_{\theta_t} ||g(\theta_t)||^2\). Therefore:
\[\begin{aligned} \tilde{g}(\theta_t) &= -\dot{\theta}_{t} = -\nabla \theta_t \\ & \approx g(\theta_t) + \frac{1}{4}\gamma \nabla_{\theta_t} ||g(\theta_t)||^2 \\ & = \nabla_{\theta_t} \left( L(\theta_t) + \frac{1}{4}\gamma ||\nabla_{\theta_t} L(\theta_t)||^2 \right) \end{aligned}\]In other words, first-order gradient descent with a finite learning rate effectively performs gradient flow on the original loss plus a gradient-penalty regularizer. The gradient penalty drives the model toward flatter regions of the loss surface, which helps generalization. If \(\gamma \to 0\), this implicit penalty weakens and eventually disappears.
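As a quick numerical sanity check of this conclusion, the sketch below (my own toy setup, not from the paper: \(L(\theta) = \theta^4/4\), \(\gamma = 0.1\), \(\theta_0 = 1\)) compares the discrete gradient-descent iterates with the exact flows of \(L\) and of the modified loss \(L + \frac{1}{4}\gamma ||\nabla L||^2\); the iterates should track the modified flow noticeably more closely.

```python
# A toy 1-D check (my own sketch, not the paper's code): with a finite learning
# rate gamma, the GD iterates on L(theta) = theta^4 / 4 should stay closer to the
# flow of the modified loss  L + (gamma/4) * (dL/dtheta)^2  than to the flow of L.
import numpy as np

gamma, steps, theta0 = 0.1, 50, 1.0

def grad_L(th):          # g(theta) = dL/dtheta for L = theta^4 / 4
    return th ** 3

def grad_L_mod(th):      # d/dtheta of the modified loss L + (gamma/4) * g(theta)^2
    return th ** 3 + 1.5 * gamma * th ** 5

# Discrete gradient-descent iterates.
gd = [theta0]
for _ in range(steps):
    gd.append(gd[-1] - gamma * grad_L(gd[-1]))
gd = np.array(gd)

def flow(grad_fn, n_steps, th0, substeps=100):
    """Integrate d(theta)/dt = -grad_fn(theta) with RK4, sampled every gamma units of time."""
    out, th, h = [th0], th0, gamma / substeps
    for _ in range(n_steps):
        for _ in range(substeps):
            k1 = -grad_fn(th)
            k2 = -grad_fn(th + 0.5 * h * k1)
            k3 = -grad_fn(th + 0.5 * h * k2)
            k4 = -grad_fn(th + h * k3)
            th = th + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        out.append(th)
    return np.array(out)

print("max |GD - flow of L|         :", np.abs(gd - flow(grad_L, steps, theta0)).max())
print("max |GD - flow of modified L|:", np.abs(gd - flow(grad_L_mod, steps, theta0)).max())
```

With a smaller \(\gamma\) both gaps shrink, but the modified flow remains the better description of the discrete dynamics, consistent with the implicit penalty being of order \(\gamma\).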
Hence the learning rate should not be set too small during training: a larger learning rate not only accelerates convergence but also improves the model's generalization. Alternatively, the gradient penalty can be added to the loss explicitly:
\[\mathcal{L}(x,y;W) + \lambda ||\nabla_{W} \mathcal{L}(x,y;W)||^2\]
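A minimal PyTorch sketch of this explicit penalty (the model, data, and coefficient below are placeholders chosen for illustration); the key point is that the inner gradient is taken with `create_graph=True` so that the penalty term can itself be backpropagated:

```python
# Minimal sketch (illustrative; model, data and lam are placeholders): train with
# the explicit penalty  L(x, y; W) + lam * ||grad_W L(x, y; W)||^2.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
lam = 0.1                                        # penalty coefficient lambda

x, y = torch.randn(16, 10), torch.randn(16, 1)   # dummy batch

loss = criterion(model(x), y)
# First-order gradients w.r.t. the weights, kept in the autograd graph
# (create_graph=True) so that the penalty itself can be differentiated.
grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
penalty = sum((g ** 2).sum() for g in grads)

total = loss + lam * penalty
optimizer.zero_grad()
total.backward()        # second backward pass goes through the gradient norm
optimizer.step()
```

The double backward makes each step roughly twice as expensive as plain SGD, which is the usual trade-off for making the regularizer explicit rather than relying on the learning rate alone.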