Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

This post analyzes why gradient clipping can accelerate the training of deep networks. The main conclusion is that gradient clipping can be justified under a smoothness condition that is weaker than the standard Lipschitz-smoothness assumption.

1. Gradient Clipping

Let the loss function be $f(\theta)$. The gradient descent update is:

\[\theta \leftarrow \theta - \eta \nabla_{\theta}f(\theta)\]

Gradient clipping rescales the update according to the gradient norm, so that the norm of the update is capped at a constant ($\eta\gamma$ below):

\[\theta \leftarrow \theta - \eta \nabla_{\theta}f(\theta) \times \min \left\{ 1, \frac{\gamma}{||\nabla_{\theta}f(\theta)||} \right\}\]

Up to a bounded factor (at most 2) in the scaling, the update above can be written in the smoother form:

\[\theta \leftarrow \theta - \eta \nabla_{\theta}f(\theta) \times \frac{\gamma}{||\nabla_{\theta}f(\theta)||+\gamma}\]
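For concreteness, here is a minimal NumPy sketch of both update rules (the function names and the example gradient are illustrative assumptions, not code from the paper); the two scaling factors always agree within a factor of 2, which is why the smoother form is convenient for analysis:

```python
import numpy as np

def clipped_update(theta, grad, lr, gamma):
    """Hard clipping: scale the step so its norm is at most lr * gamma (assumes a nonzero gradient)."""
    norm = np.linalg.norm(grad)
    return theta - lr * grad * min(1.0, gamma / norm)

def soft_clipped_update(theta, grad, lr, gamma):
    """Smoothed variant used in the analysis: scale the step by gamma / (||grad|| + gamma)."""
    norm = np.linalg.norm(grad)
    return theta - lr * grad * gamma / (norm + gamma)

theta = np.zeros(2)
grad = np.array([3.0, 4.0])                                  # ||grad|| = 5
print(clipped_update(theta, grad, lr=0.1, gamma=1.0))        # scaling factor min(1, 1/5) = 0.2
print(soft_clipped_update(theta, grad, lr=0.1, gamma=1.0))   # scaling factor 1/(5+1) ~ 0.167
```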

2. $(L_0,L_1)$-Smoothness

The authors observe empirically that the local smoothness of the loss along the training trajectory grows approximately linearly with the gradient norm.

Motivated by this observation, they introduce a condition weaker than the usual Lipschitz-smoothness assumption, called $(L_0,L_1)$-smoothness:

\[||\nabla_{\theta}f(\theta+\Delta \theta) - \nabla_{\theta}f(\theta)|| \leq \left(L_0+L_1 ||\nabla_{\theta}f(\theta)||\right) ||\Delta \theta||\]
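As a rough numerical illustration of this condition, the sketch below estimates the local smoothness of a toy loss by finite differences and compares it against $L_0+L_1 ||\nabla_{\theta}f(\theta)||$. The toy function and the constants $L_0=0$, $L_1=1$ are assumptions made for this sketch only, not values from the paper:

```python
import numpy as np

# Toy loss chosen for illustration: f(x) = exp(x0) + exp(-x1).
# Its Hessian norm is max(exp(x0), exp(-x1)) <= ||grad f(x)||, so locally it satisfies
# the (L0, L1)-smoothness condition with roughly L0 = 0, L1 = 1 (an assumption of this sketch).
def f(x):
    return np.exp(x[0]) + np.exp(-x[1])

def grad_f(x):
    return np.array([np.exp(x[0]), -np.exp(-x[1])])

rng = np.random.default_rng(0)
L0, L1 = 0.0, 1.0
for _ in range(5):
    theta = rng.normal(size=2) * 2.0
    delta = rng.normal(size=2)
    delta *= 1e-3 / np.linalg.norm(delta)  # small perturbation
    # Finite-difference estimate of the local smoothness ||grad f(theta+delta) - grad f(theta)|| / ||delta||
    local_smoothness = np.linalg.norm(grad_f(theta + delta) - grad_f(theta)) / np.linalg.norm(delta)
    bound = L0 + L1 * np.linalg.norm(grad_f(theta))
    print(f"local smoothness ~ {local_smoothness:.4f},  L0 + L1*||grad|| = {bound:.4f}")
```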

Consider the auxiliary function $f(\theta+t\Delta \theta)$ with $t\in[0,1]$. By the fundamental theorem of calculus:

\[\begin{aligned} f(\theta+\Delta \theta) - f(\theta) &= \int_0^1 \frac{\partial f(\theta+t\Delta \theta)}{\partial t} \, dt \\ & = \int_0^1 \langle \nabla_{\theta} f(\theta+t\Delta \theta), \Delta \theta\rangle \, dt \\ & = \langle \nabla_{\theta} f(\theta), \Delta \theta\rangle + \int_0^1 \langle \nabla_{\theta} f(\theta+t\Delta \theta)-\nabla_{\theta} f(\theta), \Delta \theta\rangle \, dt \\ & \leq \langle \nabla_{\theta} f(\theta), \Delta \theta\rangle + \int_0^1 ||\nabla_{\theta} f(\theta+t\Delta \theta)-\nabla_{\theta} f(\theta)|| \cdot ||\Delta \theta|| \, dt \\ & \leq \langle \nabla_{\theta} f(\theta), \Delta \theta\rangle + \int_0^1 \left(L_0+L_1 ||\nabla_{\theta}f(\theta)||\right) \cdot t\, ||\Delta \theta||^2 \, dt \\ & = \langle \nabla_{\theta} f(\theta), \Delta \theta\rangle + \frac{1}{2}\left(L_0+L_1 ||\nabla_{\theta}f(\theta)||\right) ||\Delta \theta||^2 \end{aligned}\]

Substituting the gradient descent step $\Delta\theta= - \eta \nabla_{\theta}f(\theta)$ gives:

\[\begin{aligned} f(\theta+\Delta \theta) - f(\theta) & \leq \langle \nabla_{\theta} f(\theta), \Delta \theta\rangle + \frac{1}{2}\left(L_0+L_1 ||\nabla_{\theta}f(\theta)||\right) ||\Delta \theta||^2 \\ & = \langle \nabla_{\theta} f(\theta), - \eta \nabla_{\theta}f(\theta)\rangle + \frac{1}{2}\left(L_0+L_1 ||\nabla_{\theta}f(\theta)||\right) ||{-\eta} \nabla_{\theta}f(\theta)||^2 \\ & = \left(\frac{1}{2}\left(L_0+L_1 ||\nabla_{\theta}f(\theta)||\right)\eta^2-\eta \right) ||\nabla_{\theta}f(\theta)||^2 \end{aligned}\]

For the update to decrease the loss, i.e. $f(\theta+\Delta \theta) < f(\theta)$, the coefficient $\frac{1}{2}\left(L_0+L_1 ||\nabla_{\theta}f(\theta)||\right)\eta^2-\eta$ must be negative (assuming a nonzero gradient), which requires:

\[\eta < \frac{2}{L_0+L_1 ||\nabla_{\theta}f(\theta)||}\]

The step size that minimizes this upper bound, and hence approximately maximizes the decrease $f(\theta) - f(\theta+\Delta \theta)$, is:

\[\eta = \frac{1}{L_0+L_1 ||\nabla_{\theta}f(\theta)||}\]
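To see this, write $a = L_0+L_1 ||\nabla_{\theta}f(\theta)||$. The upper bound is the quadratic $\left(\frac{a}{2}\eta^2-\eta\right)||\nabla_{\theta}f(\theta)||^2$ in $\eta$, whose minimizer and minimum value are:

\[\frac{d}{d\eta}\left(\frac{a}{2}\eta^2-\eta\right) = a\eta - 1 = 0 \;\Rightarrow\; \eta = \frac{1}{a}, \qquad f(\theta+\Delta \theta) - f(\theta) \leq -\frac{||\nabla_{\theta}f(\theta)||^2}{2\left(L_0+L_1 ||\nabla_{\theta}f(\theta)||\right)}\]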

The corresponding update rule is then:

\[\theta \leftarrow \theta - \nabla_{\theta}f(\theta) \times \frac{1}{L_0+L_1 ||\nabla_{\theta}f(\theta)||}\]
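This is precisely the smoothed clipping update from Section 1, with $\eta = 1/L_0$ and $\gamma = L_0/L_1$, since

\[\eta \cdot \frac{\gamma}{||\nabla_{\theta}f(\theta)||+\gamma}\,\Bigg|_{\eta=1/L_0,\;\gamma=L_0/L_1} = \frac{1}{L_0}\cdot\frac{L_0/L_1}{||\nabla_{\theta}f(\theta)||+L_0/L_1} = \frac{1}{L_0+L_1 ||\nabla_{\theta}f(\theta)||}\]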

Therefore, gradient clipping is exactly the update rule implied by the $(L_0,L_1)$-smoothness assumption: the clipped step approximately maximizes the guaranteed decrease of the loss at each iteration, which explains why gradient clipping can accelerate training.
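In practice, this is the standard norm-clipping step applied before each optimizer update. Below is a minimal PyTorch sketch; the model, data, and threshold are placeholders chosen for illustration, not values from the paper:

```python
import torch
import torch.nn as nn

# Minimal sketch of norm clipping in a standard PyTorch training step.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)
gamma = 1.0  # clipping threshold, i.e. the constant gamma in the clipped update above

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
# Rescales all gradients in place so that their total norm is at most gamma,
# i.e. the min{1, gamma/||g||} factor of the clipped update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=gamma)
optimizer.step()
```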