Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units

GELU：随机正则化的高斯误差线性单元.

paper：Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units
arXiv：link

在深度学习模型中，通过引入激活函数增强模型的非线性；同时使用正则化提高模型的泛化能力。作者提出了高斯误差线性单元(Gaussian Error Linear Unit,GELU)，为激活函数引入了随机正则化效果。GELU把神经元的输入建模为标准正态分布，表达式如下：

\[\text{GELU}(x)=xP(X≤x)=x\Phi(x)\]

其中$\Phi(x)$是标准正态分布的累积分布函数(COF)，即：

\[\Phi(x) = \int_{-∞}^{x} \frac{e^{-\frac{t^2}{2}}}{\sqrt{2\pi}}dt\]

引入误差函数$\text{erf}(x)=\frac{2}{\sqrt{\pi}}\int_{0}^{x} e^{-t^2}dt$，则GELU也表示为：

\[\text{GELU}(x)=x\Phi(x)=x\int_{-∞}^{x} \frac{e^{-\frac{t^2}{2}}}{\sqrt{2\pi}}dt \\ = x\frac{1}{\sqrt{2\pi}}(\int_{-∞}^{x}e^{-\frac{t^2}{2}}dt) = x\frac{1}{\sqrt{2\pi}}(\int_{-∞}^{0}e^{-\frac{t^2}{2}}dt+\int_{0}^{x}e^{-\frac{t^2}{2}}dt) \\ = x\frac{1}{\sqrt{2\pi}}(\sqrt{\frac{\pi}{2}}+\sqrt{2}\int_{0}^{\frac{x}{\sqrt{2}}}e^{-(\frac{t}{\sqrt{2}})^2}d\frac{t}{\sqrt{2}}) \\ = x(\frac{1}{2}+\frac{1}{\sqrt{\pi}}\int_{0}^{\frac{x}{\sqrt{2}}}e^{-t^2}dt)= x(\frac{1}{2}+\frac{1}{2}\frac{2}{\sqrt{\pi}}\int_{0}^{\frac{x}{\sqrt{2}}}e^{-t^2}dt) \\ = x\cdot \frac{1}{2}(1+\text{erf}(\frac{x}{\sqrt{2}}))\]

GELU的表达式是非初等函数形式，在实际使用时，常使用其初等函数的近似。用Tanh函数或Sigmoid函数近似：

\[\text{GELU}(x)≈\frac{1}{2}x(1+\text{tanh}(\sqrt{\frac{2}{\pi}}(x+0.044715x^3)))\] \[\text{GELU}(x)≈xσ(1.702x)\]

下面简单介绍这两个近似公式的由来。

⚪ 使用tanh近似

\[\text{GELU}(x)≈\frac{1}{2}x(1+\text{tanh}(\sqrt{\frac{2}{\pi}}(x+0.044715x^3)))\]

对GELU近似，即对其中的非初等函数$\text{erf}(\frac{x}{\sqrt{2}})$近似。若考虑近似形式$\text{tanh}(ax+bx^3)$，对两式在$x=0$处进行泰勒展开:

\[\text{erf}(\frac{x}{\sqrt{2}})=\sqrt{\frac{2}{\pi}}x-\frac{1}{3\sqrt{2\pi}}x^3+o(x^5)\] \[\text{tanh}(ax+bx^3)=ax-\frac{(a^3-3b)x^3}{3}+o(x^5)\]

联立上述两式得：

\[a=\sqrt{\frac{2}{\pi}}\\b=\frac{2\sqrt{2}}{3\pi^{3/2}}-\frac{1}{3\sqrt{2\pi}}= \sqrt{\frac{2}{\pi}}(\frac{2}{3\pi}-\frac{1}{6})≈\sqrt{\frac{2}{\pi}}\cdot 0.0455399\]

则$\text{erf}(\frac{x}{\sqrt{2}})$可以近似写作：

\[\text{erf}(\frac{x}{\sqrt{2}})≈\text{tanh}(\sqrt{\frac{2}{\pi}}(x+0.0455399x^3))\]

注意到上式系数$0.0455399$与近似式的系数$0.044715$略有差异。这是因为泰勒展开是一种局部拟合的方法，在局部展开点$x=0$附近拟合精度比较高，但远离展开点时误差会逐渐增大；因此还需要考虑全局的拟合误差。注意到上述近似中$a=\sqrt{\frac{2}{\pi}}$是一阶局部近似，保留该局部近似；通过调整$b$减小全局误差。

将问题建模成min-max形式，即希望原函数$\text{erf}(\frac{x}{\sqrt{2}})$与近似函数$\text{tanh}(\sqrt{\frac{2}{\pi}}x+bx^3)$在全局的最大差异尽可能小，表示为下述问题：

\[\mathop{\min}_{b} \mathop{\max}_{x} \quad |\text{erf}(\frac{x}{\sqrt{2}})-\text{tanh}(\sqrt{\frac{2}{\pi}}x+bx^3)|\]

上式可以通过非线性规划求解：

import numpy as np
from scipy.special import erf
from scipy.optimize import minimize

def f(x, b):
    a = np.sqrt(2 / np.pi)
    return np.abs(erf(x / np.sqrt(2)) - np.tanh(a * x + b * x**3))

def g(b):
    return np.max([f(x, b) for x in np.arange(0, 4, 0.001)])

options = {'xtol': 1e-10, 'ftol': 1e-10, 'maxiter': 100000}
result = minimize(g, 0, method='Powell', options=options)
print(result.x) # [0.03567734]

解得$b≈0.03567734$，则$\text{erf}(\frac{x}{\sqrt{2}})$可以近似写作：

\[\text{erf}(\frac{x}{\sqrt{2}})≈\text{tanh}(\sqrt{\frac{2}{\pi}}x+0.03567734x^3) \\ = \text{tanh}(\sqrt{\frac{2}{\pi}}(x+0.0447149x^3))\]

因此GELU可以近似表示为：

\[\text{GELU}(x)= x\cdot \frac{1}{2}(1+\text{erf}(\frac{x}{\sqrt{2}})) \\ ≈\frac{1}{2}x(1+\text{tanh}(\sqrt{\frac{2}{\pi}}(x+0.0447149x^3)))\]

⚪ 使用sigmoid近似

\[\text{GELU}(x)≈xσ(1.702x)\]

按照之前的思路，采用全局近似使得原函数$\Phi(x)$与近似函数$\sigma(cx)$在全局的最大差异尽可能小，表示为下述min-max问题：

\[\mathop{\min}_{b} \mathop{\max}_{x} \quad |\Phi(x)-\sigma(cx)|\]

上式可以通过非线性规划求解：

import numpy as np
from scipy.stats import norm
from scipy.special import expit
from scipy.optimize import minimize

def f(x,c):
    return np.abs(norm.cdf(x) -expit(c * x))

def g(c):
    return np.max([f(x, c) for x in np.arange(0, 4, 0.001)])

options = {'xtol': 1e-10, 'ftol': 1e-10, 'maxiter': 100000}
result = minimize(g, 0, method='Powell', options=options)
print(result.x) # [1.70174493]

解得$c≈1.70174493$，则原函数$\Phi(x)$可以近似写作：

\[\Phi(x)≈\sigma(cx)=\sigma(1.70174493x)\]

因此GELU可以近似表示为：

\[\text{GELU}(x)= x\Phi(x) ≈x\sigma(1.70174493x)\]