AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

AdaX: adaptive gradient descent based on exponential long-term memory.
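
The annotation points at the core idea: replace Adam's exponentially decaying second moment with one that keeps accumulating long-term gradient information. A minimal NumPy sketch of one step is below; the exact accumulation rule, bias correction, and constants are assumptions based on the title, not taken verbatim from the paper.

```python
import numpy as np

def adax_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=1e-4, eps=1e-8):
    """One AdaX-style step (sketch). Unlike Adam, the second moment grows with
    a (1 + beta2) factor instead of decaying, giving long-term memory of past
    gradients. The bias-correction form used here is an assumption."""
    m = beta1 * m + (1 - beta1) * grad             # first moment, as in Adam
    v = (1 + beta2) * v + beta2 * grad ** 2        # accumulating second moment
    v_hat = v / ((1 + beta2) ** t - 1)             # t starts at 1
    return param - lr * m / (np.sqrt(v_hat) + eps), m, v
```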

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Adafactor: reduces Adam's memory footprint.
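
The memory saving comes from factoring each weight matrix's second-moment accumulator into a per-row and a per-column statistic, so storage is O(rows + cols) instead of O(rows * cols). A rough NumPy sketch of that factorization follows; the paper's relative step sizes, update clipping, and decay schedule are omitted, and the reconstruction formula is stated as an assumption.

```python
import numpy as np

def factored_second_moment(R, C, grad, beta2=0.999, eps=1e-30):
    """Sketch of a factored second-moment estimate for a 2-D weight.
    R holds running row means of grad**2 (shape [rows]); C holds running
    column means (shape [cols]). The full matrix is rebuilt as an outer
    product, so only two vectors are kept in memory."""
    sq = grad ** 2 + eps
    R = beta2 * R + (1 - beta2) * sq.mean(axis=1)   # per-row statistic
    C = beta2 * C + (1 - beta2) * sq.mean(axis=0)   # per-column statistic
    V = np.outer(R, C) / R.mean()                   # rank-1 reconstruction
    return R, C, V
```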

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

NovoGrad: gradient normalization using layer-wise adaptive second moments.
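
Here the second moment is a single scalar per layer (a running average of the layer's squared gradient norm), and the gradient is normalized by it before entering the momentum term. A minimal NumPy sketch for one layer, with default hyperparameters chosen for illustration:

```python
import numpy as np

def novograd_step(w, grad, m, v, lr=0.01, beta1=0.95, beta2=0.98,
                  weight_decay=0.0, eps=1e-8):
    """One NovoGrad-style step (sketch) for a single layer. v is a scalar:
    the layer-wise second moment of the gradient norm, so normalization is
    applied per layer rather than per parameter."""
    v = beta2 * v + (1 - beta2) * np.sum(grad ** 2)   # layer-wise second moment
    g_normed = grad / (np.sqrt(v) + eps)              # layer-wise normalization
    m = beta1 * m + (g_normed + weight_decay * w)     # momentum on normalized grad
    return w - lr * m, m, v
```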

Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation

A copy-paste data augmentation method for instance segmentation.
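
In its simplest form the augmentation copies object pixels from one image onto another using their instance masks, and the occluded parts of the target's masks are removed. A minimal NumPy sketch is below (the random scaling, flipping, and jittering used in the paper are left out, and both images are assumed to have the same size):

```python
import numpy as np

def copy_paste(src_img, src_mask, dst_img, dst_mask):
    """Paste the source instance (src_mask == True) onto the destination image.
    Images are HxWx3 arrays, masks are HxW boolean arrays of the same size."""
    out_img = dst_img.copy()
    out_img[src_mask] = src_img[src_mask]   # copy the object pixels over
    out_mask = dst_mask & ~src_mask         # pasted object occludes existing instances
    return out_img, out_mask
```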

Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

LAMB: combines layer-wise adaptive learning rates with Adam.
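
The key step is rescaling each layer's Adam update by a trust ratio ||w|| / ||update||, which keeps the step size proportional to the weight norm and makes very large batches trainable. A per-layer NumPy sketch, with hyperparameter defaults chosen for illustration:

```python
import numpy as np

def lamb_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              weight_decay=0.01, eps=1e-6):
    """One LAMB-style step (sketch) for a single layer: an Adam direction
    plus weight decay, rescaled by the layer-wise trust ratio."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                    # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                    # bias-corrected second moment
    r = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w
    trust_ratio = np.linalg.norm(w) / (np.linalg.norm(r) + eps)
    return w - lr * trust_ratio * r, m, v
```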

Visualization of Convolutional Neural Networks

Visualization methods for convolutional neural networks.
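
One of the simplest visualization methods is capturing intermediate feature maps with a forward hook and plotting each channel as a grayscale image. A short PyTorch sketch (the model and layer chosen here are only examples):

```python
import torch
import torchvision

model = torchvision.models.resnet18().eval()   # randomly initialized example model
features = {}

def hook(module, inputs, output):
    features["layer1"] = output.detach()       # keep the layer's activations

model.layer1.register_forward_hook(hook)
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))         # dummy input image

print(features["layer1"].shape)                # e.g. torch.Size([1, 64, 56, 56])
```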