Memory-Efficient Adaptive Optimization

SM3: a memory-efficient adaptive optimization algorithm.
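
A minimal NumPy sketch of the cover-set idea behind SM3, shown for a 2-D parameter with rows and columns as the cover sets, so only O(m + n) accumulator memory is kept instead of the O(m * n) per-entry accumulators of Adagrad/Adam. Function and argument names are illustrative, not the paper's notation.

```python
import numpy as np

def sm3_step(w, g, row_acc, col_acc, lr=0.1, eps=1e-8):
    """One SM3-style update for a matrix parameter w with gradient g."""
    # Each cover-set accumulator grows by the largest squared gradient it covers.
    row_acc += np.max(g * g, axis=1)   # one scalar per row
    col_acc += np.max(g * g, axis=0)   # one scalar per column
    # Per-entry second-moment estimate: the tightest covering accumulator.
    nu = np.minimum(row_acc[:, None], col_acc[None, :])
    w -= lr * g / (np.sqrt(nu) + eps)
    return w, row_acc, col_acc
```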

Averaging Weights Leads to Wider Optima and Better Generalization

SWA: finding wider minima through stochastic weight averaging.
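
A minimal sketch of the averaging step in SWA: keep an equal running average of weight snapshots sampled along the SGD trajectory. Collecting the snapshots (e.g. at the end of each learning-rate cycle) is left to the training loop; the function name is an assumption.

```python
import numpy as np

def swa_average(weight_snapshots):
    """Incremental equal average of snapshots w_1..w_n: w_swa = (1/n) * sum_i w_i."""
    w_swa, n = None, 0
    for w in weight_snapshots:
        n += 1
        w_swa = w.copy() if w_swa is None else w_swa + (w - w_swa) / n
    return w_swa
```

In the paper, batch-normalization statistics are recomputed with one extra pass over the data after the weights have been averaged.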

Decoupled Weight Decay Regularization

AdamW: decoupling weight decay regularization from the gradient-based update.
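
A minimal NumPy sketch of one AdamW step. The key point is that the weight-decay term multiplies the weights directly instead of being folded into the gradient as in L2-regularized Adam. Hyperparameter names follow common implementations rather than the paper's notation, and the schedule multiplier is folded into the learning rate here.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update: decay is applied to w, not added to g."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```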

ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks

ULSAM: an ultra-lightweight subspace attention module.
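
A heavily simplified sketch of the subspace-attention idea: split the channels into g subspaces, compute one spatial attention map per subspace, reweight that subspace, and add it back before concatenating. The single projection used for the attention map is an illustrative stand-in, not ULSAM's exact attention branch.

```python
import numpy as np

def softmax2d(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def subspace_attention(x, weights, groups):
    """x: (C, H, W) features; weights: list of (C // groups,) projection
    vectors, one per subspace (hypothetical parameters)."""
    C, H, W = x.shape
    k = C // groups
    out = []
    for gi in range(groups):
        f = x[gi * k:(gi + 1) * k]                              # one subspace
        attn = softmax2d(np.tensordot(weights[gi], f, axes=1))  # (H, W) map
        out.append(f + f * attn)                                # reweight + residual
    return np.concatenate(out, axis=0)
```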

On Layer Normalization in the Transformer Architecture

Pre-LN vs. Post-LN: where to place layer normalization in the Transformer architecture.
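
A minimal sketch contrasting the two layer-norm placements the paper analyzes. Here `sublayer` stands in for self-attention or the feed-forward block and `layer_norm` is a plain LayerNorm without learned scale and bias; both are placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Post-LN (original Transformer): normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: normalize the sublayer input; the residual path stays an
    # identity, which the paper links to stable training without LR warmup.
    return x + sublayer(layer_norm(x))
```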

A^2-Nets: Double Attention Networks

A^2-Net: double attention networks.
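
A minimal NumPy sketch of the double-attention idea: a gather step pools second-order global descriptors, and a distribute step sends them back to every position. The 1x1 projections producing A, B, and V are assumed to have been applied already, and all shapes here are (channels, positions); the final 1x1 projection and residual add are omitted.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def double_attention(A, B, V):
    """A: (c_m, n) features, B: (c_n, n) gather attention,
    V: (c_n, n) distribute attention, with n = h * w positions."""
    # Gather: c_n global descriptors, each an attention-weighted sum of A.
    G = A @ softmax(B, axis=1).T     # (c_m, c_n)
    # Distribute: each position mixes the descriptors with its own weights.
    Z = G @ softmax(V, axis=0)       # (c_m, n)
    return Z
```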