Averaging Weights Leads to Wider Optima and Better Generalization

SWA: finding wider minima via stochastic weight averaging.
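The core of SWA is a running equal-weighted average of model weights collected along the tail of the training trajectory. A minimal sketch of that update (plain Python on flat weight lists; real implementations operate on framework tensors, e.g. `torch.optim.swa_utils`):

```python
def swa_update(swa_weights, new_weights, n_averaged):
    """Fold one more checkpoint into the running average:
    w_swa <- (n * w_swa + w) / (n + 1)."""
    n = n_averaged
    return ([(n * ws + w) / (n + 1) for ws, w in zip(swa_weights, new_weights)],
            n + 1)

# Average three hypothetical checkpoints of a two-parameter model.
swa, n = [0.0, 0.0], 0
for ckpt in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    swa, n = swa_update(swa, ckpt, n)
# swa is now the elementwise mean of the three checkpoints: [3.0, 4.0]
```

Note that starting from zeros with `n_averaged = 0` makes the first call simply copy the first checkpoint, so no special case is needed.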

Decoupled Weight Decay Regularization

AdamW: decoupling weight decay regularization from the gradient update.
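The key point of AdamW is that weight decay is applied directly to the weights rather than folded into the gradient (folding it in would make it L2 regularization, which interacts with Adam's adaptive scaling). A scalar sketch of one update step, with the standard Adam defaults as assumed hyperparameters:

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step on a scalar weight. The decay term weight_decay * w
    is added outside the adaptive Adam direction, not to grad."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adamw_step(w, grad=0.5, m=m, v=v, t=1)
```

With L2 regularization instead, `weight_decay * w` would be added to `grad` before the moment updates, so the decay would be rescaled by `1 / sqrt(v_hat)`; decoupling avoids that.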

ULSAM: Ultra-Lightweight Subspace Attention Module for Compact Convolutional Neural Networks

ULSAM: an ultra-lightweight subspace attention module.

On Layer Normalization in the Transformer Architecture

Layer normalization in the Transformer architecture (Pre-LN vs. Post-LN placement).
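The difference the paper analyzes is only where LayerNorm sits relative to the residual connection. A minimal sketch of the two orderings, with `sublayer` and `layer_norm` as placeholder callables standing in for attention/FFN and normalization:

```python
def post_ln_block(x, sublayer, layer_norm):
    # Original Transformer (Post-LN): normalize after the residual add.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer, layer_norm):
    # Pre-LN variant: normalize inside the residual branch, so the
    # identity path x + ... is left untouched and gradients flow freely.
    return x + sublayer(layer_norm(x))
```

The Pre-LN form keeps a clean identity path through every block, which is why it trains stably without the learning-rate warm-up that Post-LN typically requires.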

A^2-Nets: Double Attention Networks

A^2-Net: double attention networks.

Neural Architecture Search for Lightweight Non-Local Networks

Neural architecture search for lightweight non-local networks.