Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows