Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

T2T-ViT: training vision Transformers from scratch on ImageNet.

Going deeper with Image Transformers

CaiT: a deeper vision Transformer.

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

ConvNeXt V2: co-designing and scaling ConvNets with MAE.

DeepViT: Towards Deeper Vision Transformer

DeepViT: building deeper vision Transformers.

Training data-efficient image transformers & distillation through attention

DeiT: training data-efficient vision Transformers via attention distillation.

Better plain ViT baselines for ImageNet-1k

Training better plain vision Transformer baselines on ImageNet-1k.