Do We Really Need Explicit Position Encodings for Vision Transformers?

Do vision Transformers really need explicit position encodings?

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

VT: token-based image representation and processing.

LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference

LeViT: a vision Transformer in ConvNet's clothing for faster inference.

Escaping the Big Data Paradigm with Compact Transformers

CCT: escaping big-data dependence with compact Transformers.

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

T2T-ViT: training vision Transformers from scratch on ImageNet.

Going deeper with Image Transformers

CaiT: going deeper with image Transformers.