An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ViT: image classification with a Transformer applied to sequences of image patches.
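To make "a sequence of image patches" concrete, here is a minimal sketch of the patch-splitting step only, in plain Python. The function name `patchify`, the nested-list image layout, and the 224 x 224 RGB input are assumptions chosen for illustration. Only the 16 x 16 patch size comes from the title. This is not the authors' implementation, and it omits everything that follows in the model (projecting each patch to an embedding, adding position information, and running the Transformer encoder).

```python
from typing import List

# Hypothetical layout for this sketch: image[row][col] is a list of channel values.
Image = List[List[List[float]]]


def patchify(image: Image, patch_size: int = 16) -> List[List[float]]:
    """Split an H x W x C image into flattened, non-overlapping patches.

    Returns (H // P) * (W // P) vectors, each of length P * P * C,
    ordered row-major over the patch grid.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            # Flatten one P x P patch pixel by pixel, channels innermost.
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])
            patches.append(flat)
    return patches


# Dummy all-zero 224 x 224 RGB image (resolution assumed for illustration).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch_size=16)
print(len(tokens), len(tokens[0]))  # prints: 196 768
```

With these assumed dimensions, the image becomes a sequence of 14 x 14 = 196 "words", each a vector of 16 x 16 x 3 = 768 raw pixel values, which is the kind of token sequence a Transformer can consume.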