Align before Fuse: Vision and Language Representation Learning with Momentum Distillation.
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision.