ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.
LXMERT: Learning Cross-Modality Encoder Representations from Transformers.