CoCa: Contrastive Captioners are Image-Text Foundation Models

VinVL: Revisiting Visual Representations in Vision-Language Models

SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

GIT: A Generative Image-to-text Transformer for Vision and Language

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
