GIT: A Generative Image-to-text Transformer for Vision and Language

VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

Multimodal Few-Shot Learning with Frozen Language Models

Unifying Vision-and-Language Tasks via Text Generation
