ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

ViLT:无卷积或区域监督的视觉语言Transformer.

Multimodal Few-Shot Learning with Frozen Language Models

冻结语言模型的多模态少样本学习.

Unifying Vision-and-Language Tasks via Text Generation

通过文本生成统一视觉和语言任务.

ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

ImageBERT:使用大规模弱监督图像文本数据进行跨模态预训练.

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

Pixel-BERT:使用深度多模态Transformer对齐图像像素和文本.