ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers