Align before Fuse: Vision and Language Representation Learning with Momentum Distillation