BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT: contextual encoder representations obtained from Transformers.
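
A minimal sketch (assuming PyTorch) of the masked-language-modeling corruption BERT pre-trains with: roughly 15% of tokens become prediction targets, of which 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. `mask_token_id` and `vocab_size` stand in for whatever tokenizer is used.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    # Pick ~15% of positions as prediction targets.
    labels = input_ids.clone()
    target = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~target] = -100  # non-target positions are ignored by cross-entropy

    corrupted = input_ids.clone()
    # 80% of targets -> [MASK]
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & target
    corrupted[masked] = mask_token_id
    # 10% of targets -> a random token (half of the remaining 20%)
    rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & target & ~masked
    corrupted[rand] = torch.randint(vocab_size, input_ids.shape)[rand]
    # the last 10% of targets keep their original token
    return corrupted, labels
```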

Deep contextualized word representations

ELMo: contextualized word embeddings from a language model.
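
A minimal sketch (assuming PyTorch) of how ELMo turns a pre-trained bidirectional language model into word embeddings: each token's embedding is a learned softmax-weighted sum of the biLM's layer representations, scaled by a task-specific factor gamma; the biLM itself is assumed to be given.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo-style combination: ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_j
        self.gamma = nn.Parameter(torch.ones(()))

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, dim) tensors, one per biLM layer
        weights = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed
```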

Deformable DETR: Deformable Transformers for End-to-End Object Detection

Deformable DETR: object detection with a multi-scale deformable attention module.
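
A simplified single-scale, single-head sketch (assuming PyTorch) of the deformable attention idea: each query predicts a small number of sampling offsets around its reference point together with softmax attention weights, and value features are read out by bilinear interpolation. The real module is multi-scale, multi-head, and normalizes offsets per feature level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.offset_head = nn.Linear(dim, n_points * 2)  # sampling offsets per query
        self.weight_head = nn.Linear(dim, n_points)      # attention weight per sample
        self.out_proj = nn.Linear(dim, dim)
        self.n_points = n_points

    def forward(self, query, ref_points, value_map):
        # query: (B, Q, C); ref_points: (B, Q, 2) as (x, y) in [0, 1]; value_map: (B, C, H, W)
        B, Q, C = query.shape
        offsets = self.offset_head(query).view(B, Q, self.n_points, 2)
        weights = self.weight_head(query).softmax(dim=-1)                 # (B, Q, K)
        # sampling locations, mapped to grid_sample's [-1, 1] range
        locs = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1    # (B, Q, K, 2)
        sampled = F.grid_sample(value_map, locs, align_corners=False)     # (B, C, Q, K)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1).transpose(1, 2)  # (B, Q, C)
        return self.out_proj(out)
```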

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

ViT: image classification with a Transformer over sequences of image patches.
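
A minimal sketch (assuming PyTorch) of the ViT front end: the image is cut into 16x16 patches, each patch is flattened and linearly projected (a strided convolution does both at once), a class token and learned position embeddings are added, and the resulting sequence is fed to a standard Transformer encoder.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                  # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)        # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # one class token per image
        return torch.cat([cls, x], dim=1) + self.pos_embed  # -> Transformer encoder
```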

Generative Pretraining from Pixels

iGPT: a pixel-level image pre-training model.
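
A minimal sketch (assuming PyTorch) of the iGPT autoregressive objective: the image is treated as a raster-order sequence of discrete pixel tokens and a Transformer decoder predicts each token from the ones before it. `model` is a placeholder for such a decoder, and the color-palette quantization step from the paper is omitted.

```python
import torch
import torch.nn as nn

def igpt_autoregressive_loss(model: nn.Module, pixels: torch.Tensor) -> torch.Tensor:
    # pixels: (B, H, W) palette indices; model: token sequence -> per-position logits
    seq = pixels.flatten(1)              # raster-order sequence of length H*W
    logits = model(seq[:, :-1])          # predict token t from tokens < t
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1)
    )
```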

Do We Need Zero Training Loss After Achieving Zero Training Error?

Flooding: prevents the training loss from reaching zero.
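
A minimal sketch (assuming PyTorch) of the flooding objective: once the training loss drops below the flood level b, the gradient direction flips, so the loss floats around b instead of going to zero. The flood level b is a hyperparameter; the default here is only illustrative.

```python
import torch

def flooding_loss(loss: torch.Tensor, b: float = 0.1) -> torch.Tensor:
    # J_flood(theta) = |J(theta) - b| + b: identical gradient above b,
    # reversed (gradient ascent) below b, keeping the training loss near b.
    return (loss - b).abs() + b
```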