- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Sequence Classification with Human Attention
- Phrase-Based & Neural Unsupervised Machine Translation
- What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties
- SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
- Deep contextualized word representations
- Meta-Learning for Low-Resource Neural Machine Translation
- Linguistically-Informed Self-Attention for Semantic Role Labeling
- A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
- Know What You Don’t Know: Unanswerable Questions for SQuAD
- An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
- Universal Language Model Fine-tuning for Text Classification
- Improving Language Understanding by Generative Pre-Training
- Dissecting Contextual Word Embeddings: Architecture and Representation