Course Schedule
Paper reading list and presenters
- Session 1: Introduction and Motivation (May 24, 19:00 - 19:50)
  - Presenters: Boqing Gong, Chen Sun
  - Motivation Slides
  - Introduction Prismia
- Session 2: Recurrent Networks, Attention, Transformers (May 24, 20:00 - 21:30)
  - Presenters: Boqing Gong, Chen Sun
  - RNN, Attention Prismia
  - Transformers Slides
  - Readings:
    - The Annotated Transformer
    - Attention Is All You Need
    - Neural Machine Translation by Jointly Learning to Align and Translate
- Session 3: Transformers for Vision and Long Sequences (May 24, 21:30 - 23:00)
  - Presenters: Boqing Gong, Chen Sun
  - Vision Transformer Prismia
  - Transformer for Long Sequences Slides
  - Readings:
    - An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
    - Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
    - ViViT: A Video Vision Transformer
    - Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
    - Big Bird: Transformers for Longer Sequences
    - Long Range Arena: A Benchmark for Efficient Transformers
- Session 4: Optimization for Transformers (May 25, 19:00 - 19:50)
  - Presenter: Boqing Gong
  - Prismia
  - Readings:
    - When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
    - Surrogate Gap Minimization Improves Sharpness-Aware Training
- Session 5: Transformers for Decision Making (May 25, 20:00 - 20:50)
  - Presenter: Chen Sun
  - Slides
  - Recording
  - Readings:
    - Decision Transformer: Reinforcement Learning via Sequence Modeling
    - Offline Reinforcement Learning as One Big Sequence Modeling Problem
    - VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
    - Episodic Transformer for Vision-and-Language Navigation
- Session 6: Multimodal Transformers (May 26, 19:00 - 20:50)
  - Presenters: Boqing Gong, Chen Sun
  - Chen’s slides
  - Boqing’s slides
  - Recording
  - Readings:
    - Attention Bottlenecks for Multimodal Fusion
    - VideoBERT: A Joint Model for Video and Language Representation Learning
    - VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
    - CLIP: Connecting Text and Images
    - Learning Temporal Dynamics from Cycles in Narrated Video
- Session 7: Model Interpretability (May 26, 21:00 - 21:50)
  - Presenter: Chen Sun
  - Slides
  - Recording
  - Readings:
    - A Primer in BERTology: What we know about how BERT works
    - BERT Rediscovers the Classical NLP Pipeline
    - Do Vision-Language Pretrained Models Learn Primitive Concepts?
    - Does Vision-and-Language Pretraining Improve Lexical Grounding?
- Session 8: Advanced Topics and Recap (May 26, 22:00 - 22:50)
  - Presenter: Boqing Gong
  - Slides
  - Readings:
    - MLP-Mixer: An all-MLP Architecture for Vision