Course Schedule

Paper reading list and presenters for each session

Session 1
Introduction and Motivation (May 24, 19:00 - 19:50)
Boqing Gong, Chen Sun
  1. Motivation Slides
  2. Introduction Prismia

Session 2
Recurrent Networks, Attention, Transformers (May 24, 20:00 - 21:30)
Boqing Gong, Chen Sun
  1. RNN, Attention Prismia
  2. Transformers Slides
  3. The Annotated Transformer
  4. Attention Is All You Need
  5. Neural Machine Translation by Jointly Learning to Align and Translate

Session 3
Transformers for Vision and Long Sequences (May 24, 21:30 - 23:00)
Boqing Gong, Chen Sun
  1. Vision Transformer Prismia
  2. Transformer for Long Sequences Slides
  3. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  4. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  5. ViViT: A Video Vision Transformer
  6. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  7. Big Bird: Transformers for Longer Sequences
  8. Long Range Arena: A Benchmark for Efficient Transformers

Session 4
Optimization for Transformers (May 25, 19:00 - 19:50)
Boqing Gong
  1. Prismia
  2. When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
  3. Surrogate Gap Minimization Improves Sharpness-Aware Training

Session 5
Transformers for Decision Making (May 25, 20:00 - 20:50)
Chen Sun
  1. Slides
  2. Recording
  3. Decision Transformer: Reinforcement Learning via Sequence Modeling
  4. Offline Reinforcement Learning as One Big Sequence Modeling Problem
  5. VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
  6. Episodic Transformer for Vision-and-Language Navigation

Session 6
Multimodal Transformers (May 26, 19:00 - 20:50)
Boqing Gong, Chen Sun
  1. Chen’s slides
  2. Boqing’s slides
  3. Recording
  4. Attention Bottlenecks for Multimodal Fusion
  5. VideoBERT: A Joint Model for Video and Language Representation Learning
  6. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
  7. CLIP: Connecting Text and Images
  8. Learning Temporal Dynamics from Cycles in Narrated Video

Session 7
Model Interpretability (May 26, 21:00 - 21:50)
Chen Sun
  1. Slides
  2. Recording
  3. A Primer in BERTology: What we know about how BERT works
  4. BERT Rediscovers the Classical NLP Pipeline
  5. Do Vision-Language Pretrained Models Learn Primitive Concepts?
  6. Does Vision-and-Language Pretraining Improve Lexical Grounding?

Session 8
Advanced Topics, Recap (May 26, 22:00 - 22:50)
Boqing Gong
  1. Slides
  2. MLP-Mixer: An all-MLP Architecture for Vision