
Transformer Architectures for Multimodal Signal Processing and Decision Making

Time and Location

Instructors: Chen Sun and Boqing Gong


  • Tuesday, 24 May 2022, 19:00 - 23:00 (UTC+8)
  • Wednesday, 25 May 2022, 19:00 - 21:00 (UTC+8)
  • Thursday, 26 May 2022, 19:00 - 23:00 (UTC+8)


Transformer neural architectures have become the de-facto model of choice in natural language processing (NLP). In computer vision, there has recently been a surge of interest in end-to-end Transformers, prompting efforts to replace hand-wired features or inductive biases with general-purpose neural architectures powered by data-driven training. Transformer architectures have also achieved state-of-the-art performance in multimodal learning, protein structure prediction, decision making, and so on.

These results indicate the Transformer architectures' great potential beyond the previously mentioned domains, including in the signal processing (SP) community. We envision that these efforts may lead to a unified knowledge base that produces versatile representations for different data modalities, simplifying the inference and deployment of deep learning models in various application scenarios. Hence, we believe it is timely to provide a short course on the Transformer architectures and related learning algorithms.

In this short course, we provide a deep dive into these neural architectures and how they work, focusing on their impact on self-supervised learning, a technique that trains machine learning models without requiring labeled data, and on multimodal learning, which leverages multiple input sources such as vision, audio, and text. We will also study recent attempts to interpret these models, revealing potential risks of model bias. This course aims to equip the audience with knowledge of the Transformer neural architectures and related learning algorithms, so that they can apply them to their own research and further advance the state of the art.

Learning Goals

We anticipate students will:

  • Become familiar with self-attention and other building blocks of Transformers, the vanilla Transformer architecture, and its variations
  • Learn about Transformers’ applications in computer vision and natural language processing: ViT, Swin-Transformers, BERT, GPT-3, etc.
  • Understand supervised, self-supervised, and multimodal self-supervised learning algorithms for training a Transformer
  • Acquire visualization methods to inspect a Transformer
  • Learn advanced topics: related neural architectures (e.g., MLP-Mixer), applications in visual navigation, decision Transformers, etc.
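To give a flavor of the first goal, self-attention reduces to the scaled dot-product attention of "Attention Is All You Need": each token's output is a weighted average of all tokens' values, with weights given by a softmax over query-key similarities. Below is a minimal NumPy sketch (an illustration only, not course material; real Transformer layers add learned projections, multiple heads, and masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: arrays of shape (n_tokens, d_k) / (n_tokens, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys (rows sum to 1)
    return weights @ V                              # weighted sum of value vectors

# Toy example: 3 tokens with 4-dim embeddings. In self-attention the same
# sequence serves as queries, keys, and values.
x = np.random.randn(3, 4)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # same shape as the input sequence
```

Because each row of the softmax weights sums to 1, every output token is a convex combination of the value vectors, which is what lets attention mix information across the whole sequence in a single layer.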




Paper reading list and presenters

Session 1
Introduction and Motivation (May 24, 19:00 - 19:50)
Boqing Gong, Chen Sun
Session 2
Recurrent Networks, Attention, Transformers (May 24, 20:00 - 21:30)
Boqing Gong, Chen Sun
  1. The Annotated Transformer
  2. Attention Is All You Need
  3. Neural Machine Translation by Jointly Learning to Align and Translate
Session 3
Transformers for Vision and Long Sequences (May 24, 21:30 - 23:00)
Boqing Gong, Chen Sun
  1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  2. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  3. ViViT: A Video Vision Transformer
  4. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  5. Big Bird: Transformers for Longer Sequences
  6. Long Range Arena: A Benchmark for Efficient Transformers
Session 4
Optimization for Transformers (May 25, 19:00 - 19:50)
Boqing Gong
  1. When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
  2. Surrogate Gap Minimization Improves Sharpness-Aware Training
Session 5
Transformers for Decision Making (May 25, 20:00 - 20:50)
Chen Sun
  1. Decision Transformer: Reinforcement Learning via Sequence Modeling
  2. Offline Reinforcement Learning as One Big Sequence Modeling Problem
  3. VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
  4. Episodic Transformer for Vision-and-Language Navigation
Session 6
Multimodal Transformers (May 26, 19:00 - 20:50)
Boqing Gong, Chen Sun
  1. Attention Bottlenecks for Multimodal Fusion
  2. VideoBERT: A Joint Model for Video and Language Representation Learning
  3. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
  4. CLIP: Connecting Text and Images
  5. Learning Temporal Dynamics from Cycles in Narrated Video
Session 7
Model Interpretability (May 26, 21:00 - 21:50)
Chen Sun
  1. A Primer in BERTology: What we know about how BERT works
  2. BERT Rediscovers the Classical NLP Pipeline
  3. Do Vision-Language Pretrained Models Learn Primitive Concepts?
  4. Does Vision-and-Language Pretraining Improve Lexical Grounding?
Session 8
Advanced Topics, Recap (May 26, 22:00 - 22:50)
Boqing Gong
  1. MLP-Mixer: An all-MLP Architecture for Vision