
Transformer Architectures for Multimodal Signal Processing and Decision Making

Time and Location

Instructors: Chen Sun and Boqing Gong


  • Tuesday, 24 May 2022, 19:00 - 23:00 (UTC+8)
  • Wednesday, 25 May 2022, 19:00 - 21:00 (UTC+8)
  • Thursday, 26 May 2022, 19:00 - 23:00 (UTC+8)


Transformer neural architectures have become the de-facto model of choice in natural language processing (NLP). In computer vision, there has recently been a surge of interest in end-to-end Transformers, prompting efforts to replace hand-wired features or inductive biases with general-purpose neural architectures powered by data-driven training. Transformer architectures have also achieved state-of-the-art performance in multimodal learning, protein structure prediction, decision making, and so on.

These results indicate the Transformer architectures' great potential beyond the previously mentioned domains, including in the signal processing (SP) community. We envision that these efforts may lead to a unified knowledge base that produces versatile representations for different data modalities, simplifying the inference and deployment of deep learning models in various application scenarios. Hence, we believe it is timely to provide a short course on the Transformer architectures and related learning algorithms.

In this short course, we provide a deep dive into these neural architectures and how they work, focusing on their impact on self-supervised learning, a technique that trains machine learning models without requiring labeled data, and on multimodal learning, which leverages multiple input sources such as vision, audio, and text. We will also study recent attempts to interpret these models, revealing potential risks of model bias. This course aims to equip the audience with knowledge of the Transformer neural architectures and related learning algorithms, so that they can apply them to their own research and further advance the state of the art.

Learning Goals

We anticipate students will:

  • Become familiar with self-attention and other building blocks of Transformers, the vanilla Transformer architecture, and its variations
  • Learn about Transformers’ applications in computer vision and natural language processing: ViT, Swin-Transformers, BERT, GPT-3, etc.
  • Understand supervised, self-supervised, and multimodal self-supervised learning algorithms for training a Transformer
  • Acquire visualization methods to inspect a Transformer
  • Learn advanced topics: related neural architectures (e.g., MLP-Mixer), applications in visual navigation, decision Transformers, etc.
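To give a flavor of the first goal, self-attention reduces to the scaled dot-product attention of "Attention Is All You Need": each token's output is a weighted average of all tokens' values, with weights given by a softmax over query-key similarities. Below is a minimal NumPy sketch (an illustration only, not course material; real Transformer layers add learned projections, multiple heads, and masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: arrays of shape (n_tokens, d_k) / (n_tokens, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys (rows sum to 1)
    return weights @ V                              # weighted sum of value vectors

# Toy example: 3 tokens with 4-dim embeddings. In self-attention the same
# sequence serves as queries, keys, and values.
x = np.random.randn(3, 4)
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # same shape as the input sequence
```

Because each row of the softmax weights sums to 1, every output token is a convex combination of the value vectors, which is what lets attention mix information across the whole sequence in a single layer.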




Paper reading list and presenters

Session 1
Introduction and Motivation (May 24, 19:00 - 19:50)
Boqing Gong, Chen Sun
Session 2
Recurrent Networks, Attention, Transformers (May 24, 20:00 - 21:30)
Boqing Gong, Chen Sun
  1. The Annotated Transformer
  2. Attention Is All You Need
  3. Neural Machine Translation by Jointly Learning to Align and Translate
Session 3
Transformers for Vision and Long Sequences (May 24, 21:30 - 23:00)
Boqing Gong, Chen Sun
  1. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
  2. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  3. ViViT: A Video Vision Transformer
  4. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
  5. Big Bird: Transformers for Longer Sequences
  6. Long Range Arena: A Benchmark for Efficient Transformers
Session 4
Optimization for Transformers (May 25, 19:00 - 19:50)
Boqing Gong
  1. When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
  2. Surrogate Gap Minimization Improves Sharpness-Aware Training
Session 5
Transformers for Decision Making (May 25, 20:00 - 20:50)
Chen Sun
  1. Decision Transformer: Reinforcement Learning via Sequence Modeling
  2. Offline Reinforcement Learning as One Big Sequence Modeling Problem
  3. VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation
  4. Episodic Transformer for Vision-and-Language Navigation
Session 6
Multimodal Transformers (May 26, 19:00 - 20:50)
Boqing Gong, Chen Sun
  1. Attention Bottlenecks for Multimodal Fusion
  2. VideoBERT: A Joint Model for Video and Language Representation Learning
  3. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
  4. CLIP: Connecting Text and Images
  5. Learning Temporal Dynamics from Cycles in Narrated Video
Session 7
Model Interpretability (May 26, 21:00 - 21:50)
Chen Sun
  1. A Primer in BERTology: What we know about how BERT works
  2. BERT Rediscovers the Classical NLP Pipeline
  3. Do Vision-Language Pretrained Models Learn Primitive Concepts?
  4. Does Vision-and-Language Pretraining Improve Lexical Grounding?
Session 8
Advanced Topics, Recap (May 26, 22:00 - 22:50)
Boqing Gong
  1. MLP-Mixer: An all-MLP Architecture for Vision