Transformer Architectures for Multimodal Signal Processing and Decision Making
Time and Location
Instructors: Chen Sun and Boqing Gong
Sessions:
- Tuesday, 24 May 2022, 19:00 - 23:00 (UTC+8)
- Wednesday, 25 May 2022, 19:00 - 21:00 (UTC+8)
- Thursday, 26 May 2022, 19:00 - 23:00 (UTC+8)
About
Transformer neural architectures have become the de facto models of choice in natural language processing (NLP). In computer vision, there has recently been a surge of interest in end-to-end Transformers, prompting efforts to replace hand-wired features or inductive biases with general-purpose neural architectures powered by data-driven training. Transformer architectures have also achieved state-of-the-art performance in multimodal learning, protein structure prediction, decision making, and more.
These results indicate the great potential of Transformer architectures beyond the domains mentioned above, including in the signal processing (SP) community. We envision that these efforts may lead to a unified knowledge base that produces versatile representations for different data modalities, simplifying the inference and deployment of deep learning models in various application scenarios. Hence, we believe it is timely to provide a short course on Transformer architectures and related learning algorithms.
In this short course, we provide a deep dive into these neural architectures, explain how they work, and focus on their impact on self-supervised learning, which trains machine learning models without requiring labeled data, and on multimodal learning, which leverages multiple input sources such as vision, audio, and text. We will also study recent attempts to interpret these models, revealing potential risks of model bias. The course aims to equip the audience with knowledge of Transformer neural architectures and related learning algorithms so that they can apply them to their own research and further advance the state of the art in their fields.
Learning Goals
We anticipate students will:
- Become familiar with self-attention and other building blocks of Transformers, the vanilla Transformer architecture, and its variations (a minimal code sketch of self-attention follows this list)
- Learn about Transformers’ applications in computer vision and natural language processing: ViT, Swin-Transformers, BERT, GPT-3, etc.
- Understand supervised, self-supervised, and multimodal self-supervised learning algorithms for training a Transformer
- Acquire visualization methods to inspect a Transformer
- Learn advanced topics: related neural architectures (e.g., MLP-Mixer), applications in visual navigation, decision Transformers, etc.
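To make the first learning goal above concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core building block of the vanilla Transformer (Vaswani et al., "Attention is all you need"). The shapes, variable names, and random inputs are illustrative only and are not taken from the course materials.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise similarities, scaled by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability before softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key dimension
    return weights @ V                              # attention-weighted sum of values

# Self-attention: queries, keys, and values are all projections of the same sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                         # 5 tokens, embedding dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)                                    # (5, 8)
```

A full Transformer layer repeats this in multiple heads, concatenates the results, and follows with a position-wise feed-forward network, residual connections, and layer normalization, which the course covers in Session 2.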
Resources
Pre-Reading
- Convolutional neural networks
- Neural machine translation with recurrent neural networks and attention
- Attention is all you need
- Transformers for image recognition at scale
Syllabus
Paper reading list and presenters
- Session 1
- Introduction and Motivation (May 24, 19:00 - 19:50)
- Boqing Gong, Chen Sun
- Session 2
- Recurrent Networks, Attention, Transformers (May 24, 20:00 - 21:30)
- Boqing Gong, Chen Sun
- Session 3
- Transformers for Vision and Long Sequences (May 24, 21:30 - 23:00)
- Boqing Gong, Chen Sun
- Vision Transformer Prismia
- Transformer for Long Sequences Slides
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- ViViT: A Video Vision Transformer
- Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
- Big Bird: Transformers for Longer Sequences
- Long Range Arena: A Benchmark for Efficient Transformers
- Session 4
- Optimization for Transformers (May 25, 19:00 - 19:50)
- Boqing Gong
- Session 5
- Transformers for Decision Making (May 25, 20:00 - 20:50)
- Chen Sun
- Session 6
- Multimodal Transformers (May 26, 19:00 - 20:50)
- Boqing Gong, Chen Sun
- Chen’s slides
- Boqing’s slides
- Recording
- Attention Bottlenecks for Multimodal Fusion
- VideoBERT: A Joint Model for Video and Language Representation Learning
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
- CLIP: Connecting Text and Images
- Learning Temporal Dynamics from Cycles in Narrated Video
- Session 7
- Model Interpretability (May 26, 21:00 - 21:50)
- Chen Sun
- Session 8
- Advanced Topics, Recap (May 26, 22:00 - 22:50)
- Boqing Gong