Chapter 1 Introduction

Author: Nadja Sauter

Supervisor: Matthias Aßenmacher

1.1 Introduction to Multimodal Deep Learning

There are five basic human senses: hearing, touch, smell, taste and sight. Possessing these five modalities, we are able to perceive and understand the world around us. Thus, “multimodal” means to combine different channels of information simultaneously to understand our surroundings. For example, when toddlers learn the word “cat”, they use different modalities by saying the word out loud, pointing at cats and making sounds like “meow”. Using the human learning process as a role model, artificial intelligence (AI) researchers also try to combine different modalities to train deep learning models. On a superficial level, deep learning algorithms are based on a neural network that is trained to optimize some objective which is mathematically defined via the so-called loss function. The optimization, i.e. minimizing the loss, is done via a numerical procedure called gradient descent. Consequently, deep learning models can only handle numeric input and can only produce numeric output. However, in multimodal tasks we are often confronted with unstructured data like pictures or text. Thus, the first major problem is how to represent the input numerically. The second issue with regard to multimodal tasks is how exactly to combine different modalities. For instance, a typical task could be to train a deep learning model to generate a picture of a cat. First of all, the computer needs to understand the text input “cat” and then somehow translate this information into a specific image. Therefore, it is necessary to identify the contextual relationships between words in the text input and the spatial relationships between pixels in the image output. What might be easy for a toddler in pre-school is a huge challenge for the computer. Both have to learn some understanding of the word “cat” that comprises the meaning and appearance of the animal. A common approach in modern deep learning is to generate embeddings that represent the cat numerically as a vector in some latent space. However, to achieve this, different approaches and algorithmic architectures have been developed in recent years. This book gives an overview of the different methods used in state-of-the-art (SOTA) multimodal deep learning to overcome challenges arising from unstructured data and from combining inputs of different modalities.
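
To make the two ideas of numeric representations and loss minimization via gradient descent more concrete, the following is a minimal, purely illustrative NumPy sketch (the vocabulary, embedding size and target vector are made up and do not correspond to any model in this book): the word “cat” is mapped to an index, an embedding vector is looked up, and this vector is moved towards a target point in a latent space by repeatedly taking gradient descent steps on a squared-error loss.

```python
import numpy as np

# Text must first be mapped to numbers: a toy vocabulary of word indices.
vocab = {"cat": 0, "dog": 1, "meow": 2}

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))  # one 4-dimensional vector per word

# A made-up target vector that the embedding of "cat" should move towards,
# e.g. a point in a latent space that also represents images of cats.
target = np.array([1.0, 0.0, 0.0, 0.0])

learning_rate = 0.1
for _ in range(100):
    vec = embeddings[vocab["cat"]]
    loss = np.sum((vec - target) ** 2)                      # squared-error loss
    grad = 2 * (vec - target)                               # gradient of the loss w.r.t. vec
    embeddings[vocab["cat"]] = vec - learning_rate * grad   # one gradient descent step

print(embeddings[vocab["cat"]].round(3))
print("loss:", round(float(np.sum((embeddings[vocab["cat"]] - target) ** 2)), 6))
```

Real multimodal models replace the hand-crafted target with learned objectives and the lookup table with deep encoders, but the underlying recipe of numeric embeddings plus gradient-based loss minimization stays the same.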

1.2 Outline of the Booklet

Since multimodal models often use text and images as input or output, methods of Natural Language Processing (NLP) and Computer Vision (CV) are introduced as a foundation in Chapter 2. Methods in the area of NLP handle text data, whereas CV deals with image processing. With regard to NLP (subsection 2.1), one concept of major importance is the so-called word embedding, which is nowadays an essential part of (nearly) all multimodal deep learning architectures. This concept also lays the foundation for transformer-based models like BERT (Devlin et al. 2018a), which achieved a huge improvement in several NLP tasks. The (self-)attention mechanism of transformers (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, et al. 2017a) in particular revolutionized NLP models, which is why most of them rely on the transformer as a backbone. In Computer Vision (subsection 2.2), different network architectures, namely ResNet (He et al. 2015), EfficientNet (M. Tan and Le 2019a), SimCLR (T. Chen, Kornblith, Norouzi, and Hinton 2020a) and BYOL (Grill, Strub, Altché, et al. 2020), will be introduced. In both fields it is of great interest to compare the different approaches and their performance on challenging benchmarks. For this reason, the last subsection 2.3 of Chapter 2 gives an overview of different data sets, pre-training tasks and benchmarks for CV as well as for NLP.
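
To make the attention mechanism mentioned above more concrete, here is a minimal NumPy sketch of scaled dot-product self-attention in the sense of Vaswani et al. (2017a). In real transformers, queries, keys and values are obtained via learned linear projections and split across multiple heads; the sketch omits these details, and the token embeddings are random placeholders.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                         # pairwise similarities
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                                      # weighted sum of value vectors

# Self-attention: queries, keys and values all stem from the same sequence.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))   # 5 token embeddings of dimension 8
contextualized = scaled_dot_product_attention(tokens, tokens, tokens)
print(contextualized.shape)        # (5, 8): one context-aware vector per token
```

Each output vector is a weighted average of all value vectors, with the weights determined by how strongly the corresponding token attends to every other token; this is what allows transformers to model contextual relationships across an entire sequence.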

Chapter 3 focuses on different multimodal architectures, covering a wide variety of ways in which text and images can be combined. The presented models combine and advance different methods of NLP and CV. First of all, looking at Img2Text tasks (subsection 3.1), the data set Microsoft COCO for object recognition (T.-Y. Lin, Maire, Belongie, Bourdev, et al. 2014) and the meshed-memory transformer for image captioning (M2 Transformer) (Cornia et al. 2019) will be presented. Conversely, researchers have developed methods to generate pictures based on a short text prompt (subsection 3.2). The first models accomplishing this task were generative adversarial networks (GANs) (Ian J. Goodfellow et al. 2014a) and variational autoencoders (VAEs) (Kingma and Welling 2019). These methods have been improved in recent years, and today’s SOTA transformer architectures and text-guided diffusion models like DALL-E (Ramesh, Pavlov, et al. 2021a) and GLIDE (Nichol et al. 2021a) achieve remarkable results. Another interesting question is how images can be utilized to support language models (subsection 3.3). This can be done via sequential embeddings, more advanced grounded embeddings or, again, inside transformers. On the other hand, one can also look at text supporting CV models like CLIP (Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, et al. 2021a), ALIGN (C. Jia, Yang, Xia, Chen, et al. 2021b) and Florence (Yuan et al. 2021) (subsection 3.4). These are foundation models, meaning that they are reused inside other architectures (e.g. CLIP inside DALL-E 2), and they rely on a contrastive loss to connect text with images. In addition, zero-shot classification makes it possible to classify new and unseen data without expensive fine-tuning. Especially the open-source architecture CLIP (Radford, Kim, Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, Krueger, et al. 2021a) for image classification and generation attracted a lot of attention last year. At the end of Chapter 3, some further architectures that handle text and images simultaneously are introduced (subsection 3.5). For instance, Data2Vec uses the same learning method for speech, vision and language, and in this way aims to find a general approach that handles different modalities in one architecture. Furthermore, ViLBERT (J. Lu et al. 2019a) extends the popular BERT architecture to handle both image and text as input by implementing co-attention. This method is also used in DeepMind’s Flamingo (Alayrac et al. 2022). In addition, Flamingo aims to tackle multiple tasks with a single visual language model via few-shot learning and by freezing the pre-trained vision and language models.
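
To make the contrastive loss used by models like CLIP and ALIGN more concrete, the following minimal NumPy sketch computes a symmetric contrastive loss over a batch of image and text embeddings. The embeddings here are random stand-ins for encoder outputs, and the fixed temperature is a simplification; real models learn the encoders and the temperature jointly.

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over cosine similarities; matching
    image-text pairs are assumed to sit on the diagonal of the batch."""
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature        # (batch, batch) similarity matrix

    def cross_entropy(l):
        # Row-wise softmax cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(4, 16))  # stand-ins for image encoder outputs
text_embeddings = rng.normal(size=(4, 16))   # stand-ins for text encoder outputs
print(contrastive_loss(image_embeddings, text_embeddings))
```

Minimizing this loss pulls matching image-text pairs together in the shared latent space and pushes non-matching pairs apart, which is also what enables zero-shot classification: a new image is simply assigned to the class description whose text embedding is most similar.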

In the last chapter (see 4), methods are introduced that are also able to handle modalities other than text and image, e.g. video, speech or tabular data. The overall goal here is to find a general multimodal architecture based on challenges rather than modalities. Therefore, one needs to handle problems of multimodal fusion and alignment and decide whether a joint or a coordinated representation is used (subsection 4.1). Moreover, we go into more detail about how exactly to combine structured and unstructured data (subsection 4.2). To this end, different fusion strategies that have evolved in recent years will be presented. This is illustrated in this book by two use cases in survival analysis and economics. Besides this, another interesting research question is how to tackle different tasks in one so-called multi-purpose model (subsection 4.3), as Google researchers intend to do with their “Pathways” model (Barham et al. 2022). Last but not least, we show one exemplary application of multimodal deep learning in the arts scene, where image generation models like DALL-E (Ramesh, Pavlov, et al. 2021a) are used to create art pieces in the area of generative art (subsection 4.4).
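
Returning to the fusion strategies mentioned for subsection 4.2, the following minimal NumPy sketch contrasts two common options for combining structured (tabular) and unstructured (image) data: joint (early) fusion by concatenating feature vectors, and late fusion by combining per-modality predictions. All features, shapes and weights are random placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image_features = rng.normal(size=(32, 128))   # e.g. output of a CNN image encoder
tabular_features = rng.normal(size=(32, 10))  # e.g. standardized tabular covariates

# Joint (early) fusion: concatenate both representations into one vector
# that a downstream network then maps to the prediction.
joint_representation = np.concatenate([image_features, tabular_features], axis=1)

# Late fusion: each modality gets its own (here randomly initialized) linear
# head, and the per-modality scores are combined only at the end.
w_image = rng.normal(size=(128, 1))
w_tabular = rng.normal(size=(10, 1))
score_image = image_features @ w_image
score_tabular = tabular_features @ w_tabular
late_fused_score = 0.5 * (score_image + score_tabular)

print(joint_representation.shape)  # (32, 138)
print(late_fused_score.shape)      # (32, 1)
```

Which strategy works better depends on how strongly the modalities interact: early fusion lets the model learn cross-modal interactions directly, while late fusion keeps the modality-specific models simple and robust to a missing modality.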

References

Alayrac, Jean-Baptiste, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, et al. 2022. “Flamingo: A Visual Language Model for Few-Shot Learning.” arXiv Preprint arXiv:2204.14198.

Barham, Paul, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, et al. 2022. “Pathways: Asynchronous Distributed Dataflow for ML.” arXiv. https://doi.org/10.48550/ARXIV.2203.12533.

Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020a. “A Simple Framework for Contrastive Learning of Visual Representations.” arXiv. https://doi.org/10.48550/ARXIV.2002.05709.

Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2019. “Meshed-Memory Transformer for Image Captioning.” arXiv. https://doi.org/10.48550/ARXIV.1912.08226.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv. https://doi.org/10.48550/ARXIV.1810.04805.

Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014a. “Generative Adversarial Networks.” arXiv. https://doi.org/10.48550/ARXIV.1406.2661.

Grill, Jean-Bastien, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, et al. 2020. “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.” arXiv. https://doi.org/10.48550/ARXIV.2006.07733.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Deep Residual Learning for Image Recognition.” CoRR. http://arxiv.org/abs/1512.03385.

Jia, Chao, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021b. “Scaling up Visual and Vision-Language Representation Learning with Noisy Text Supervision.” CoRR. https://arxiv.org/abs/2102.05918.

Kingma, Diederik P., and Max Welling. 2019. “An Introduction to Variational Autoencoders.” CoRR. http://arxiv.org/abs/1906.02691.

Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2014. “Microsoft COCO: Common Objects in Context.” arXiv. https://doi.org/10.48550/ARXIV.1405.0312.

Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019a. “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.” arXiv. https://doi.org/10.48550/ARXIV.1908.02265.

Nichol, Alex, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021a. “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.” arXiv. https://doi.org/10.48550/ARXIV.2112.10741.

Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021a. “Learning Transferable Visual Models from Natural Language Supervision.” arXiv. https://doi.org/10.48550/ARXIV.2103.00020.

Ramesh, Aditya, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021a. “Zero-Shot Text-to-Image Generation.” arXiv. https://doi.org/10.48550/ARXIV.2102.12092.

Tan, Mingxing, and Quoc V. Le. 2019a. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” arXiv. https://doi.org/10.48550/ARXIV.1905.11946.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. “Attention Is All You Need.” In Advances in Neural Information Processing Systems, edited by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Yuan, Lu, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, et al. 2021. “Florence: A New Foundation Model for Computer Vision.” arXiv Preprint arXiv:2111.11432.