Chapter 3: Transformer
The Transformer, as introduced in [1], is a deep learning architecture designed for sequence-to-sequence tasks in natural language processing. It replaces the recurrent layers of traditional RNN-based models such as LSTMs with self-attention mechanisms, allowing entire sequences to be processed in parallel and removing the sequential-processing bottleneck of recurrence. This architecture has become the foundation for state-of-the-art models in NLP tasks such as machine translation, text summarization, and language understanding. In this chapter we first introduce the Transformer, explore its main components (the encoder and the decoder), and finally discuss extensions of the architecture such as Transformer-XL and efficient Transformers.
References
- [1] Vaswani et al. (2017): Attention Is All You Need
Additional Resources
- Very good video explaining the Transformer and Attention
- 3Blue1Brown Video series about the Transformer
Chapter 03.01: A universal deep learning architecture
Transformers have been adapted to many domains and tasks beyond traditional sequence-to-sequence problems in NLP. This chapter presents a few examples of models that apply the transformer architecture to other domains:
- Vision Transformer (ViT) [1]: applies the transformer architecture to image classification by treating image patches as a token sequence, achieving performance competitive with convolutional neural networks (CNNs); a sketch of this patch embedding is given below.
- CLIP [2]: connects images and text through a unified embedding space, enabling tasks such as zero-shot image classification and image-text retrieval.
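To make the ViT idea concrete, here is a minimal sketch (assuming PyTorch; the class name and the ViT-Base-like hyperparameters are illustrative, not taken from the paper's code) of how an image becomes a token sequence that a standard transformer encoder can consume:

```python
# Sketch of ViT-style patch embedding: slice the image into non-overlapping
# patches and linearly project each patch to a d_model-dimensional token.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements "cut into patches + linear projection".
        self.proj = nn.Conv2d(in_chans, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (batch, 3, 224, 224)
        x = self.proj(x)                       # (batch, d_model, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (batch, 196, d_model): a token sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```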
Chapter 03.02: The Encoder
The encoder in a transformer model is responsible for processing the input sequence and generating a contextualized representation of each token, capturing both local and global dependencies within the sequence. It achieves this through self-attention, which lets each token attend to every other token in the input sequence, so the model can capture relationships between tokens regardless of their positions. Each encoder layer additionally contains a position-wise feed-forward network that further refines the representations; information about token order itself is injected via positional encodings added to the input embeddings.
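The following is a minimal sketch of a single encoder layer in PyTorch (post-norm variant, as in the original paper; the dimensions and dropout rate are illustrative assumptions):

```python
# One Transformer encoder layer: self-attention + position-wise FFN,
# each wrapped in a residual connection followed by layer normalization.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        # Self-attention: every token attends to every other token.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))  # residual + layer norm
        # Position-wise FFN refines each token representation independently.
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

out = EncoderLayer()(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```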
Chapter 03.03: The Decoder
The decoder in a transformer model is responsible for generating the output sequence based on the contextualized representations produced by the encoder, enabling tasks such as sequence generation and machine translation. It uses masked self-attention to capture dependencies among the tokens generated so far (the mask prevents a position from attending to future positions), and cross-attention to attend to the encoder output, letting the model focus on the relevant parts of the input during decoding. Each decoder layer additionally contains a position-wise feed-forward network to further refine the representations, and the output sequence is generated token by token.
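A minimal sketch of one decoder layer in PyTorch (dropout omitted for brevity; dimensions are illustrative assumptions):

```python
# One Transformer decoder layer: masked self-attention, cross-attention
# over the encoder output, and a position-wise FFN.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory):
        # Causal mask: True entries are blocked, so position i cannot see j > i.
        causal = torch.triu(torch.ones(tgt.size(1), tgt.size(1),
                                       dtype=torch.bool), diagonal=1)
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norms[0](tgt + x)
        # Cross-attention: queries from the decoder, keys/values from the encoder output.
        x, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norms[1](tgt + x)
        return self.norms[2](tgt + self.ff(tgt))

out = DecoderLayer()(torch.randn(2, 7, 512), torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 7, 512])
```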
Chapter 03.04: Transformer Parameter Count
This chapter deals with the number of parameters of the transformer. The parameter count of a transformer model is the total number of learnable parameters in its architecture, distributed across the following components:
- Embedding layers: parameters of the input and output token embeddings, which encode the tokens' semantic meanings.
- Encoder layers: parameters of each encoder layer, i.e., the self-attention projections, the position-wise feed-forward network, and layer normalization.
- Decoder layers: parameters of each decoder layer, i.e., the self-attention and cross-attention projections, the position-wise feed-forward network, and layer normalization.
- Positional encodings: learned positional embeddings contribute parameters, whereas the fixed sinusoidal encodings of the original Transformer do not.
The total parameter count is the sum over all these components and varies with the specific architecture and hyperparameters chosen for the model; a back-of-the-envelope calculation is sketched below.
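As a rough worked example, the following sketch counts the dominant weight matrices of an encoder-decoder Transformer with the "base" hyperparameters of the original paper, ignoring biases, layer-norm parameters, and positional embeddings; the vocabulary size of 32,000 is an illustrative assumption:

```python
# Back-of-the-envelope parameter count for a standard encoder-decoder
# Transformer (weight matrices only; biases and layer norms omitted).
def transformer_params(vocab=32000, d_model=512, n_layers=6, d_ff=2048):
    attn = 4 * d_model * d_model       # W_Q, W_K, W_V, W_O projections
    ff = 2 * d_model * d_ff            # two linear layers in the FFN
    enc_layer = attn + ff              # self-attention + FFN
    dec_layer = 2 * attn + ff          # self-attention + cross-attention + FFN
    emb = vocab * d_model              # token embeddings (often tied input/output)
    return n_layers * (enc_layer + dec_layer) + emb

print(f"{transformer_params():,}")     # 60,424,192
```

This lands in the right ballpark of the roughly 65M parameters reported for the base model; the gap comes mainly from the actual shared-BPE vocabulary size, biases, and layer-norm parameters, which this sketch deliberately ignores.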
Chapter 03.05: Long Sequences: Transformer-XL
This chapter is about the Transformer-XL [1] and how it deals with the issue of long sequences. Transformer-XL is an extension of the original Transformer architecture designed to address the limitations of long-range dependency modeling in sequence-to-sequence tasks. It aims to solve the problem of capturing and retaining information over long sequences by introducing a segment-level recurrence mechanism, enabling the model to process sequences of arbitrary length without being constrained by fixed-length contexts or running into computational limitations. Additionally, Transformer-XL incorporates relative positional embeddings to better capture positional information across segments of varying lengths.
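The core recurrence idea can be sketched in a few lines (schematically: real Transformer-XL applies this per layer and combines it with relative positional embeddings, both omitted here; the function name is illustrative, not from the reference implementation):

```python
# Segment-level recurrence sketch: the previous segment's hidden states are
# cached (with gradients stopped) and prepended as extra attention context.
import torch

def process_segments(layer, segments, mem=None):
    outputs = []
    for seg in segments:                    # seg: (batch, seg_len, d_model)
        # Keys/values see [memory; current segment]; queries see only the segment.
        context = seg if mem is None else torch.cat([mem, seg], dim=1)
        out, _ = layer(seg, context, context, need_weights=False)
        outputs.append(out)
        # Cache this segment as memory for the next one, detached so that
        # gradients do not flow across segment boundaries.
        mem = seg.detach()
    return torch.cat(outputs, dim=1)

layer = torch.nn.MultiheadAttention(512, 8, batch_first=True)
segs = torch.randn(2, 512, 512).split(128, dim=1)   # four segments of length 128
print(process_segments(layer, segs).shape)          # torch.Size([2, 512, 512])
```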
Chapter 03.06: Efficient Transformers
Efficient Transformers are designed to mitigate the computational and memory requirements of standard transformer architectures, whose self-attention cost grows quadratically with sequence length and becomes prohibitive for large-scale datasets or resource-constrained environments. They aim to improve the scalability and efficiency of both training and inference. One common approach is to replace the standard self-attention mechanism with a more lightweight variant that reduces this complexity, either by approximating attention with low-rank matrices or by restricting attention to local or sparse regions of the sequence. Such approaches make transformers more practical for real-world applications where computational resources are limited.
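As a concrete example of one such sparsity pattern, the sketch below restricts attention to a local sliding window (as used, e.g., in Longformer); the mask convention follows torch.nn.MultiheadAttention, and the window size is an arbitrary illustrative choice:

```python
# Local (sliding-window) attention: each token may only attend to
# neighbours within a fixed window around its own position.
import torch

def local_attention_mask(seq_len, window):
    # mask[i, j] is True where attention is *blocked* (|i - j| > window),
    # matching the boolean attn_mask convention of nn.MultiheadAttention.
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() > window

attn = torch.nn.MultiheadAttention(512, 8, batch_first=True)
x = torch.randn(2, 1024, 512)
out, _ = attn(x, x, x, attn_mask=local_attention_mask(1024, window=128))
print(out.shape)  # torch.Size([2, 1024, 512])
```

Note that this masked variant still computes all n² attention scores; practical efficient implementations exploit the sparsity pattern to avoid materializing the full attention matrix in the first place.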