Chapter 6: Post-BERT Era II and using the Transformer
In this chapter we introduce further models from the post-BERT era, namely ELECTRA and XLNet. Additionally, we explore the idea of reformulating tasks in a text-to-text format and present the T5 model as a prime example.
-
Chapter 06.01: Post-BERT architectures
This chapter introduces two architectures from the post-BERT era, ELECTRA [1] and XLNet [2], both of which show how changing the pre-training objective yields new models. ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) trains a discriminator to distinguish the original tokens of a text sequence from replacement tokens produced by a small generator model, which makes pre-training more sample-efficient than masked language modeling (MLM). XLNet addresses the limitations of purely unidirectional and bidirectional language models with a permutation-based pre-training objective, the so-called permutation language modeling (PLM), which trains over different permutations of the factorization order of the input tokens and thereby captures bidirectional context.
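To make these two objectives concrete, the following minimal Python sketch (with made-up tokens; not the actual training code of either model) illustrates how ELECTRA's replaced-token-detection labels are derived and how XLNet samples a factorization order for permutation language modeling.

```python
import random

# --- ELECTRA: replaced-token detection (toy example, made-up tokens) ---
original  = ["the", "chef", "cooked", "the", "meal"]
# Suppose a small generator MLM filled in two masked positions like this:
corrupted = ["the", "chef", "cooked", "the", "soup"]   # "soup" replaces "meal"
# The discriminator labels every token: 1 = original, 0 = replaced.
labels = [int(c == o) for c, o in zip(corrupted, original)]
print(list(zip(corrupted, labels)))
# [('the', 1), ('chef', 1), ('cooked', 1), ('the', 1), ('soup', 0)]

# --- XLNet: permutation language modeling (toy example) ---
# Instead of always predicting left-to-right, XLNet samples a factorization
# order and predicts each token conditioned on the tokens preceding it in
# that sampled order.
tokens = ["New", "York", "is", "a", "city"]
order = random.sample(range(len(tokens)), len(tokens))  # e.g. [2, 0, 4, 1, 3]
for step, pos in enumerate(order):
    context = [tokens[p] for p in order[:step]]
    print(f"predict {tokens[pos]!r} given {context}")
```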
-
Chapter 06.02: Tasks as text-to-text problem
Reformulating various NLP tasks as text-to-text tasks aims to simplify model architectures and improve performance by treating every task as an instance of generating output text from input text. This addresses a shortcoming of BERT's original design, where each downstream task requires its own output layer and training objective, leading to a complex multitask setup. By unifying tasks under a single text-to-text framework, models can be trained more efficiently and generalize better across diverse tasks and domains.
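As an illustration, the sketch below serializes heterogeneous tasks into plain input/target text pairs. The task prefixes and examples are illustrative (loosely following the prefix convention popularized by T5), not verbatim training data.

```python
# Illustrative text-to-text formulations of different NLP tasks.
examples = [
    # (input text, target text)
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("stsb sentence1: The rhino grazed. sentence2: A rhino is grazing.", "3.8"),
    ("summarize: severe storms hit the region on Tuesday ...",
     "storms cause damage across the region"),
]

# Every task now has the same interface: text in, text out.
for source, target in examples:
    print(f"INPUT : {source}\nTARGET: {target}\n")
```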
-
Chapter 06.03: Text-to-Text Transfer Transformer
T5 (Text-To-Text Transfer Transformer) [1] unifies diverse natural language processing tasks by framing them all as text-to-text transformations, simplifying the model architecture and enabling flexible training across tasks. Input-output pairs for every task are formulated as text sequences, so the model learns to generate target text from source text regardless of the specific task. This facilitates multitask learning and transfer learning with a single, unified architecture.
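A minimal usage sketch, assuming the Hugging Face transformers library and the public t5-small checkpoint are available; the "translate English to German:" prefix follows the translation format described in the T5 paper.

```python
# Minimal sketch: running a pretrained T5 checkpoint via Hugging Face transformers.
# Assumes `pip install transformers sentencepiece torch` and internet access
# to download the public "t5-small" checkpoint.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is just text in, text out; here: English-to-German translation.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```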