Chapter 06.01: Post-BERT architectures

This chapter introduces two architectures from the post-BERT era, namely ELECTRA [1] and XLNet [2]. Both models arise from changing the pre-training approach. For ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), the basic idea is to train a discriminator model to distinguish the original tokens of a text sequence from replacement tokens produced by a generator model. This yields a more compute-efficient pre-training approach than masked language modeling (MLM), since the discriminator learns from every token in the sequence rather than only from the masked positions. For XLNet, the basic idea is to overcome the limitations of unidirectional and bidirectional language models through a permutation-based pre-training objective, the so-called permutation language modeling (PLM), which maximizes the likelihood of the sequence over all possible permutations of the factorization order and thereby captures bidirectional context while remaining autoregressive.
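As a compact sketch of the two objectives, written in the notation of the original papers [1, 2] (details follow in the lecture slides): ELECTRA jointly minimizes the generator's MLM loss and the discriminator's replaced-token-detection loss, while XLNet maximizes the expected autoregressive log-likelihood over random factorization orders.

ELECTRA (replaced token detection):
    \min_{\theta_G,\, \theta_D} \sum_{x \in \mathcal{X}} \mathcal{L}_{\text{MLM}}(x, \theta_G) + \lambda \, \mathcal{L}_{\text{Disc}}(x, \theta_D)

XLNet (permutation language modeling):
    \max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_\theta\!\left( x_{z_t} \mid x_{z_{<t}} \right) \right]

Here \mathcal{L}_{\text{MLM}} is the generator's masked language modeling loss, \mathcal{L}_{\text{Disc}} is the discriminator's per-token binary cross-entropy for predicting whether a token is original or replaced, \lambda is a hyperparameter weighting the two terms, and \mathcal{Z}_T denotes the set of all permutations of the index sequence of length T.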

Lecture Slides

References

Additional Resources