Chapter 5: Post-BERT Era
Creating BERT-based models with modifications to pretraining involves adjusting the pretraining objectives or the architecture to suit specific tasks or domains. This process typically begins with designing custom pretraining objectives, or modifying existing ones, to capture domain-specific characteristics or to improve performance on targeted tasks. Such modified objectives can include variations of masked language modeling (MLM), next sentence prediction (NSP), or other self-supervised learning tasks tailored to the target domain. After pretraining, the model is fine-tuned on downstream tasks using task-specific data and objectives, enabling it to adapt its learned representations to the requirements of those tasks. In this chapter you will learn about three models that modify the original BERT recipe: RoBERTa [1], ALBERT [2] and DistilBERT [3].
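To make the masked language modeling objective concrete, here is a minimal sketch of the standard BERT-style corruption scheme: roughly 15% of the input tokens are selected, and of those 80% are replaced with the [MASK] token, 10% with a random token, and 10% are left unchanged; the model is then trained to predict the original tokens at the selected positions. The function below works on plain token-id lists; the mask and vocabulary ids are those of the standard BERT WordPiece vocabulary, and the special-token handling a real data pipeline would need is omitted.

```python
import random

MASK_ID = 103          # [MASK] id in the standard BERT WordPiece vocabulary
VOCAB_SIZE = 30522     # size of that vocabulary

def mask_for_mlm(token_ids, mask_prob=0.15, seed=None):
    """Return (corrupted_ids, labels) for BERT-style masked language modeling.

    labels[i] holds the original token id where a prediction is required
    and -100 (the usual 'ignore' index) everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)

    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:            # select ~15% of positions
            labels[i] = tok                     # predict the original token here
            r = rng.random()
            if r < 0.8:                         # 80%: replace with [MASK]
                corrupted[i] = MASK_ID
            elif r < 0.9:                       # 10%: replace with a random token
                corrupted[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

The pretraining loss is then a cross-entropy computed only at the positions whose label is not -100.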
References
- [1] Liu et al., 2019: RoBERTa: A Robustly Optimized BERT Pretraining Approach
- [2] Lan et al., 2020: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
- [3] Sanh et al., 2020: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Chapter 05.01: Implications for future work & BERTology
BERT (Bidirectional Encoder Representations from Transformers) has significantly impacted research in natural language processing (NLP) by introducing contextualized word embeddings and by demonstrating the effectiveness of large-scale pretraining followed by fine-tuning on downstream tasks. Prior to BERT, models like word2vec and fastText produced static word embeddings that lacked context, limiting their ability to capture the nuances of language. BERT’s bidirectional pretraining allows it to capture rich contextual information, leading to substantial improvements across a wide range of NLP tasks. The widespread adoption of BERT also sparked a new area of research known as “BERTology,” which aims to understand the inner workings of transformer-based models like BERT through empirical analysis, ablation studies, and probing experiments. This line of work has yielded deeper insights into the mechanisms underlying these models and has inspired further innovations in model architectures, pretraining objectives, and fine-tuning strategies.
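As an illustration of the kind of probing experiment BERTology relies on, the sketch below freezes a pretrained BERT model, extracts hidden states from one intermediate layer, and fits a simple linear classifier on top of them. The choice of bert-base-uncased, layer 8, and the tiny toy labeling task are illustrative assumptions, not taken from any particular study; the point is only that a lightweight probe can reveal what linguistic information a layer encodes.

```python
# Minimal linear-probing sketch: how well does one BERT layer encode
# a simple token-level property? (Illustrative setup, not a specific study.)
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

# Toy hypothetical data: one binary label per word (e.g. "is this word a verb?");
# real probes use annotated corpora.
sentences = ["the cat sleeps on the mat", "she quickly reads the long report"]
word_labels = [[0, 0, 1, 0, 0, 0], [0, 0, 1, 0, 0, 0]]

LAYER = 8  # probe an intermediate layer (assumption; any layer can be probed)
features, labels = [], []
with torch.no_grad():
    for sent, labs in zip(sentences, word_labels):
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).hidden_states[LAYER][0]   # (seq_len, hidden_size)
        # map word-level labels onto word pieces, skipping special tokens
        for pos, wid in enumerate(enc.word_ids()):
            if wid is not None:
                features.append(hidden[pos].numpy())
                labels.append(labs[wid])

# The probe itself is deliberately simple: a linear classifier on frozen features.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```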
Chapter 05.02: BERT-based architectures
For RoBERTa (a Robustly Optimized BERT Pretraining Approach), the basic idea is to improve on BERT by training on more data with more training steps, larger batch sizes, and longer sequences, by removing the next sentence prediction (NSP) objective, and by using dynamic masking during pretraining. With dynamic masking, the masking pattern is sampled anew every time a sequence is fed to the model (for example by re-applying a masking routine like the MLM sketch in the chapter introduction), so the model sees different masked tokens across epochs instead of a single fixed pattern produced during preprocessing. Together, these changes improve the model’s ability to capture contextual information and its performance on a variety of natural language understanding tasks.

For ALBERT (A Lite BERT), the basic idea is to reduce BERT’s memory footprint and computational cost while maintaining or improving performance. ALBERT achieves this through two parameter-reduction techniques: a factorized embedding parameterization, which decomposes the large vocabulary embedding matrix into two much smaller matrices, and cross-layer parameter sharing, where all transformer layers reuse the same weights. These changes allow ALBERT to match or exceed BERT’s performance while being far more parameter-efficient and scalable, making it suitable for a wider range of applications and deployment scenarios; the factorized embedding parameterization is sketched below.
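To illustrate the effect of the factorized embedding parameterization, the sketch below compares the number of embedding parameters in a BERT-style model, where the vocabulary is embedded directly in the hidden size H, with an ALBERT-style factorization through a much smaller embedding size E. The dimensions used (V = 30,000, H = 768, E = 128) correspond to the base configuration described in the ALBERT paper and are used here purely for illustration.

```python
import torch.nn as nn

V, H, E = 30000, 768, 128   # vocab size, hidden size, small embedding size

# BERT-style: embed tokens directly in the hidden size H -> V * H parameters.
bert_style = nn.Embedding(V, H)

# ALBERT-style: factorize into a small embedding followed by an up-projection
# -> V * E + E * H parameters.
albert_style = nn.Sequential(
    nn.Embedding(V, E),            # V x E lookup table
    nn.Linear(E, H, bias=False),   # E x H projection up to the hidden size
)

def num_params(module):
    return sum(p.numel() for p in module.parameters())

print("BERT-style embedding params:  ", num_params(bert_style))    # 23,040,000
print("ALBERT-style embedding params:", num_params(albert_style))  #  3,938,304
```

Cross-layer parameter sharing shrinks the encoder further still, since one set of transformer weights is reused in every layer.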
Chapter 05.03: Model distillation
For model distillation, the basic idea is to transfer knowledge from a large, computationally expensive teacher model (e.g., BERT) to a smaller, more efficient student model (e.g., DistilBERT). In the case of DistilBERT, a large BERT model is first pretrained on a large text corpus using the standard pretraining objectives. The knowledge learned by this teacher is then transferred to a student with fewer transformer layers by training the student to mimic the teacher’s behavior, typically by minimizing the discrepancy between the probability distributions the two models predict for the same input. Because the student has fewer layers and parameters from the outset, the result is a smaller and more efficient model that retains much of the teacher’s performance, enabling faster inference and deployment in resource-constrained environments. DistilBERT in particular achieves a substantial reduction in model size and computational cost compared to BERT while maintaining competitive performance across a range of NLP tasks.
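The sketch below shows the generic soft-target distillation loss that this kind of knowledge transfer is built on: the student is trained to match the teacher’s temperature-softened output distribution, usually in combination with the ordinary hard-label loss. DistilBERT’s actual training objective applies this idea to the MLM output distribution and additionally includes the MLM loss and a cosine loss between teacher and student hidden states; the temperature and weighting below are illustrative values, not the ones used in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine a soft-target distillation term with the usual hard-label loss.

    student_logits, teacher_logits: (batch, num_classes) raw scores
    labels: (batch,) ground-truth class indices
    temperature, alpha: illustrative hyperparameters (assumptions).
    """
    # Soft targets: KL divergence between temperature-softened distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard targets: standard cross-entropy against the true labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce

# Example with random tensors standing in for teacher and student outputs.
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```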