Chapter 03.01: A universal deep learning architecture

Transformers have been adapted and applied to many domains and tasks beyond traditional sequence-to-sequence problems in NLP. This chapter highlights a few models that carry the transformer architecture into other domains:

- Vision Transformer (ViT) [1]: applies the transformer architecture to image classification by treating an image as a sequence of patches, achieving performance competitive with convolutional neural networks (CNNs).
- CLIP [2]: connects images and text through a shared embedding space, enabling tasks such as zero-shot image classification and image-text retrieval.
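To make the ViT idea concrete, the sketch below shows how an image becomes a token sequence that a standard transformer can consume: the image is split into non-overlapping patches, each patch is flattened and linearly projected, and a class token is prepended. This is an illustrative NumPy sketch, not the reference implementation; the projection and class token are random here, whereas in ViT they are learned parameters.

```python
import numpy as np

def image_to_tokens(image, patch_size=16, d_model=64, rng=None):
    """Split an (H, W, C) image into non-overlapping patches and
    project each flattened patch to a d_model-dimensional token."""
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    # Rearrange into (num_patches, patch_size * patch_size * C)
    patches = (image
               .reshape(H // patch_size, patch_size, W // patch_size, patch_size, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch_size * patch_size * C))
    # Linear patch embedding (learned in ViT; random for illustration)
    W_proj = rng.normal(size=(patches.shape[1], d_model))
    tokens = patches @ W_proj
    # Prepend a [CLS] token, which ViT uses for classification
    cls = rng.normal(size=(1, d_model))
    return np.concatenate([cls, tokens], axis=0)

img = np.zeros((224, 224, 3))
tokens = image_to_tokens(img)
print(tokens.shape)  # (197, 64): 14 * 14 patches + 1 class token
```

The resulting sequence of 197 tokens is then fed to a transformer encoder exactly as word embeddings would be in NLP, which is what makes the architecture portable across domains.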

Lecture Slides

References